Fast and Precise Convolutional Jaro and Jaro-Winkler Similarity

Rozinek, Ondřej; Mareš, Jan

doi:10.23919/FRUCT61870.2024.10516360

Publikace:
Fast and Precise Convolutional Jaro and Jaro-Winkler Similarity

Konferenční objektOmezený přístuppeer-reviewedpublished

Soubory

fast_and_precise.pdf (1.07 MB)

Datum

2024

Autoři

Rozinek, Ondřej

Mareš, Jan

Nakladatel

IEEE (Institute of Electrical and Electronics Engineers)

Abstrakt

In the domain of character-based approximate string matching, edit distances such as Levenshtein have remained predominant despite their quadratic time complexity. This reality has prompted the adoption of more efficient metrics like Jaro and Jaro-Winkler. However, these methods often overlook the significance of character order within the matching window, which can adversely affect accuracy.For the first time, we introduce a novel class of character-based approximate string matching algorithms that leverage a convolutional kernel, surpassing the performance of existing state-of-the-art unsupervised character-based approximate string matching algorithms. This paper presents Convolutional Jaro (ConvJ) and Convolutional Jaro-Winkler (ConvJW), innovative similarity metrics designed to overcome these shortcomings. ConvJ and ConvJW utilize a convolutional approach with Gaussian weighting to effectively capture the positional proximity of matching characters, resulting in a more precise similarity evaluation. This method not only achieves computational efficiency comparable to that of Jaro and Jaro-Winkler but also surpasses the state-of-the-art in terms of F1-score, demonstrating faster execution times compared to the conventional Jaro and Jaro-Winkler implementations across various datasets.Our extensive experimental analysis highlights the exceptional performance of ConvJ and ConvJW across a range of datasets. Remarkably, ConvJ exhibits a 7x faster execution time than the fast Jaro implementation and exceeds the state-of-the-art F1-score by a significant margin of 10% more than Jaro. By setting a new benchmark in unsupervised character-based approximate string matching, our research shows the new way for future exploration and development in this field. The ConvJ and ConvJW algorithms, characterized by their quasilinear time complexity and improved accuracy, provide a solid foundation for the advancement of string matching techniques. These developments hold promise for a broad spectrum of applications in data mining, bioinformatics, and related areas.

Klíčová slova

Computational efficiency, Convolution, Data mining, String searching algorithms, Výpočetní účinnost, Konvoluce, Dolování dat, Algoritmy pro vyhledávání řetězců

Permanentní identifikátor

https://hdl.handle.net/10195/85989

Kolekce

Publikační činnost akademických pracovníků UPCE / UPCE Research Outputs
Publikační činnost akademických pracovníků FEI / FEI Research Outputs

Zobrazit úplný záznam

Publikace:
Fast and Precise Convolutional Jaro and Jaro-Winkler Similarity

Soubory

Datum

Autoři

Název časopisu

ISSN časopisu

Název svazku

Nakladatel

Výzkumné projekty

Organizační jednotky

Číslo časopisu

Abstrakt

Popis

Klíčová slova

Citace

Permanentní identifikátor

Kolekce

Endorsement

Review

Supplemented By

Referenced By

Publikace: Fast and Precise Convolutional Jaro and Jaro-Winkler Similarity

Soubory

Datum

Autoři

Název časopisu

ISSN časopisu

Název svazku

Nakladatel

Výzkumné projekty

Organizační jednotky

Číslo časopisu

Abstrakt

Popis

Klíčová slova

Citace

Permanentní identifikátor

Kolekce

Endorsement

Review

Supplemented By

Referenced By

Publikace:
Fast and Precise Convolutional Jaro and Jaro-Winkler Similarity