Digitální knihovna UPCE přechází na novou verzi. Omluvte prosím případné komplikace. / The UPCE Digital Library is migrating to a new version. We apologize for any inconvenience.

Publikace:
Fast and Precise Convolutional Jaro and Jaro-Winkler Similarity

Konferenční objektOmezený přístuppeer-reviewedpublished
Načítá se...
Náhled

Datum

Název časopisu

ISSN časopisu

Název svazku

Nakladatel

IEEE (Institute of Electrical and Electronics Engineers)

Výzkumné projekty

Organizační jednotky

Číslo časopisu

Abstrakt

In the domain of character-based approximate string matching, edit distances such as Levenshtein have remained predominant despite their quadratic time complexity. This reality has prompted the adoption of more efficient metrics like Jaro and Jaro-Winkler. However, these methods often overlook the significance of character order within the matching window, which can adversely affect accuracy.For the first time, we introduce a novel class of character-based approximate string matching algorithms that leverage a convolutional kernel, surpassing the performance of existing state-of-the-art unsupervised character-based approximate string matching algorithms. This paper presents Convolutional Jaro (ConvJ) and Convolutional Jaro-Winkler (ConvJW), innovative similarity metrics designed to overcome these shortcomings. ConvJ and ConvJW utilize a convolutional approach with Gaussian weighting to effectively capture the positional proximity of matching characters, resulting in a more precise similarity evaluation. This method not only achieves computational efficiency comparable to that of Jaro and Jaro-Winkler but also surpasses the state-of-the-art in terms of F1-score, demonstrating faster execution times compared to the conventional Jaro and Jaro-Winkler implementations across various datasets.Our extensive experimental analysis highlights the exceptional performance of ConvJ and ConvJW across a range of datasets. Remarkably, ConvJ exhibits a 7x faster execution time than the fast Jaro implementation and exceeds the state-of-the-art F1-score by a significant margin of 10% more than Jaro. By setting a new benchmark in unsupervised character-based approximate string matching, our research shows the new way for future exploration and development in this field. The ConvJ and ConvJW algorithms, characterized by their quasilinear time complexity and improved accuracy, provide a solid foundation for the advancement of string matching techniques. These developments hold promise for a broad spectrum of applications in data mining, bioinformatics, and related areas.

Popis

Klíčová slova

Computational efficiency, Convolution, Data mining, String searching algorithms, Výpočetní účinnost, Konvoluce, Dolování dat, Algoritmy pro vyhledávání řetězců

Citace

Permanentní identifikátor

Endorsement

Review

Supplemented By

Referenced By