Publication: BipartiteJoin: Optimal Similarity Join for Fuzzy Bipartite Matching
Conference object, open access, peer-reviewed, postprint
Publisher
Springer Nature Switzerland AG
Abstract
Set similarity join, crucial for data cleaning, integration, and recommendation systems, identifies pairs of sets whose similarity exceeds a given threshold. Our approach combines a count Q-gram filter with maximum weighted bipartite matching, balancing accuracy and efficiency. The Q-gram filter, based on the relationship between Q-gram similarity and edit distance, reduces the number of comparisons and operates in constant time on a pre-built index. This enables real-time processing, as only a minimal number of pairs are verified through Fuzzy Bipartite Matching, significantly enhancing the efficiency of similarity joins.
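The filter-then-verify idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the function names, the unpadded Q-gram count bound (strings within k edits share at least max(|s|, |t|) - q + 1 - k*q Q-grams), the fuzzy Jaccard normalization of the matching weight, and a brute-force matching that only suits very small sets. The sketch also compares pairs directly instead of using the pre-built index that the abstract mentions.

```python
from collections import Counter
from itertools import permutations


def qgrams(s, q=2):
    """Multiset of overlapping q-grams of s (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s, t, k, q=2):
    """Necessary condition for edit_distance(s, t) <= k.

    One edit operation affects at most q q-grams, so strings within
    k edits share at least max(|s|, |t|) - q + 1 - k*q q-grams.
    """
    required = max(len(s), len(t)) - q + 1 - k * q
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= required


def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def element_similarity(s, t, k=1, q=2):
    """Normalized edit similarity; 0 if the filter or the bound k rejects the pair."""
    if not passes_count_filter(s, t, k, q):
        return 0.0  # cheap rejection, no edit distance computed
    d = edit_distance(s, t)
    if d > k:
        return 0.0
    return 1.0 - d / max(len(s), len(t), 1)


def fuzzy_set_similarity(a, b, k=1, q=2):
    """Assumed fuzzy Jaccard: max matching weight / (|a| + |b| - weight)."""
    small, large = sorted((list(a), list(b)), key=len)
    if not small:
        return 0.0
    w = [[element_similarity(x, y, k, q) for y in large] for x in small]
    # Brute-force maximum weighted bipartite matching (tiny sets only).
    best = max(
        sum(w[i][j] for i, j in enumerate(p))
        for p in permutations(range(len(large)), len(small))
    )
    return best / (len(small) + len(large) - best)


if __name__ == "__main__":
    print(fuzzy_set_similarity({"john", "smith"}, {"jon", "smith"}))
```

For realistic set sizes, the brute-force step would be replaced by a polynomial-time assignment solver such as the Hungarian algorithm; the filter logic stays the same.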
Keywords
similarity join, Q-gram filter, record linkage, entity resolution, similarity space, bipartite matching