Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data

Rozinek, Ondřej; Borkovcová, Monika; Mareš, Jan

doi:10.1007/978-3-031-60328-0_18

Publication:
Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data

Conference object, open access, peer-reviewed, postprint

Date

2024

Authors

Rozinek, Ondřej

Borkovcová, Monika

Mareš, Jan

Publisher

Springer Nature Switzerland AG

Abstract

Record linkage is the process of matching records from multiple data sources that refer to the same entities. When applied to a single data source, this process is known as deduplication. With the increasing size of data sources, recently referred to as big data, the complexity of the matching process has become one of the major challenges for record linkage and deduplication. In recent decades, several blocking, indexing and filtering techniques have been developed. Their purpose is to reduce the number of record pairs to be compared by removing obvious non-matching pairs in the deduplication process, while maintaining high matching quality. Existing algorithms and traditional techniques remain inefficient: they still lose a significant proportion of true matches when removing comparison pairs. This paper proposes more efficient algorithms for removing non-matching pairs, with an explicitly proven mathematical lower bound on Fuzzy Jaccard Similarity, a recently adopted state-of-the-art approximate string matching method. The algorithm is also much more efficient in the classification step, which uses density-based spatial clustering of applications with noise (DBSCAN) and runs in log-linear time complexity O(|E| log(|E|)).
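
To illustrate the general idea of filtering out obvious non-matching pairs before an expensive comparison, the following is a minimal sketch only. It uses plain Jaccard similarity over character q-grams with the standard size filter (the Jaccard similarity of two sets cannot exceed the ratio of the smaller set size to the larger one). It is not the paper's Fuzzy Jaccard Similarity or its proven bound, and the function names, the padding character and the threshold are assumptions chosen for the example.

```python
from itertools import combinations


def qgrams(text, q=2):
    """Return the set of character q-grams of a padded, lower-cased string."""
    # '#' padding is an illustrative choice so that word boundaries count.
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Plain Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_pairs(records, threshold=0.5, q=2):
    """Return index pairs whose q-gram Jaccard similarity reaches threshold.

    Pairs are first pruned with a cheap size filter: since
    J(A, B) <= min(|A|, |B|) / max(|A|, |B|), a pair whose size ratio is
    below the threshold cannot be a match and is skipped without
    computing the intersection.
    """
    grams = [qgrams(r, q) for r in records]
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        small, large = sorted((len(grams[i]), len(grams[j])))
        if large and small / large < threshold:
            continue  # pruned by the size filter, no true match is lost
        if jaccard(grams[i], grams[j]) >= threshold:
            pairs.append((i, j))
    return pairs


records = ["Jan Novak", "Jan Novák", "Completely different text here"]
print(candidate_pairs(records))
```

With these three hypothetical records, only the pair (0, 1) survives: the third record is discarded by the size filter alone. The sketch still enumerates all pairs, so it shows the pruning principle rather than a scalable join.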

Keywords

record deduplication, Q-gram filter, record linkage, entity resolution, similarity space, bipartite matching, similarity join

Permanent identifier

https://hdl.handle.net/10195/85970

Collections

Publikační činnost akademických pracovníků UPCE / UPCE Research Outputs
Publikační činnost akademických pracovníků FEI / FEI Research Outputs
