Neural Methods for Approximate String Matching and Duplicate Detection

Date: February 04, 2020, 11:00AM to 12:00PM
Location: Room 0.17, Pavilhão de Informática II, IST, Alameda

Duplicate detection is concerned with identifying pairs of attributes/records that refer to the same real-world object, and is therefore a fundamental process for ensuring data quality in databases. Existing methods to detect duplicate attributes can leverage heuristic string similarity measures based on characters or small character sequences, phonetic encoding techniques that match strings based on the way they sound, or hybrid techniques that combine different approaches. However, these methods all rely on common sub-strings to establish similarity, and they often fail to capture the character replacements that arise in duplicate attributes from transliterations or the use of different languages and/or alphabets. This work follows a recent proposal by Santos et al. on string matching with deep neural networks, which tackles these challenges by leveraging recurrent neural units to model sequential data and build semantic representations of the input strings. We consider several alternative neural architectures, e.g., combining recurrent units with attention or pooling operations, or based on the Transformer model. The different approaches are evaluated on datasets describing collections of person names, organizations, or geographical locations. The results show that the neural models achieve superior results on all datasets when compared to standard string similarity measures, without the need for major tuning of the network parameters. Models trained on a specific domain were also shown to generalize to other domains (e.g., models trained on a dataset of person names perform competitively when evaluated on pairs of organization names), although with a significant impact on performance.
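The limitation of sub-string-based measures can be illustrated with a minimal sketch. The code below is not taken from the thesis: it assumes a simple Jaccard similarity over character bigrams as a stand-in for the heuristic measures mentioned above, and the function names `char_ngrams` and `jaccard` are hypothetical.

```python
def char_ngrams(text, n=2):
    """Return the set of lowercase character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard similarity between the character n-gram sets of two strings.

    Illustrative stand-in for a heuristic string similarity measure;
    not the measure used in the thesis.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Both strings are shorter than n: treat them as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # Spelling variants in the same alphabet share most bigrams.
    print(jaccard("Lisboa", "Lisbon"))
    # A transliterated pair shares no bigrams at all.
    print(jaccard("Moscow", "Москва"))
```

In this sketch, "Lisboa" and "Lisbon" share four of six distinct bigrams, while "Moscow" and "Москва" share none and score 0.0 even though they name the same city. This is the kind of case that learned representations are intended to handle.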

Candidate: Luís Pedro Pires Borges
Advisor: Prof. Bruno Emanuel da Graça Martins