Date: 4 February 2020, 11:00 to 12:00
Location: Room 0.17, Pavilhão de Informática II, IST, Alameda
Duplicate detection is concerned with identifying pairs of attributes/records that refer to the same real-world object, and is thus a fundamental process for ensuring data quality in databases. Existing methods for detecting duplicate attributes can leverage heuristic string similarity measures based on characters or small character sequences, phonetic encoding techniques that match strings based on the way they sound, or hybrid techniques that combine different approaches. However, these methods all rely on common sub-strings to establish similarity, and they often fail to capture the character replacements that occur in duplicate attributes due to transliterations or the use of different languages and/or alphabets. This work follows up on a recent proposal by Santos et al. regarding string matching using deep neural networks, which tackles the aforementioned challenges by leveraging recurrent neural units for modeling sequential data and building semantic representations for the input strings. We consider several alternative neural architectures, e.g., combining recurrent units with attention or pooling operations, or based on the Transformer model. The different approaches are evaluated using datasets describing collections of person names, organizations, or geographical locations. The results show that the neural models outperform standard string similarity measures on all datasets, without requiring major tuning of the network parameters. Models trained on a specific domain were also shown to generalize to other domains (e.g., models trained on a dataset of person names perform competitively when evaluated on pairs of organization names), although with a significant impact on performance.
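To illustrate the limitation of sub-string-based measures mentioned in the abstract, the following is a minimal sketch of one such heuristic: Jaccard similarity over character bigrams. It is not the method evaluated in this work; the function names and the example string pairs are assumptions chosen for illustration and are not taken from the evaluation datasets.

```python
def char_ngrams(text: str, n: int = 2) -> set:
    """Return the set of character n-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Two strings with no n-grams at all are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical example pairs (illustrative only).
# Spelling variants in the same alphabet share most of their bigrams.
print(jaccard_similarity("Lisboa", "Lisbon"))
# The same place name written in another alphabet shares no bigrams,
# so the measure reports no similarity at all.
print(jaccard_similarity("Moscow", "Москва"))
```

The first pair shares four of six distinct bigrams, while the second pair shares none because the two strings have no characters in common. Pairs of this second kind are what a learned representation of the input strings is meant to handle.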
Category: Research Topics
Candidate: Luís Pedro Pires Borges
Supervisor: Prof. Bruno Emanuel da Graça Martins