Towards Effective and Effortless Data Cleaning: from Automatic Approaches to User Involvement

Date: 4 December 2019, 09:30 to 11:30
Location: Room 0.17, Pavilhão de Informática II, IST, Alameda

In recent years, driven by the rapid spread of data and by low-cost sensors, data has been created at a very fast pace. Data is continuously produced, scrutinized, and increasingly used to make important decisions. Automated recommendations and decision-making based on these data can lead to enormous societal benefits. Conversely, the existence of large amounts of data increases the probability that data quality problems occur, with a negative impact on data-based decisions. Data quality problems include errors, missing values, values with doubtful meaning, duplicates, and inconsistencies. A data cleaning process plays an important role in correcting these problems.
The existence of acronyms with no expansion in text is a data quality problem that has received little attention from the data cleaning community. In fact, there is no acronym expansion software system available that can automatically find the expansions for the acronyms found in textual documents. Furthermore, when an acronym has more than one expansion available, the criterion applied to select the right expansion is usually the cosine similarity between a representation of the document containing the acronym and a representation of each document containing an expansion. We claim that the selection of the right expansion can be improved with other term and document representations and with machine learning techniques that replace the cosine similarity.
A data cleaning process is usually iterative, because it may need to be repeatedly executed and refined in order to produce the highest possible data quality. Moreover, due to the specificity of some data quality problems and the inability of data quality rules to cover all data cleaning problems, a user often has to be actively involved in the execution of a data cleaning program by manually repairing data. However, there is no framework that supports user involvement in such an iterative data cleaning process. In addition, tools for data cleaning that involve the user in the process have not been evaluated with real users to assess the user effort required to design data cleaning programs and to manually repair data.
Data cleaning processes often require user intervention in the form of manual data repairs. Although manual data repair is tedious and infeasible on large datasets, it clearly improves the data quality obtained by a data cleaning process. Therefore, in order to achieve high levels of quality in large datasets, it is essential to provide automatic solutions that can efficiently replace user involvement in manual data repair.
In this PhD, we will work on new approaches that provide an effective and effortless data cleaning process. In particular, we propose to: (i) build an acronym expansion system with novel disambiguation approaches; (ii) improve a data cleaning framework with support for user involvement during an iterative data cleaning process, and conduct an experimental comparison of tools used for data cleaning with real and simulated users; and (iii) develop a data cleaning framework with automatic approaches that learn from, and eventually replace, user involvement in manual data correction.
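To make the cosine-similarity criterion mentioned in the abstract concrete, the following is a minimal sketch of how an expansion might be selected for an ambiguous acronym. It is not the candidate's system: the bag-of-words representation, the function names (`bag_of_words`, `cosine_similarity`, `select_expansion`), and the toy texts are all assumptions made for illustration, and the sketch covers only the baseline criterion, not the representations or machine learning techniques the abstract proposes as replacements.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased term-frequency vector (assumed representation)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def select_expansion(document, candidates):
    """Pick the expansion whose reference text is closest to the document.

    `candidates` maps each candidate expansion to a text in which that
    expansion occurs (a hypothetical input format).
    """
    doc_vec = bag_of_words(document)
    return max(
        candidates,
        key=lambda exp: cosine_similarity(doc_vec, bag_of_words(candidates[exp])),
    )


# Toy data invented for illustration only.
document = "The IR system ranks documents for each search query."
candidates = {
    "infrared": "The infrared sensor measures heat emitted by the camera target.",
    "information retrieval": (
        "An information retrieval system ranks documents that match a search query."
    ),
}

print(select_expansion(document, candidates))
```

On this toy input the document shares several terms with the second reference text and only a stop word with the first, so the second expansion scores higher. A raw term-frequency representation is easily misled by stop words and by vocabulary mismatch, which is the kind of limitation that motivates exploring other representations.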

Category: Prova de CAT (CAT exam)
Candidate: João Pedro Lebre Magalhães Pereira
Advisor: Prof. Helena Isabel de Jesus Galhardas