Towards Effective and Effortless Data Cleaning: from Automatic Approaches to User Involvement
In recent years, the rapid spread of data and the availability of cheap sensors have led to data being created at a very fast pace. Data is continuously produced, scrutinized, and increasingly used to make important decisions. Automated recommendations and decision-making based on this data can lead to enormous societal benefits. Conversely, the existence of large amounts of data increases the probability of data quality problems that negatively affect data-based decisions. Data quality problems include errors, missing values, values with doubtful meaning, duplicates, and inconsistencies. A data cleaning process plays an important role in correcting these problems. The existence of acronyms with no expansion in text is a data quality problem that has received little attention from the data cleaning community. In fact, there is no acronym expansion software system available that can automatically find the expansions for the acronyms found in textual documents. Furthermore, when an acronym has more than one candidate expansion, the criterion usually applied to select the right one is the cosine similarity between a representation of the document containing the acronym and a representation of each document containing an expansion. We claim that the selection of the right expansion can be improved by using other term and document representations, and by replacing the cosine similarity with machine learning techniques. A data cleaning process is usually iterative because it may need to be repeatedly executed and refined in order to produce the highest possible data quality. Moreover, due to the specificity of some data quality problems and the inability of data quality rules to cover all data cleaning problems, a user often has to be actively involved in the execution of a data cleaning program by manually repairing data. However, there is no framework that supports user involvement in such an iterative data cleaning process.
Moreover, tools used for data cleaning that somehow involve the user in the process have not been evaluated with real users to assess the user effort required when designing data cleaning programs and manually repairing data. Data cleaning processes often require user intervention in the form of manual data repairs. Although manual data repair is tedious and infeasible on large datasets, it clearly improves data quality. Therefore, to achieve high levels of quality in large datasets, it is essential to provide automatic solutions that can efficiently replace the user in manually repairing data. In this PhD, we will work on new approaches that provide an effective and effortless data cleaning process. In particular, we propose to: (i) build an acronym expansion system with novel disambiguation approaches; (ii) improve a data cleaning framework with support for user involvement during an iterative data cleaning process and conduct an experimental comparison of tools used for data cleaning with real and simulated users; and (iii) develop a data cleaning framework with automatic approaches that learn from, and eventually replace, the user involvement when manually correcting data.
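As an illustration of the baseline criterion described in the abstract, the following minimal sketch selects an expansion by cosine similarity. It assumes a plain bag-of-words representation with whitespace tokenization; the function names and the sample texts are hypothetical and are not taken from the candidate's system.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as lower-cased token counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_expansion(document, candidates):
    """Return the expansion whose context text is most similar to the document.

    `candidates` maps each expansion to a text in which that expansion occurs.
    """
    doc_vec = bag_of_words(document)
    return max(
        candidates,
        key=lambda exp: cosine_similarity(doc_vec, bag_of_words(candidates[exp])),
    )


# Hypothetical example data.
candidates = {
    "Natural Language Processing": "parsing text corpora with language models and tokenizers",
    "Neuro-Linguistic Programming": "therapy coaching and behaviour change techniques",
}
document = "We train language models on large text corpora using NLP"
best = pick_expansion(document, candidates)
```

The abstract's claim is that richer term and document representations, together with a learned classifier in place of `cosine_similarity`, can improve on this kind of selection.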
access_time December 04, 2019 at 09:30AM
place Sala 0.17, Pavilhão de Informática II, IST, Alameda
local_offer CAT exam
person Candidate: João Pedro Lebre Magalhães Pereira
supervisor_account Advisor: Prof.ª Helena Isabel de Jesus Galhardas
LM2F: A Life-Cycle Model Maintenance Framework for Co-Evolving Enterprise A. Meta-Models and Models
Enterprise Architecture (EA) models are tools that capture the concepts and relationships that together describe the different enterprise domains. An EA meta-model defines a set of constructs and rules that explicitly describe how to build an EA model. Over time, to keep up with the need to capture a more complex reality in their EA model, enterprises need to evolve their EA meta-model and, consequently, update the existing EA model. However, the task of maintaining the conformance of the EA model when the EA meta-model changes is time-consuming and error-prone due to the complexity of the EA model and the temporal dependencies among the model entities. Furthermore, despite existing research efforts on EA model maintenance automation and, in the model-driven engineering field, on meta-model and model co-evolution, there is still a gap in the state of the art when it comes to uniting both lines of research to address this problem. In this thesis, we present the Life-Cycle Model Maintenance Framework (LM2F) for co-evolving EA models driven by a set of changes to the EA meta-model. LM2F was developed, applied, and validated using the Design-Science Research method. The framework is composed of two building blocks. The first building block specifies a life-cycle temporal pattern based on the principles of life-cycle modelling. The second building block presents a catalogue of operators that the modeller can use to update the EA model automatically. LM2F aims to reduce both the modelling time required to update the EA model when the EA meta-model changes and the error-proneness associated with manual modelling, while enabling temporal model analysis: thanks to its life-cycle property, it provides past, current, and future snapshots of the enterprise’s EA. We developed an implementation of LM2F as a combination of two prototypes: a standalone user interface and a library integrated within a proprietary EA management tool.
We applied LM2F to two use cases: a co-evolution scenario using the EA of a fictitious ArchiSurance enterprise and a co-evolution scenario using the EA of a real Portuguese organisation in the energy industry. The framework was validated using an approach for qualitative evaluation in information systems.
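To make the two building blocks more concrete, the sketch below shows one possible reading of them: entities carry life-cycle dates so that a snapshot can be taken at any point in time, and a co-evolution operator rewrites the model after a meta-model change. The `Entity` class, the `snapshot` and `rename_type` functions, and the sample data are all hypothetical; they are not LM2F's actual pattern or operator catalogue.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    """A model entity with an assumed life-cycle interval [begin, end)."""
    name: str
    type_name: str
    begin: date
    end: Optional[date] = None  # None means the entity has no planned end


def snapshot(model: List[Entity], when: date) -> List[Entity]:
    """Return the entities that exist at the given date."""
    return [
        e for e in model
        if e.begin <= when and (e.end is None or when < e.end)
    ]


def rename_type(model: List[Entity], old: str, new: str) -> List[Entity]:
    """Hypothetical operator: propagate a meta-model type rename to the model."""
    return [replace(e, type_name=new) if e.type_name == old else e for e in model]


# Hypothetical example data.
model = [
    Entity("CRM", "Application", date(2015, 1, 1), date(2019, 1, 1)),
    Entity("ERP", "Application", date(2018, 6, 1)),
]
past = snapshot(model, date(2017, 1, 1))
future = snapshot(model, date(2020, 1, 1))
evolved = rename_type(model, "Application", "ApplicationComponent")
```

Because the operator leaves the life-cycle dates untouched, snapshots of the evolved model cover the same entities at each date as snapshots of the original.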
access_time November 19, 2019 at 02:30PM
place Anfiteatro PA-3 (Piso -1, Pavilhão de Matemática), IST, Alameda
local_offer Doctoral exam
person Candidate: Nuno Miguel Carvalho da Silva
supervisor_account Advisor: Prof. Miguel Mira da Silva / Prof. Pedro Manuel Moreira Vaz Antunes de Sousa
Assessing Enterprise Governance of Information Technology using Multiple Reference Models
Enterprises are increasingly making tangible and intangible investments in improving the Enterprise Governance of IT (EGIT). In support of this, enterprises are drawing upon the practical relevance of generally accepted good-practice models, hereafter called Reference Models. Approximately 315 EGIT Reference Models have been identified, and the number of these models has since increased, as have their application areas. However, the implementation of any of these models requires specific experience, knowledge, and resources, along with a high degree of effort and investment. Therefore, although compelling in theory, EGIT Reference Models can be challenging to implement in practice. As a result, while many enterprises have recognized the importance of EGIT Reference Models, many have yet to implement them. Moreover, none of the EGIT Reference Models meets all the requirements that an organization needs to satisfy to benchmark its adherence to different regulations. As such, organizations need to select and implement processes from different EGIT Reference Models, which requires interoperability between those models. From the literature, we found and selected four research challenges to be addressed in this thesis. These research challenges were subsequently validated in practice. The research challenges are as follows:
• There is a lack of theoretical foundation regarding EGIT Reference Models, which allows varied interpretations of the models and leads to a lack of agreement, acceptance, and understanding of EGIT models due to their perceived complexity.
• There is a lack of a comprehensive approach for integrating EGIT Models, and so it is difficult to perform a simultaneous process assessment of multiple Reference Models.
• There is a lack of a method to perform cost-effective process assessments in multi-model environments, and so process assessments are costly and time-consuming.
• What are the characteristics of an EGIT organizational process maturity model that is aligned with the Reference Models for EGIT and is compliant with the ISO/IEC 33000 family of standards?
Using design science research as the main research methodology, several artifacts were designed, developed, demonstrated, and evaluated. To address the first research challenge, we propose the use of modeling techniques to represent EGIT Reference Models as conceptual metamodels, enabling a better understanding of the main concepts of each model and their relations, since these models can benefit from a rigid formalization and a systematic approach. To address the second research challenge, two different approaches are proposed. In the first, we again propose the use of modeling techniques, this time to map and integrate different EGIT Reference Models. In the second, we propose an approach that uses semantic similarity techniques to compare the core process assessment concepts of different Reference Models. To address the third research challenge, we propose an artifact in the form of a method that facilitates the selection and assessment of processes by organizations in multi-model environments. The method was then instantiated in a software tool. Finally, to address the fourth research challenge, we propose an Organizational Process Maturity Model for EGIT, based on the COBIT 5 PAM and compliant with the ISO/IEC 330xx family, that allows organizations to assess their overall process maturity level and improve their controls and governance practices. All the proposed artifacts can work in a standalone way to solve each of the defined research challenges, or they can be used together to perform a more robust process assessment, as explained in this document. The evaluation of the different artifacts is grounded in a combination of several methods, including semi-structured interviews.
access_time October 21, 2019 at 02:00PM
place Sala 4.41 (Piso 2, Pavilhão de Civil), IST, Alameda
local_offer Doctoral exam
person Candidate: Rafael Saraiva de Almeida
supervisor_account Advisor: Prof. Miguel Leitão Bignolas Mira da Silva
Bandwidth-aware page placement in asymmetric NUMA systems
Page placement is a critical problem for memory-intensive applications running on non-uniform memory access (NUMA) architectures. However, modern NUMA systems present complex memory and interconnect topologies, characterized by asymmetric bandwidths and latencies, and are sensitive to interference from memory contention and interconnect congestion. In this thesis, we show that the most common rule of thumb for page placement in NUMA systems fails to exploit the available memory bandwidth by substantial margins. We propose BWAP, a novel page placement mechanism that, unlike current state-of-the-art alternatives, is based on the principle of asymmetric weighted interleaving for optimizing memory access performance. Given a target application running on a subset of nodes in a NUMA machine, BWAP estimates a near-optimal weight distribution and places the shared pages of that application accordingly.
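The principle of weighted interleaving can be sketched as follows. This is only an illustration of spreading pages across nodes in proportion to given weights; the `weighted_interleave` function and the weights are hypothetical, and the sketch does not show how BWAP estimates its weight distribution or how pages are actually migrated.

```python
from collections import Counter


def weighted_interleave(num_pages, weights):
    """Assign page indices to NUMA nodes in proportion to per-node weights.

    `weights` maps a node id to a non-negative weight. At each step the page
    goes to the node that is furthest behind its proportional share, so heavier
    nodes receive more pages while the pages stay interleaved.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    placed = {node: 0 for node in weights}
    placement = []
    for page in range(num_pages):
        # Deficit = target share after this page minus pages already placed,
        # scaled by `total` to avoid division.
        node = max(
            weights,
            key=lambda n: weights[n] * (page + 1) - placed[n] * total,
        )
        placed[node] += 1
        placement.append(node)
    return placement


# Hypothetical weights: node 0 offers the most bandwidth to the application.
placement = weighted_interleave(10, {0: 5, 1: 3, 2: 2})
pages_per_node = Counter(placement)
```

With these example weights, the ten pages are split five, three, and two across nodes 0, 1, and 2, whereas uniform interleaving would give every node the same share regardless of its bandwidth.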
access_time October 21, 2019 at 10:30AM
place Sala 0.20, Pavilhão Informática II, IST, Alameda
local_offer CAT exam
person Candidate: David Daharewa Gureya
supervisor_account Advisor: Prof. João Pedro Faria Mendonça Barreto / Prof. Vladimir Vlassov
Improved Correctness and Scalability for Blockchains
Blockchain cryptocurrencies such as Ethereum offer a secure and decentralized transaction system and have the potential to replace legacy financial transaction systems. Despite their potential, they suffer from transaction ordering and admission problems. These stem from miners deciding both the transaction execution order and which transactions are admitted to the blockchain. Transaction censorship, transaction removal due to double-spending attacks, and long transaction commit delays are some of the resulting problems. In this dissertation, we will propose new algorithms to mitigate these fundamental problems, drawing on principles and techniques from epidemic broadcast algorithms and weakly consistent replication.
access_time September 25, 2019 at 02:00PM
place Sala C10, Pavilhão Central (Piso 1), IST, Alameda
local_offer CAT exam
person Candidate: Paulo Jorge Raposo Duarte Adrião Mendes da Silva
supervisor_account Advisor: Prof. João Pedro Faria Mendonça Barreto / Prof. Miguel Ângelo Marques de Matos