Speaker: Jarek Szlichta, Department of Computer Science, U of T
Title: "Bringing Order to Big Data"
Understanding the semantics of data is important for optimizing business intelligence queries and for data quality analysis. In this talk, we will present our holistic and extensible business intelligence and data cleaning techniques, which help to improve data analysis and data quality. Poor data quality is a barrier to effective, high-quality decision making based on data. Current data cleaning techniques apply mostly to traditional enterprise data rather than to big data, which is not only large but also more dynamic and heterogeneous. Declarative data cleaning, which encodes data semantics as constraints (rules) and treats violations of those constraints as errors, has emerged as an effective tool for both assessing and improving the quality of data. We have proposed a continuous data cleaning framework that can be applied to dynamic data. Our approach permits both the data and its semantics to evolve, and it suggests repairs based on evidence accumulated as statistics. We built a classifier that predicts the type of repair needed to resolve an inconsistency (data repair, constraint repair, or a hybrid of both) and learns from past user repair preferences to recommend more accurate repairs in the future.
As business intelligence applications have become more complex and data volumes have grown, the analytic queries needed to support these applications have become more complex too. This increasing complexity raises performance issues and numerous challenges for query optimization. We introduced order dependencies (ODs), which capture monotonicity properties in the data. Our main goal is to investigate the inference problem for ODs, both in theory and in practice. We have developed query optimization techniques that use ODs for business intelligence queries over data warehouses, and we have implemented these techniques in the IBM DB2 engine. We have shown how ODs can be used to improve the performance of real and benchmark analysis queries, providing an average speedup of 50%.
Jarek Szlichta is an Assistant Professor at the University of Ontario (from July 2014) and has been a Postdoctoral Fellow at the University of Toronto, working with Professor Renée Miller (2013-2014). His research concerns big data, business intelligence, data analytics, information integration, heterogeneous computing, systems, web search and machine learning. He received his doctoral degree from York University (2009-2013). During that time he held a three-year fellowship at the IBM Centre for Advanced Studies in Toronto. His research at IBM includes optimization of queries for business intelligence, with a focus on order dependencies. He is a recipient of the IBM Research Student-of-the-Year award (2012) "for having insights and perspective that has significantly contributed to IBM in a matter of great importance". Previously he worked at Comarch Research & Development on designing and implementing the OCEAN GenRap system, an innovative data analytics reporting solution. This work was recognized with the prestigious CeBIT Business Award (2007). For a list of publications, please visit Jarek's web page: www.cs.toronto.edu/~szlichta/publications.html
For additional information, contact: Jarek Szlichta