Dr. Raul Castro Fernandez
Title: Data Discovery
Organizations face a data discovery problem when their analysts spend more time looking for relevant data than analyzing it. This problem has become commonplace in modern organizations because: i) data is stored across multiple storage systems, from databases to data lakes; and ii) data scientists do not operate within the limits of well-defined schemas or a small number of data sources. Instead, to answer complex questions, they must access data spread across thousands of data sources.
To address this problem we have built AURUM, a system to tackle data discovery problems. AURUM introduces a new discovery algebra, called the Source Retrieval Query Language (SRQL), that lets users declaratively search for relevant data sources through a set of primitives that expose the relations of the underlying data; examples of such relations are syntactic, such as content similarity or attribute similarity, but also semantic, for which we use a new technique to discover relevant links between attributes using different forms of knowledge representations, such as ontologies, knowledge bases, or human input. Supported operations of the algebra include keyword search on schemas and values as well as relatedness operations on pairs of attributes, groups of tables, and join paths. The algebra utilizes an enterprise knowledge graph (EKG) to answer queries with sub-second latencies. AURUM is scalable: it builds the EKG in linear time, despite the complexity of extracting all relationships among thousands of sources.
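To make the idea of querying a graph of relations concrete, here is a minimal Python sketch. It is not AURUM's API or SRQL syntax: the class `ToyEKG`, its method names, the relation label `content_sim`, and the sample tables are all hypothetical, chosen only for illustration. It also uses a naive all-pairs Jaccard comparison to find content-similar attributes, which is quadratic and therefore not the linear-time construction described above.

```python
from collections import defaultdict


class ToyEKG:
    """Toy stand-in for an enterprise knowledge graph (hypothetical, not AURUM's API)."""

    def __init__(self):
        # Node = (table, attribute); maps each node to the set of its values.
        self.columns = {}
        # relation name -> node -> set of related nodes
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_column(self, table, attribute, values):
        self.columns[(table, attribute)] = set(values)

    def build_content_similarity(self, threshold=0.5):
        # Naive all-pairs Jaccard similarity: quadratic, for illustration only.
        nodes = sorted(self.columns)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                va, vb = self.columns[a], self.columns[b]
                union = va | vb
                if union and len(va & vb) / len(union) >= threshold:
                    self.edges["content_sim"][a].add(b)
                    self.edges["content_sim"][b].add(a)

    def schema_search(self, keyword):
        # Keyword search over attribute names.
        kw = keyword.lower()
        return sorted(n for n in self.columns if kw in n[1].lower())

    def value_search(self, keyword):
        # Keyword search over attribute values (exact match).
        return sorted(n for n, vals in self.columns.items() if keyword in vals)

    def neighbors(self, node, relation):
        # Attributes linked to `node` by the given relation.
        return sorted(self.edges[relation][node])


ekg = ToyEKG()
ekg.add_column("employees", "dept_id", ["d1", "d2", "d3"])
ekg.add_column("departments", "id", ["d1", "d2", "d3", "d4"])
ekg.add_column("departments", "name", ["sales", "hr"])
ekg.build_content_similarity()

hits = ekg.schema_search("dept")
related = ekg.neighbors(hits[0], "content_sim")
print(hits, related)
```

In this toy example, a keyword search on schemas finds `employees.dept_id`, and following the content-similarity relation from it leads to `departments.id`, a plausible join candidate. A real system would precompute such relations at scale rather than compare every pair of attributes.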
In the talk I'll motivate the problem of data discovery, explain the technical details and decisions we made while designing AURUM, and give details on how different organizations use and benefit from the system.
I'm a postdoctoral researcher in the database group at MIT, working with professors Sam Madden and Mike Stonebraker. At MIT I'm working on approaches to facilitate how people search for and identify relevant data. Before MIT, I completed my PhD at Imperial College London, under the supervision of Peter Pietzuch. In my PhD I focused on building systems to process large amounts of data efficiently. In particular, I worked on stateful data-parallel processing, which has been used to implement scalable deep network training pipelines, recommendation systems, etc. Broadly, I'm interested in building systems and designing abstractions to help people work with data more efficiently.
For additional information, contact Eric Zhu.