Dr. Dong Deng, Massachusetts Institute of Technology
Abstract: In this talk, I will discuss the similarity join problem: given a collection of strings (or sets), find all similar string (or set) pairs in the dataset under a specific similarity function. Similarity join plays an essential role in many applications, such as personalized recommendation and collaborative filtering, entity resolution, near-duplicate detection, data cleaning, data integration, and machine learning. A similarity function measures the similarity between two strings (or sets), and two strings (or sets) are said to be similar if and only if their similarity exceeds a given threshold. We will cover set-based similarity functions, both normalized ones (e.g., Jaccard similarity) and unnormalized ones (e.g., overlap size), as well as sequence-based similarity functions (e.g., edit distance and edit similarity). While most existing methods use the prefix-filter-based framework, we will present two dramatically different ways to address these problems. As we will show in this talk, our proposed methods outperform the state of the art both in practice and in theory.
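To make the problem statement concrete, the following is a minimal brute-force sketch of a similarity join. It is not one of the methods presented in the talk; it simply compares every pair of records and is included only to illustrate the definitions of Jaccard similarity, overlap size, and edit distance. The function names, and the choice to treat a pair as similar when its score is at least the threshold, are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Normalized set similarity: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def overlap(a, b):
    """Unnormalized set similarity: |a ∩ b|."""
    return len(set(a) & set(b))


def edit_distance(s, t):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def naive_similarity_join(records, sim, threshold):
    """Return index pairs (i, j), i < j, whose similarity is >= threshold.

    Quadratic in the number of records; illustrative baseline only.
    """
    return [(i, j)
            for (i, x), (j, y) in combinations(enumerate(records), 2)
            if sim(x, y) >= threshold]


records = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}]
pairs = naive_similarity_join(records, jaccard, 0.5)
```

Because this baseline examines all pairs, its cost grows quadratically with the dataset size, which is the cost that filtering frameworks such as prefix filtering aim to avoid.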
Biography: Dong Deng is a postdoctoral associate in the Database Group at MIT CSAIL, where he works with Mike Stonebraker and Sam Madden on the Data Civilizer warehouse curation system. His research interests include data integration and data cleaning, program synthesis, data management, database systems, and data usability, with a focus on developing end-to-end systems and scalable algorithms for managing real-world data, including but not limited to relational data, open data, web tables, enterprise data, data warehouses, logs, heterogeneous data, and knowledge bases. He received his Ph.D. from Tsinghua University with the highest dissertation award. He is a recipient of the prestigious Siebel Scholarship, Microsoft PhD Fellowship, Google PhD Fellowship, Intel Scholarship, and Boeing Scholarship.
Host: Renée J. Miller