Title: Data Integration in the Cloud
Speaker: Andreas Thor, University of Maryland
BA 7256, July 14th, noon
Abstract: Cloud computing has become a popular paradigm for efficiently processing compute-intensive and data-intensive tasks. Such tasks can be executed on demand on powerful distributed hardware and service infrastructures. The parallel execution of complex tasks is facilitated by different programming models (e.g., MapReduce), distributed data stores, and the ability to employ computing capacity on demand. Data integration can benefit notably from cloud computing because accessing multiple data sources and integrating instance data are usually expensive tasks.
In the first part of the talk we introduce CloudFuice, a data integration system that follows a mashup-like specification of advanced data flows for data integration. CloudFuice's task-based execution approach allows for an efficient, asynchronous, and parallel execution of data flows in the cloud and utilizes recent cloud-based web engineering instruments. The second part of the talk deals with the effectiveness and scalability of MapReduce-based implementations for entity resolution. In the presence of skewed data, sophisticated redistribution approaches become necessary to achieve load balancing among all reduce tasks to be executed in parallel. The proposed approaches support blocking techniques to reduce the search space of entity resolution and effectively distribute the entities of large blocks among multiple reduce tasks.
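To illustrate the load-balancing idea in the second part, here is a minimal single-process sketch in Python. It is not the speaker's implementation and does not use a MapReduce framework. It only simulates the idea under stated assumptions: entities are grouped by a blocking key, blocks above a size limit are split into chunks, each chunk and each pair of chunks becomes a separate match task, and match tasks are assigned greedily to the least-loaded reduce task. All names (block_entities, split_into_match_tasks, assign_to_reducers, max_block_size) are hypothetical and chosen for this example.

```python
from collections import defaultdict
from itertools import combinations
import heapq


def block_entities(entities, key_fn):
    """Group entities by blocking key; only entities in the same block are compared."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[key_fn(entity)].append(entity)
    return blocks


def split_into_match_tasks(blocks, max_block_size):
    """Turn blocks into match tasks of the form (left, right).

    right is None for a task that compares all pairs within left;
    otherwise the task compares every entity in left with every entity in right.
    """
    tasks = []
    for members in blocks.values():
        chunks = [members[i:i + max_block_size]
                  for i in range(0, len(members), max_block_size)]
        for chunk in chunks:
            tasks.append((chunk, None))          # pairs inside one chunk
        for left, right in combinations(chunks, 2):
            tasks.append((left, right))          # pairs across two chunks
    return tasks


def task_pairs(task):
    """Enumerate the candidate pairs a match task has to compare."""
    left, right = task
    if right is None:
        return list(combinations(left, 2))
    return [(a, b) for a in left for b in right]


def assign_to_reducers(tasks, num_reducers):
    """Greedy assignment: largest task first, always to the least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer index)
    assignment = [[] for _ in range(num_reducers)]
    for task in sorted(tasks, key=lambda t: len(task_pairs(t)), reverse=True):
        load, r = heapq.heappop(heap)
        assignment[r].append(task)
        heapq.heappush(heap, (load + len(task_pairs(task)), r))
    return assignment


# Skewed toy input: one large block ("a") and one small block ("b").
entities = [f"a{i}" for i in range(8)] + ["b0", "b1"]
blocks = block_entities(entities, key_fn=lambda e: e[0])
tasks = split_into_match_tasks(blocks, max_block_size=4)
assignment = assign_to_reducers(tasks, num_reducers=3)
loads = [sum(len(task_pairs(t)) for t in reducer) for reducer in assignment]
```

Splitting does not change which pairs are compared: the union of pairs over all match tasks equals the set of pairs within the original blocks. It only changes how the comparison work is spread, so a single large block no longer ends up on one reduce task.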
Bio: Andreas Thor received a Diploma and a Ph.D. in Computer Science in 2002 and 2008, respectively, from the University of Leipzig, Germany. He holds an appointment as Research Scientist with the database group in Leipzig. Andreas is currently a visiting research scientist at the University of Maryland Institute for Advanced Computer Studies. His research focuses on the integration of web data sources. More specifically, he has been working on approaches for entity resolution, ontology alignment, and flexible integration architectures.