Speaker: Stephan Ewen
Technical University, Berlin
Title: Spinning Fast Iterative Data Flows and an Overview of Stratosphere
Parallel data flow systems (like databases or MapReduce systems) are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel data flow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory.
We propose a method to integrate "incremental iterations", a form of workset iterations, with parallel data flows. After showing how to integrate bulk iterations into a data flow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension compensates for the lack of mutable state in data flows and allows for exploiting the "sparse computational dependencies" inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved data flow system is highly competitive with specialized systems while maintaining a transparent and unified data flow abstraction. In addition to the discussion of iterative algorithms, the talk will give an overview of the Stratosphere Query Processor, our current parallel runtime for data analytical tasks. We will talk about programming abstractions, UDF optimization and parallel execution.
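To make the difference between bulk and workset iterations concrete, the following is a minimal single-machine sketch in plain Python, using connected components as the example algorithm. It is not Stratosphere code and does not use its API; the function names (build_adjacency, cc_bulk, cc_workset) and the "processed" counter are assumptions made purely for illustration. The bulk variant recomputes every vertex in every superstep, while the workset variant only processes vertices whose label changed in the previous superstep and updates a solution set in place.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def cc_bulk(vertices, edges):
    """Bulk iteration: every superstep recomputes the label of every vertex.

    Returns (labels, processed), where processed counts vertex evaluations.
    """
    adj = build_adjacency(edges)
    labels = {v: v for v in vertices}
    processed = 0
    while True:
        new_labels = {}
        for v in vertices:
            # New label is the minimum over the vertex and its neighbors.
            new_labels[v] = min([labels[v]] + [labels[n] for n in adj[v]])
            processed += 1
        if new_labels == labels:  # fixpoint reached
            return labels, processed
        labels = new_labels


def cc_workset(vertices, edges):
    """Workset (incremental) iteration: only changed vertices do work.

    The solution set holds the current label of every vertex; the workset
    holds the vertices whose label changed in the previous superstep.
    Returns (labels, processed), where processed counts workset elements.
    """
    adj = build_adjacency(edges)
    solution = {v: v for v in vertices}
    workset = set(vertices)
    processed = 0
    while workset:
        candidates = {}  # neighbor -> smallest improving label proposed
        for v in workset:
            processed += 1
            for n in adj[v]:
                if solution[v] < candidates.get(n, solution[n]):
                    candidates[n] = solution[v]
        # Apply the improvements; changed vertices form the next workset.
        workset = set()
        for n, label in candidates.items():
            solution[n] = label
            workset.add(n)
    return solution, processed


if __name__ == "__main__":
    vertices = [1, 2, 3, 4, 5, 6]
    edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
    print(cc_bulk(vertices, edges))
    print(cc_workset(vertices, edges))
```

Both functions converge to the same component labels; on graphs where most vertices settle early, the workset variant touches far fewer elements per superstep, which is the kind of sparse computational dependency the abstract refers to.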
Stephan Ewen is a research associate at the department for Database Systems and Information Management (DIMA) at the Berlin University of Technology. He is working on the Stratosphere Project that aims at creating a versatile and efficient analytics engine for deep analysis of Big Data on cloud platforms. Within the project, Stephan works on the system's data flow programming abstraction, the data flow optimization and the parallel runtime system. Prior to joining the DIMA group, Stephan completed the “Applied Computer Science” program at the University of Cooperative Education Stuttgart jointly with IBM Germany Ltd and got his Diploma from the University of Stuttgart. In the course of his studies, Stephan Ewen worked, among others, for the IBM Almaden Research Centre and the IBM Development Laboratory Böblingen.