Workshop on Distributed and Stream Data Processing and Machine Learning at Smart Data Forum Berlin

The workshop “Distributed and Stream Data Processing and Machine Learning” by the Berlin Big Data Center (BBDC) was held on 22 March 2019 at the Smart Data Forum (SDF) showroom in Berlin Charlottenburg. An audience of about 70 people came to hear four insightful presentations on data processing and machine learning.

The organizer of the workshop, Prof. Dr. Volker Markl, welcomed the guests and spoke about the latest research of the BBDC as well as of the two research groups “Database Systems and Information Management” (DIMA) at TUB and “Intelligent Analytics for Massive Data” (IAM) at DFKI. Focusing on technical insights into systems, he presented the best research papers that had been published and awarded at prestigious conferences in recent years. Markl also mentioned that the origins of the open source big data analysis framework Apache Flink can be traced back to DIMA, and that the startup data Artisans (now Ververica), a spin-off from DIMA at TU Berlin, has built its business on Flink and commercialized it very successfully. In his view, it is now time to move on and develop new systems in a different environment.

Albert Bifet, Professor at Telecom ParisTech, head of the Data, Intelligence and Graphs Group (DIG) as well as research associate at the WEKA Machine Learning Group, gave a speech on Machine Learning for Data Streams. According to him, data availability and computational scale have increased substantially in recent years, which is why machine learning is gaining relevance and developing faster than ever before. Prof. Bifet also referred to the French AI strategy and discussed some of its key challenges. He sees Green AI as one priority of the French AI strategy, for which he suggests turning big data into small data in order to make data stream methods more efficient. Moreover, he addressed the challenges of explainable AI as well as ethical issues arising from these technological developments.

Amr El Abbadi, Professor of Computer Science at the University of California, Santa Barbara, continued with a talk on The Cloud, the Edge and Blockchains: Unifying Themes and Challenges. He spoke about how to manage large data sets, focusing mainly on scalability, availability and fault tolerance as well as the consistency of data that is located in different places. The professor argued that the fields of distributed systems, data management and cryptography need to be further integrated in order to better address issues like throughput and latency. El Abbadi also explained the advantages and limitations of blockchains from a database and distributed computing perspective.

Seif Haridi and Paris Carbone concluded the workshop by presenting the topic From Stream Processing to Continuous and Deep Analytics. Both are among the main developers of Apache Flink, and they introduced their new projects and visions for the future, in particular the challenges of going from distributed stream processing to continuous deep analytics (CDA).

Seif Haridi, Chief Scientific Advisor of RISE SICS and Chair-Professor of Computer Systems at KTH Royal Institute of Technology Stockholm, talked about the current research projects of his team: the startup Logical Clocks develops Hopsworks, which is a key project at RISE SICS.

Paris Carbone, senior researcher at the Swedish Institute of Computer Science, which is associated with RISE, presented their current research on CDA. The mission is to design systems that make it possible to go efficiently from data to decision making. The project aims to provide a unified approach to declaring and executing analytical tasks and to integrating them seamlessly with continuous services, streams and data-driven applications at scale. This is especially relevant as there are more data-centric workloads than ever before, such as relational data streams, dynamic graphs, simulation tasks, feature learning and many more. Additionally, he explained Arc, the intermediate language used to capture batch and stream analytics, as well as the sophisticated distributed runtime behind it.

Program

08:30 Meet and coffee
09:00 Introduction by Volker Markl
09:30 Talk by Albert Bifet “Machine Learning for Data Streams” and Discussion
10:30 Talk by Amr El Abbadi “The Cloud, the Edge and Blockchains: Unifying Themes and Challenges” and Discussion
11:30 Talk by Seif Haridi/Paris Carbone “From Stream Processing to Continuous and Deep Analytics” and Discussion

Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). At the German Research Center for Artificial Intelligence (DFKI), he is both a Chief Scientist and Head of the Intelligent Analytics for Massive Data Research Group. In addition, he is Director of the Berlin Big Data Center (BBDC) and Co-Director of the Berlin Machine Learning Center (BZML). Earlier in his career, he was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA and a Research Group Leader at FORWISS, the Bavarian Research Center for Knowledge-based Systems located in Munich, Germany. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 18 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been both the Speaker and Principal Investigator for the Stratosphere Project, which resulted in a Humboldt Innovation Award as well as Apache Flink, the open-source big data analytics system. He serves as the President-Elect of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on “Implicit Parallelism Through Deep Language Embedding.”

Abstract: Big Data and the Internet of Things (IoT) have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from the Internet of Things (IoT) has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of data mining research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, and I will introduce some popular open source tools for data stream mining.

Bio: Albert Bifet is Professor at Telecom ParisTech, Head of the Data, Intelligence and Graphs (DIG) Group, and Honorary Research Associate at the WEKA Machine Learning Group at University of Waikato. Previously he worked at Huawei Noah's Ark Lab in Hong Kong, Yahoo Labs in Barcelona, University of Waikato and UPC BarcelonaTech. He is the co-author of a book on Machine Learning from Data Streams. He is one of the leaders of the MOA and Apache SAMOA software environments for implementing algorithms and running experiments for online learning from evolving data streams. He has served as Co-Chair of the Industrial track of IEEE MDM 2016 and ECML PKDD 2015, and as Co-Chair of BigMine (2012-2018) and the ACM SAC Data Streams Track (2012-2019).

Abstract: Significant paradigm shifts are occurring in the way data is accessed and updated. Data is “very big” and distributed across the globe. Access patterns are widely dispersed and large scale analysis requires real-time responses. Many of the fundamental challenges have been studied and explored by both the distributed systems and the database communities for decades. However, the current changing and scalable setting often requires a rethinking of basic assumptions and premises. The rise of the cloud computing paradigm with its global reach has resulted in novel approaches to integrate traditional concepts in novel guises to solve fault-tolerance and scalability challenges. This is especially the case when users require real-time global access. Exploiting edge cloud resources becomes critical for improved performance, which requires a reevaluation of many paradigms, even for a traditional problem like caching. The need for transparency and accessibility has led to innovative ways for managing large scale replicated logs and ledgers, giving rise to blockchains and their many applications. In this talk we will explore some of these new trends while emphasizing the novel challenges they raise from both distributed systems and database points of view. We will propose a unifying framework for traditional consensus and commitment protocols, and discuss novel protocols that exploit edge computing resources to enhance performance. We will highlight the advantages and discuss the limitations of blockchains. Our overall goal is to explore approaches that unite and exploit many of the significant efforts made in distributed systems and databases to address the novel and pressing needs of today’s global computing infrastructure.

Bio: Amr El Abbadi is a Professor of Computer Science at the University of California, Santa Barbara. He received his B. Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. Prof. El Abbadi is an ACM Fellow, AAAS Fellow, and IEEE Fellow. He was Chair of the Computer Science Department at UCSB from 2007 to 2011. He has served as an editor for several journals, including The VLDB Journal, IEEE Transactions on Computers and The Computer Journal. He has been Program Chair for multiple database and distributed systems conferences. He currently serves on the executive committee of the IEEE Technical Committee on Data Engineering (TCDE) and was a board member of the VLDB Endowment from 2002 to 2008. In 2007, Prof. El Abbadi received the UCSB Senate Outstanding Mentorship Award for his excellence in mentoring graduate students. In 2013, his student Sudipto Das received the SIGMOD Jim Gray Doctoral Dissertation Award. Prof. El Abbadi is also a co-recipient of the Test of Time Award at EDBT/ICDT 2015. He has published over 300 articles in databases and distributed systems and has supervised over 35 PhD students.

Abstract: Contemporary end-to-end data pipelines need to combine many diverse workloads such as machine learning, relational operations, stream dataflows, tensors and graphs. For each of these types of workloads there exist several frontends (e.g., SQL, Beam, TensorFlow etc.) exposed in different programming languages as well as different runtimes (e.g., Spark, Flink, TensorFlow) that optimise for a respective frontend and possibly a hardware architecture (e.g., GPUs). The resulting pipelines suffer in terms of complexity and performance due to excessive type conversions, materialization of intermediate results and lack of cross-frontend computation sharing capabilities.

In this talk we present the Continuous Deep Analytics (CDA) project, the core principles behind it and our past work that influenced its conception. CDA aims to provide a unified approach to declare and execute analytical tasks across frontend boundaries as well as to enable their seamless integration with continuous services, streams and data-driven applications at scale. The system achieves that through Arc, an intermediate language that captures batch and stream analytics, as well as a sophisticated distributed runtime that combines and augments existing ideas from stream processing, in-memory databases and cluster computing.

Bio Paris Carbone: He is a senior researcher at the Swedish Institute of Computer Science (part of RISE). He holds a PhD in distributed computing from KTH and is one of the core committers for Apache Flink with key contributions to its state management. Paris is currently leading the Distributed Computing & Data Science research group at SICS, whose interests span several domains of computer science from distributed algorithms and data management to declarative programming support for data analytics and ML.

Bio Seif Haridi: He is the Chief Scientific Advisor of RISE SICS. He is Chair-Professor of Computer Systems specialized in parallel and distributed computing at KTH Royal Institute of Technology, Stockholm, Sweden. He led a European research program on Cloud Computing and Big Data by EIT-Digital from 2010 to 2013, and is a co-founder of a number of start-ups in the area of distributed and cloud computing including HiveStreaming and Logical Clocks. Recent research includes contributions to the design of Apache Flink for stream processing, and HOPS, a complete platform for data analytics.