The First Berlin Big Data Center Symposium Held in Berlin
Since the BBDC started in October 2014 as a big data competence center in Berlin, funded by the German Federal Ministry of Education and Research (BMBF), the participating research groups have taken on the challenge of solving complex, advanced analysis problems on huge amounts of data with new scientific approaches.
The president of TU Berlin, Prof. Dr. Christian Thomsen, opened the event and welcomed the roughly 70 participants. Noting the importance of big data research, he congratulated the BMBF on its initiative to support big data centers of excellence.
Then Dr. Tilmann Rabl, the scientific coordinator of the BBDC, Prof. Dr. Volker Markl, the director of the BBDC, and Prof. Dr. Klaus-Robert Müller, the co-director, presented the work of the BBDC, its research projects, and the BBDC’s success story in detail.
Further guest speakers from ScaDS Dresden/Leipzig, University of Mannheim, and Zalando SE as well as BBDC principal investigators also presented their research and engaged in discussion on the current state of the art in various fields of big data analytics and data management, such as data integration, text analytics, and data mining.
Two Hours of Demonstrations and Networking
The event also hosted demonstrations from all BBDC partners: Technische Universität Berlin (TUB), German Research Center for Artificial Intelligence (DFKI), Zuse Institute Berlin (ZIB), Fritz Haber Institute of the Max Planck Society (MPI/FHI), and Beuth University of Applied Sciences Berlin. After the talks, the participants could gain detailed insight into the work of the BBDC through the demonstrators and speak with the researchers.
A Fast Image Retrieval System with Adjustable Objectives
(TU Berlin IDA, TU Berlin IC / Fraunhofer HHI)
In this retrieval system, a user specifies a preference (simulated by an image classifier) and a query image, and the system quickly returns relevant images that fit the preference and resemble the query image. The balance between user preference and image similarity is adjusted at query time, which is made possible by our new method, called multiple purpose locality sensitive hashing (mpLSH). We are developing more accurate and parallel versions, and plan to apply the method to other applications, for example, material search with some properties optimized and video recommendation with adjustable weights for sets of features.
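The query-time trade-off described above can be illustrated with a minimal brute-force sketch. It is not mpLSH itself: it scores every image exhaustively instead of hashing, and the function names, the linear preference model, the `weight` parameter, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_images(query, preference, database, weight, top_k=3):
    """Rank images by a weighted mix of preference score and similarity.

    weight = 0.0 ranks purely by similarity to the query image,
    weight = 1.0 ranks purely by the (hypothetical) linear preference score.
    """
    scored = []
    for name, features in database.items():
        pref_score = sum(w * x for w, x in zip(preference, features))
        sim_score = cosine(query, features)
        scored.append((weight * pref_score + (1.0 - weight) * sim_score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy two-dimensional features, invented for this example.
database = {
    "beach": [0.9, 0.1],
    "forest": [0.1, 0.9],
    "sunset": [0.7, 0.3],
}
query = [1.0, 0.0]       # the query image resembles "beach"
preference = [0.0, 1.0]  # the simulated classifier favors the second feature

print(rank_images(query, preference, database, weight=0.0))
print(rank_images(query, preference, database, weight=1.0))
```

Changing `weight` between the two calls reorders the results without rebuilding any index, which is the behavior the demo attributes to adjusting the balance at query time; a hashing-based method aims to achieve the same effect without scanning the whole database.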
Interpretable Compressed Domain Video Annotation
(TU Berlin IC / Fraunhofer HHI, TU Berlin IDA)
Compressed-domain human action recognition algorithms are extremely efficient because they require only a partial decoding of the video bit stream. In this demo, we present an annotation system based on motion vector histograms, a Fisher Vector representation, and a linear SVM classifier. With our recently developed layer-wise relevance propagation (LRP) technique, we visualize exactly what makes the algorithm decide on a particular action class; thus we can identify where and when the important action happens in the video.
Data-local Medical Workflow Execution with XtreemFS
(FU Berlin / ZIB)
We demonstrate an interactive dashboard for tracking and analyzing personal medical data. We show why continuous collection and analysis of health-related data, including very large omics-type data, allows better health management for an individual than classic (i.e., more or less sporadic) collection and analysis of such data. Events of medical relevance, e.g., blood-based proteome and genome test results, are shown on a timeline with accompanying information and visualizations. The pipeline providing the application with data is implemented using two key building blocks: (1) XtreemFS, a distributed and fault-tolerant file system in which common medical devices can transparently store their data, and (2) Apache Flink, which can process the very large datasets in a parallel and distributed fashion. We will demonstrate the concrete workflow and show where the newly developed technology comes into play.
Big Data Text Analytics Platform for Real Time Information Extraction
We will demonstrate a big data text analytics platform that can process free text of various genres, such as social media posts (e.g., Twitter), news, and RSS feeds. The platform is built on top of the Apache Flink big data platform. Its linguistic and analytic pipeline can therefore handle real-time textual data by performing segmentation, POS tagging, named entity recognition, entity linking, and relation extraction. In cooperation with the two smart data projects SD4M and SDW, funded by BMWi, we have applied our platform to two application domains, namely mobility monitoring and supply chain management.
Crystal Structure Prediction
We present a web-based implementation of a data-analysis tool for recognizing the similarity among crystal structures and for predicting the difference in formation energy among them. The tool gathers the data for the analysis through a Flink-based query to the NOMAD Archive, which contains several million crystal configurations. The similarity-recognition algorithm is based on descriptors that encode the proper symmetries of a well-behaved physical representation and makes use of linear and non-linear low-dimensional embedding methods. For given pairs of crystal structures, it produces a 2-dimensional map that assigns perfect and distorted configurations to separate regions. The algorithm predicting the difference in formation energies selects the model out of thousands of candidates by means of a compressed-sensing-based method.
Introducing On-Demand Bandwidth Guarantees to Data Analytics
(TU Berlin INET)
A fundamental challenge for Big Data applications is that they are exposed to widely varying data volumes and yet are expected to finish within a fixed amount of time. To meet the challenge, they are commonly deployed in shared data centers, where the number of computing units and disks can be scaled freely. Unfortunately, variations in the load across different applications can lead to unpredictable performance of the network, which is shared by all applications.
In this demonstration, we show how bandwidth guarantees can improve the runtime of a single Big Data application. In this simple scenario, we have deployed the Big Data application together with a cross-traffic generator in a minimal network setup. We have implemented the bandwidth guarantee through the software-defined networking (SDN) technology OpenFlow.
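The effect of a guarantee in such a setup can be estimated with a back-of-the-envelope model. The sketch below is not the demonstrated OpenFlow implementation: it assumes that, without a guarantee, the link is shared equally between the application and the cross-traffic flows, and all names and numbers are invented for illustration.

```python
def transfer_seconds(data_gbit, link_gbps, cross_flows, guaranteed_gbps=None):
    """Estimate the time to move `data_gbit` gigabits over a shared link.

    Assumption: without a guarantee, the application gets an equal share of
    the link alongside `cross_flows` competing flows. With a guarantee, it
    gets at least the guaranteed rate (capped at the link capacity).
    """
    fair_share = link_gbps / (1 + cross_flows)
    if guaranteed_gbps is None:
        rate = fair_share
    else:
        rate = max(fair_share, min(guaranteed_gbps, link_gbps))
    return data_gbit / rate


# Hypothetical scenario: 80 Gbit of shuffle data, a 10 Gbit/s link,
# and three cross-traffic flows competing for the same link.
without_guarantee = transfer_seconds(80, 10, cross_flows=3)
with_guarantee = transfer_seconds(80, 10, cross_flows=3, guaranteed_gbps=8)
print(without_guarantee, with_guarantee)
```

In this toy scenario the transfer takes 32 seconds on a fair share of 2.5 Gbit/s and 10 seconds with an 8 Gbit/s guarantee. Just as important, the guaranteed case no longer depends on how many cross-traffic flows are active, which is what makes the runtime predictable.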
Emma: Declarative Dataflows for Scalable Data Analysis
(TU Berlin DIMA)
State-of-the-art analytics platforms such as Flink and Spark expose a number of data-parallel APIs for bulk data analysis. Unfortunately, the usability of these APIs is hindered, as they are either (a) embedded in a general-purpose host language, but too low-level, or (b) declarative, but isolated as stand-alone languages. In this demo, we explain the cause and the effects of this problem and showcase Emma -- an attempt to reconcile this mismatch through better integration of database and language compilation technology.
Freeze: Isolated Local Debugging for Large Scale Big Data Analytics Systems
(Beuth Hochschule Berlin)
In this demonstration, we present Freeze, a subset extraction and replay tool for big data processing engines such as Apache Flink. The demonstration shows how Freeze extracts a subset of the data for a specific use case and how the local replay process supports the analysis. This allows the errors that were the root cause of the distributed system failure to be visualized quickly in the developer's debugging environment.
Adaptive Resource Management for Flink on Hadoop
(TU Berlin CIT)
Many important data-analysis jobs are executed repeatedly in production on clusters. Examples include daily batch jobs and iterative programs. For monitoring such jobs, we developed Freamon, a job-level cluster monitoring system that collects detailed job profiles, including resource utilization, job runtimes, data placements, and job stages. Using Freamon, we are actively researching mechanisms for adaptive resource management for distributed dataflow systems. In particular, we extended Apache Hadoop's container scheduling and data placement to improve resource utilization and decrease job runtimes for distributed dataflow jobs. Furthermore, we developed a tool that allows users to reserve resources for specific target runtimes using the historical job data available with Freamon.
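The idea of reserving resources for a target runtime from historical data can be sketched as a simple lookup. This is not Freamon's actual mechanism or API: the function name, the `(containers, runtime)` record format, and the conservative worst-case rule are assumptions made for illustration.

```python
def containers_for_target(history, target_seconds):
    """Pick the smallest container count that has met the target so far.

    `history` is a list of (containers, runtime_seconds) pairs from earlier
    runs of the same recurring job (a hypothetical record format). For each
    container count, the slowest observed run is used as a conservative
    runtime estimate. Returns None if no observed configuration is fast enough.
    """
    worst_runtime = {}
    for containers, runtime in history:
        worst_runtime[containers] = max(runtime, worst_runtime.get(containers, 0.0))
    feasible = [c for c, r in worst_runtime.items() if r <= target_seconds]
    return min(feasible) if feasible else None


# Invented history of a daily batch job: (containers, runtime in seconds).
history = [(4, 900), (8, 500), (8, 560), (16, 300)]

print(containers_for_target(history, target_seconds=600))
print(containers_for_target(history, target_seconds=520))
print(containers_for_target(history, target_seconds=100))
```

With this invented history, a 600-second target is met with 8 containers, a 520-second target needs 16 because one 8-container run took 560 seconds, and a 100-second target cannot be met by any observed configuration. A real system would more likely fit a scaling model to the profiles rather than rely only on configurations that were already observed.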