Big Data Excellence in Germany and UK
This event by the Berlin Big Data Center and the UK Science and Innovation Network was held on March 1st, 2017. The Smart Data Forum (SDF) hosted the event, where an audience of about 70 people heard about the UK's big data landscape and the Berlin Big Data Center (BBDC) project.
During three sessions, 11 speakers spoke about theoretical topics such as data analytics, big data management, and machine learning, as well as application areas such as e-health and food safety. They also discussed data security, data governance, and how these important issues are handled in the UK and Germany, which drew a lot of attention.
There was also a discussion about how the UK and Germany are building national networks for interdisciplinary collaboration and how researchers collaborate with government and industry in practice.
| Time | Session | Speakers |
|---|---|---|
| 09:30 – 09:45 | Arrival and Coffee | |
| 09:45 – 10:00 | Short Welcome | |
| 10:00 – 11:30 | Presentation of UK Landscape and BBDC | Sofia Olhede (UCL Big Data Institute), Volker Markl (BBDC) |
| 11:30 – 12:00 | BBDC Success Stories / Data Streaming and Programmability | Alexander Alexandrov (TU Berlin), Tilmann Rabl (BBDC) |
| 12:00 – 13:00 | E-Health | Tim Conrad (FU Berlin / BBDC), Harry Hemingway (UCL / Farr Institute London) |
| 13:00 – 14:00 | Lunch | |
| 14:00 – 15:00 | Big Data Management | Ziawasch Abedjan (TU Berlin / BBDC), Peter Pietzuch (Imperial College London) |
| 15:00 – 16:00 | Machine Learning | Klaus-Robert Müller (TU Berlin / BBDC), Patrick Wolfe (UCL Big Data Institute) |
| 16:00 – 16:30 | Coffee | |
| 16:30 – 16:45 | Using Data for Policy in the UK | Sian Thomas (Food Standards Agency UK) |
| 16:45 – 18:00 | Open Discussion and Q&A | Panel: Sofia Olhede, Patrick Wolfe, Sian Thomas, Volker Markl |
| 18:00 – 20:00 | Demos, Drinks and Networking | |
Reflecting on the event
Presentation of UK Landscape and BBDC
Prof. Olhede gave an overview of the big data landscape from varying perspectives, including the UK and abroad. She spoke about groundbreaking advances, such as new big data paradigms, prediction vs. estimation, and the current and future importance of decision-making. Additionally, she discussed ethical issues, such as user privacy and the transparency of data use. She briefly described some data governance services, as well as some big data research initiatives around the world.
Prof. Markl’s presentation offered an overview of the Berlin Big Data Center (BBDC), including its research, innovation, and education-oriented activities. He talked about one big challenge, namely, building a big data analytics system that allows for the automatic translation, execution, and optimization of data analysis programs for any underlying architecture, while coping with various data distributions and workloads. The development of such a system would empower data scientists to focus on the core analysis problem, as opposed to worrying about ensuring satisfactory runtimes and scalability by low-level systems programming. From a broader perspective, the vision encompasses a mosaic of theories, systems, and hardware technologies.
BBDC Success Stories / Data Streaming and Programmability
Dr. Rabl emphasized a core BBDC technology, namely, Apache Flink. He described the basic architecture, its stream processing capabilities, memory management, and a range of applications. He compared Apache Flink with other streaming engines and demonstrated how Flink fits into the BBDC technology stack.
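The talk itself showed no code, but the core idea behind Flink-style stream processing — keyed aggregation over time windows — can be sketched in plain Python. This is a toy single-process stand-in, not Flink's actual API; the function and event names are illustrative:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Toy stand-in for keyed, windowed stream aggregation as performed by
    engines like Apache Flink: assign each (timestamp, key) event to a
    fixed-size (tumbling) window and count occurrences per key per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        window_start = (timestamp // window_size) * window_size
        windows[window_start][key] += 1
    # Emit results in window order, as an engine would on window close.
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(1, "click"), (2, "view"), (6, "click"), (7, "click")]
print(tumbling_window_counts(events, window_size=5))
# {0: {'click': 1, 'view': 1}, 5: {'click': 2}}
```

A real engine adds what this sketch omits: distributed execution, fault-tolerant state (the memory management Dr. Rabl discussed), and event-time handling for out-of-order data.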
Mr. Alexandrov talked about Emma, a novel domain-specific language, which aims to bring to big data the successes that relational database management systems (RDBMS) and SQL have delivered over the past forty years. That is, transitioning from "data querying via RDBMS and a declarative language (SQL)" to "conducting analysis via distributed collections and parallel dataflow engines." As of today, such a declarative language that eases analysis does not yet exist in the big data stack; Emma, however, represents a promising starting point.
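Emma itself is embedded in Scala, but the imperative-versus-declarative contrast it targets can be illustrated with a small Python example (purely illustrative, not Emma syntax):

```python
from collections import Counter

words = ["big", "data", "big", "analytics"]

# Imperative style: the programmer fixes the evaluation order explicitly,
# leaving a dataflow engine little room to reorganize the computation.
counts_imperative = {}
for w in words:
    counts_imperative[w] = counts_imperative.get(w, 0) + 1

# Declarative style: state *what* to compute over the collection and let
# the runtime decide *how* -- the property a language like Emma exploits
# to compile collection programs down to parallel dataflow engines.
counts_declarative = Counter(words)

assert counts_imperative == dict(counts_declarative)
```

The point of the declarative form is the same as SQL's: because the program describes the result rather than the loop, an optimizer is free to partition, parallelize, and reorder the work.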
In his presentation, Prof. Conrad discussed a big data application, namely, the "Human Body as a Source of Big Data." The focus was on the prediction of diseases via an analysis of the blood, which would initially yield a fingerprint as a short-term goal. In the long term, a decision support system for medical diagnostics would be developed that is capable of handling as much data as is available. This would require methods to help identify typical diseases in (blood) fingerprints and to find such fingerprints in a patient's blood data. Conducting such an analysis is challenging due to the size of the datasets, data noise, and the need for novel feature selection methods.
In his talk on "Our health through big data: Six views," Prof. Hemingway discussed the importance of mining healthcare data, for example, to enable data-driven precision medicine, drug development, and disease prevention, all themes par excellence on matters of health, disease, and healthcare systems. He illustrated the types of data that researchers and medical doctors need to enable further medical advances, such as personal health data (e.g., socioeconomic and lifestyle factors) and social network data (e.g., physical environment). Furthermore, he expressed that no country alone could realize precision medicine; instead, we need international cooperation to enable health record big data.
Big Data Management
In his presentation, Prof. Abedjan took a closer look at data variety, predominantly the sources of data arising in big data applications (e.g., e-commerce, sensors, social media), which are very large and highly diverse. He emphasized a finding from a 2016 data science publication: data scientists spend 80% of their time on data preparation, of which 60% is attributed to data cleaning and data organization. Moreover, data heterogeneity impedes the seamless integration of different sources, thereby demanding considerable effort in the data analysis workflow. Furthermore, he offered an overview of current research in reducing large datasets to their relevant core information, using machine learning, data mining (e.g., summarization), or other techniques (e.g., sketching).
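To make the "80% data preparation" point concrete, here is a minimal sketch of the kind of routine cleaning work involved. The helper and field names are hypothetical, not a BBDC tool:

```python
def clean_records(records):
    """Toy illustration of routine data preparation: normalize string
    fields, drop records with a missing key field, and de-duplicate
    entries coming from heterogeneous sources."""
    seen, cleaned = set(), []
    for rec in records:
        name = (rec.get("name") or "").strip().lower()
        if not name:
            continue  # drop records missing the key field
        if name in seen:
            continue  # drop duplicates after normalization
        seen.add(name)
        cleaned.append({"name": name})
    return cleaned

raw = [{"name": " Berlin "}, {"name": "berlin"}, {"name": ""}, {"name": "London"}]
print(clean_records(raw))  # [{'name': 'berlin'}, {'name': 'london'}]
```

Even this trivial pipeline encodes several judgment calls (how to normalize, what counts as a duplicate), which is exactly why data preparation dominates real-world analysis time.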
In his presentation, Dr. Pietzuch talked about the varying challenges faced when conducting large-scale analysis using big data systems, particularly performance and usability. For example, he spoke about some approaches that are used to increase performance, such as the exploitation of parallel hardware (e.g., GPUs). In this context, he introduced the SABER big data engine, which is based on the idea of a query-independent data parallelization and execution model that permits task execution on all heterogeneous processors (e.g., CPUs, GPUs). Concerning usability, he emphasized the importance of employing the proper programming abstractions, for example, applying distributed stateful dataflow graphs (SDGs) as an approach to enable imperative big data programming for the masses.
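The query-independent scheduling idea can be sketched with a toy dispatcher: every task may run on any processor, and work simply goes to whichever processor frees up first. This is a loose illustration in the spirit of SABER, not its actual algorithm; all names and numbers are made up:

```python
import heapq

def dispatch(tasks, processors):
    """Toy heterogeneous scheduler: tasks are (name, cost) pairs that can
    run on any processor; processors map name -> relative speed. Each task
    is assigned to the processor that becomes free earliest."""
    heap = [(0.0, name) for name in processors]  # (free_at, processor)
    heapq.heapify(heap)
    assignment = []
    for task, cost in tasks:
        free_at, proc = heapq.heappop(heap)
        assignment.append((task, proc))
        # A faster processor finishes the same task sooner.
        heapq.heappush(heap, (free_at + cost / processors[proc], proc))
    return assignment

processors = {"CPU": 1.0, "GPU": 4.0}          # relative speeds (illustrative)
tasks = [("q1", 4.0), ("q2", 4.0), ("q3", 4.0)]  # (name, cost)
print(dispatch(tasks, processors))
# [('q1', 'CPU'), ('q2', 'GPU'), ('q3', 'GPU')]
```

Note how the faster GPU naturally absorbs more tasks without any per-query placement logic, which is the appeal of a query-independent model for heterogeneous hardware.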