Big Data Excellence in Germany and UK

Big Data Excellence in Germany and UK, a joint event by the Berlin Big Data Center and UK Science and Innovation Network.

Panel discussion with Klaus-Robert Müller, Sofia Olhede, Sian Thomas, Volker Markl, Jack Thoms, and Patrick Wolfe (from left to right)

This event by the Berlin Big Data Center and the UK Science and Innovation Network was held on March 1st, 2017. The Smart Data Forum (SDF) hosted the event, where an audience of about 70 people heard about the UK’s big data landscape and the Berlin Big Data Center project.

During three sessions, 11 speakers covered more theoretical topics such as data analytics, big data management, and machine learning, as well as application areas such as e-health and food watch. They also discussed data security, data governance, and how these important issues are handled in the UK and Germany, which attracted a lot of attention.

There was also a discussion about how the UK and Germany are building up national networks for interdisciplinary collaboration and how researchers are collaborating with government and industry in practice.

09:30 – 09:45 Arrival and Coffee

09:45 – 10:00 Short Welcome
BBDC / UK Science & Innovation Network

10:00 – 11:30 Presentation of UK Landscape and BBDC
Sofia Olhede (UCL Big Data Institute), Volker Markl (BBDC)

11:30 – 12:00 BBDC Success Stories / Data Streaming and Programmability
Alexander Alexandrov (TU Berlin), Tilmann Rabl (BBDC)

12:00 – 13:00 E-Health
Tim Conrad (FU Berlin / BBDC), Harry Hemingway (UCL / Farr Institute London)

13:00 – 14:00 Lunch

14:00 – 15:00 Big Data Management
Ziawasch Abedjan (TU Berlin / BBDC), Peter Pietzuch (Imperial College London)

15:00 – 16:00 Machine Learning
Klaus-Robert Müller (TU Berlin / BBDC), Patrick Wolfe (UCL Big Data Institute)

16:00 – 16:30 Coffee

16:30 – 16:45 Using Data for Policy in the UK
Sian Thomas (Food Standards Agency UK)

16:45 – 18:00 Open Discussion and Q&A
Panel: Sofia Olhede, Patrick Wolfe, Sian Thomas, Volker Markl

18:00 – 20:00 Demos, Drinks and Networking

Reflecting back on the event

Alongside the BBDC PIs, researchers from the UCL Big Data Institute, Imperial College London, and the Farr Institute London were among the speakers.

Presentation of UK Landscape and BBDC

  • Prof. Dr. Sofia Olhede

    Scientific Director, University College London (UCL Big Data Institute).

    Prof. Olhede gave an overview of the big data landscape from varying perspectives, including the UK and abroad. She spoke about groundbreaking advances, such as new big data paradigms, prediction vs. estimation, and the current and future importance of decision-making. Additionally, she discussed ethical issues, such as user privacy and the transparency of data use. She briefly described some data governance services, as well as some big data research initiatives around the world.

  • Prof. Dr. Volker Markl

    Director, Berlin Big Data Center, Technische Universität Berlin and Research Department Head, German Research Center for Artificial Intelligence (DFKI).

    Prof. Markl’s presentation offered an overview of the Berlin Big Data Center (BBDC), including its research, innovation, and education-oriented activities. He talked about one big challenge, namely, building a big data analytics system that allows for the automatic translation, execution, and optimization of data analysis programs for any underlying architecture, while coping with various data distributions and workloads. The development of such a system would empower data scientists to focus on the core analysis problem, as opposed to worrying about ensuring satisfactory runtimes and scalability by low-level systems programming. From a broader perspective, the vision encompasses a mosaic of theories, systems, and hardware technologies.

BBDC Success Stories / Data Streaming and Programmability

  • Dr. Tilmann Rabl

    Scientific Coordinator, Berlin Big Data Center, Technische Universität (TU) Berlin.

    Dr. Rabl emphasized a core BBDC technology, namely, Apache Flink. He described the basic architecture, its stream processing capabilities, memory management, and a range of applications. He compared Apache Flink with other streaming engines and demonstrated how Flink fits into the BBDC technology stack.

  • Mr. Alexander Alexandrov

    Doctoral student, Database and Information Management Group, Technische Universität (TU) Berlin.

    Mr. Alexandrov talked about Emma, a novel domain-specific language that aims to bring to big data the successes achieved by relational database management systems (RDBMS) and SQL over the past forty years. The goal is to move from “data querying via RDBMS and a declarative language (SQL)” to “conducting analysis via distributed collections and parallel dataflow engines.” As of today, a declarative language that eases analysis in this way does not yet exist in the big data stack. However, Emma represents a promising starting point.

E-Health

  • Prof. Dr. Tim Conrad

    Medical Bioinformatics Group, Freie Universität Berlin.

    In his presentation, Prof. Conrad discussed a big data application, namely, the “Human Body as a Source of Big Data.” The focus was on predicting diseases through an analysis of the blood; the short-term goal is to derive a fingerprint from this analysis. In the long term, a decision support system for medical diagnostics would be developed that is capable of handling as much data as is available. This would require methods that help identify typical diseases in (blood) fingerprints and find fingerprints in a patient’s blood data. Conducting such an analysis is challenging due to the size of the datasets, data noise, and the need for novel feature selection methods.

  • Professor Harry Hemingway

    Director, Institute of Health Informatics, University College London; Director, Farr Institute of Health Informatics Research; Director, Healthcare Informatics, Genomics and Data Science, NIHR UCL UCLH Biomedical Research Centre.

    In his talk on “Our health through big data: Six views,” Prof. Hemingway discussed the importance of mining healthcare data, for example to enable data-driven precision medicine, drug development, and disease prevention, all central themes in matters of health, disease, and healthcare systems. He illustrated the types of data that researchers and medical doctors need to enable further medical advances, such as personal health data (e.g., socioeconomic and lifestyle factors) and social network data (e.g., physical environment). Furthermore, he argued that no country could realize precision medicine alone and that international cooperation is needed to enable big data from health records.

Big Data Management

  • Prof. Dr. Ziawasch Abedjan

    Assistant Professor, Big Data Management Group, Technische Universität (TU) Berlin.

    In his presentation, Prof. Abedjan took a closer look at data variety, focusing on the sources of data arising in big data applications (e.g., e-commerce, sensors, social media), which are very large and highly diverse. He emphasized a finding from a 2016 data science publication: data scientists spend 80% of their time on data preparation, of which 60% is attributed to data cleaning and data organization. Moreover, data heterogeneity impedes the seamless integration of different sources, thereby demanding considerable effort in the data analysis workflow. Furthermore, he offered an overview of current research on reducing large datasets to their relevant core information using machine learning, data mining (e.g., summarization), or other techniques (e.g., sketching).

  • Dr. Peter Pietzuch

    Head, Large-Scale Distributed Systems Group, Imperial College London.

    In his presentation, Dr. Pietzuch talked about the various challenges faced when conducting large-scale analysis using big data systems, in particular performance and usability. On the performance side, he spoke about approaches such as the exploitation of parallel hardware (e.g., GPUs). In this context, he introduced the SABER Big Data Engine, which is based on the idea of a query-independent data parallelization and execution model that permits task execution on all heterogeneous processors (e.g., CPUs, GPUs). Concerning usability, he emphasized the importance of employing the proper programming abstractions, for example, applying distributed stateful dataflow graphs (SDGs) as an approach to enable imperative big data programming for the masses.
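As a small illustration of the sketching techniques mentioned in Prof. Abedjan's talk, the following is a minimal Count-Min sketch in Python. It is not taken from any of the presentations; the class name, the default `width` and `depth`, and the hashing scheme are assumptions chosen for this example. The idea is to summarize item frequencies of a large stream in a small, fixed-size table, at the cost of possibly overestimating counts.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (hypothetical helper, not from the talks)."""

    def __init__(self, width=1000, depth=5):
        # width = counters per row, depth = number of independent hash rows
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum over all rows
        # is an upper bound on the true count that is never below it.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch(width=64, depth=4)
    for source in ["sensor", "e-commerce", "sensor", "social media", "sensor"]:
        sketch.add(source)
    print(sketch.estimate("sensor"))
```

The memory used depends only on `width` and `depth`, not on the number of distinct items, which is why summaries of this kind are attractive for very large and diverse datasets.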