About Data Scientists and "The New Oil"

Data Scientist - Bridging the Talent Gap

Interaction of application, scalable data management and machine learning.

According to the Harvard Business Review, data scientist is “The Sexiest Job of the 21st Century”. Data scientists are often considered to be wizards who deliver value from big data. These wizards need knowledge in three very distinct subject areas: scalable data management, data analysis, and domain expertise. However, it is a challenge to find jacks-of-all-trades who cover all three areas. Or, as the Wall Street Journal puts it, “Big Data’s Problem is Little Talent”. Finding talented data scientists is therefore a prerequisite if we are to put big data to good use. If data analysis were specified using a declarative language, data scientists would no longer have to worry about low-level programming. Instead, they would be free to concentrate on their data analysis problem. The goal of the Berlin Big Data Center (BBDC) is to help bridge the big data talent gap by researching and developing novel technology. Our starting point is the Apache Flink system. We aim to enable deep analytics of huge, heterogeneous data sets with low latency by developing advanced, scalable data analysis and machine learning methods. Our goal is to specify these methods in a declarative way and to optimize and parallelize them automatically, so that data scientists can focus on the analysis problem at hand and are relieved of the need to be system programmers.

Read more about it in the article accompanying the VLDB keynote "Breaking the Chains: On Declarative Data Analysis and Data Independence in the Big Data Era" by Prof. Markl.
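As a small illustration of the declarative idea described above, the following sketch computes the same per-group average twice: once imperatively, spelling out every step, and once as a declarative SQL query that leaves the execution strategy to the engine. It is only a toy under stated assumptions: the sample readings and function names are hypothetical, and it uses Python's built-in sqlite3 module as a stand-in for a declarative engine. It is not Apache Flink's API and not the BBDC's method.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (sensor, reading) pairs.
readings = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0), ("c", 5.0)]


def mean_per_sensor_imperative(rows):
    """Imperative style: the analyst spells out how to iterate and accumulate."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sensor, value in rows:
        totals[sensor] += value
        counts[sensor] += 1
    return {sensor: totals[sensor] / counts[sensor] for sensor in totals}


def mean_per_sensor_declarative(rows):
    """Declarative style: the query states what is wanted; the engine picks the plan."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
        conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


imperative_result = mean_per_sensor_imperative(readings)
declarative_result = mean_per_sensor_declarative(readings)
```

Both functions return the same mapping from sensor to mean reading. The difference is in who decides how the work is done: in the second version the analyst only states the desired result, which is the property that lets a system optimize and parallelize the computation on the analyst's behalf.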

Big Data is a Big Deal!

One often hears that “data is the new oil”. Like oil, data is a complex product derived from numerous processing and refinement steps, and the analogy carries over to the big data realm. The drilling stations are, for example, information extraction and integration methods, which extract and enrich semantics from crude data. The refineries are data analysis and mining algorithms, systems, and tools, which cluster, group, and characterize the data in new ways in order to derive insight and actionable information. We already see an entire economy of distribution networks emerging around big data, with information marketplaces that sell transformed, semantically enriched, and further augmented forms of data. Transport and logistics companies are starting to use big data solutions for vehicle tracking and fleet management, Industrie 4.0 uses big data analytics to enable smart manufacturing, and big data applications are starting to emerge in healthcare as well. Big data will not only accelerate but also change many scientific processes, and it will have a profound impact on business, science, and society as a whole.

Five Dimensions of Big Data

Big data is often defined as any data set that cannot be handled using today’s widely available mainstream techniques and technologies. The challenges of handling big data are often described using the three Vs (volume, variety, and velocity): a high volume of data from a variety of sources arrives with high velocity and must be analyzed to achieve an economic benefit. However, the three Vs fail to reflect the complexity of big data in its entirety. From a technical perspective, the real complexity stems from the fact that complex predictive and prescriptive analytic methods need to be applied to huge, heterogeneous data sets. Moreover, “Big Data” (often also called “Smart Data”) has a much wider scope, with challenges and opportunities in five dimensions: technology, application, economic, legal, and social.

Technology: There is a need for scalable systems and platforms for data analysis, novel data analysis methods, and in particular technologies to help overcome the skills gap (e.g., enabling data analysis methods to be accessible to a wider audience). 

Application: Many novel applications are emerging in the information economy, such as information marketplaces, which refine and sell enriched data. These information marketplaces are effectively bootstrapping the information economy. Other examples include personalized medicine, Industry 4.0, and digital humanities.

Economic: The challenges and opportunities in the economic dimension lie in new business models and content delivery paradigm shifts (e.g., information pricing and the role of open-source software).

Legal: From a legal perspective, big data will present many challenges with respect to ownership, liability, and insolvency, in addition to prevalent issues, such as privacy and security.

Social: Lastly, data-driven innovation will have a profound impact on society as a whole with respect to social interaction, news, and democratic processes, among others.

The focus of the BBDC is primarily on addressing challenges in the technology dimension by researching and developing novel scalable data management systems that can process advanced data analytics methods, with the goal of demonstrating the results in selected applications. However, by exploring and validating use cases with our partners, we also contribute ideas and solutions for challenges in the other dimensions.