Scalable Data Analysis

Today’s programming languages are ill-suited for big data analysis because they: (a) do not allow for the specification of iterative algorithms, (b) do not support ordered and unordered data in a single paradigm, and (c) are not truly declarative and thus not automatically optimizable and scalable. To address these limitations, the BBDC will investigate the following research questions:

  • How can a declarative specification be obtained even when dealing with iterative algorithms, state, and ordered collections?
  • What are the mathematical and algebraic constructs necessary to achieve this?
  • What equivalence rules and automatic optimization opportunities arise through declarative specification?
  • How can fault tolerance be specified declaratively?
  • How can consistency and fault-tolerance requirements for a declarative specification of an iterative data analysis algorithm be derived automatically?
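
To make the idea of equivalence rules concrete, the following is a minimal sketch of one well-known rewrite, map fusion: applying map(f) and then map(g) gives the same result as a single map of g composed with f. The plan representation and the names run and fuse_maps are hypothetical and chosen only for illustration; they do not describe any BBDC system.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical plan representation: (operator name, function) pairs,
# applied left to right. Only "map" and "filter" are supported here.
Plan = List[Tuple[str, Callable]]


def run(plan: Plan, data: Iterable) -> list:
    """Naively execute a plan, one operator at a time."""
    out = list(data)
    for op, fn in plan:
        if op == "map":
            out = [fn(x) for x in out]
        elif op == "filter":
            out = [x for x in out if fn(x)]
        else:
            raise ValueError(f"unknown operator: {op}")
    return out


def fuse_maps(plan: Plan) -> Plan:
    """Rewrite rule: map(f) followed by map(g) is equivalent to map(g . f)."""
    fused: Plan = []
    for op, fn in plan:
        if op == "map" and fused and fused[-1][0] == "map":
            prev = fused.pop()[1]
            # Bind prev and fn as defaults so each lambda keeps its own pair.
            fused.append(("map", lambda x, f=prev, g=fn: g(f(x))))
        else:
            fused.append((op, fn))
    return fused


plan = [
    ("map", lambda x: x + 1),
    ("map", lambda x: x * 2),
    ("filter", lambda x: x % 4 == 0),
]
optimized = fuse_maps(plan)

data = range(6)
original_result = run(plan, data)
optimized_result = run(optimized, data)
```

Because the plan states what is computed rather than how, the rewrite can be applied automatically: the optimized plan has one operator fewer and avoids materializing an intermediate collection, while producing the same result.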

To facilitate the analysis of large volumes of heterogeneous data with complex machine learning, image, video, and text analysis methods, we need to extend existing parallel programming models with additional concepts, such as ordered collections, multi-dimensionality, and access to distributed state within and between the execution steps of iterative algorithms. Specification options for different degrees of fault tolerance and consistency in distributed execution are also required. In this way, we aim to create a foundation for Big Data Analytics systems analogous to what relational algebra provides for database systems.
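
As an illustration of state that is carried between iteration steps, the following is a minimal single-machine sketch in which an iterative algorithm is expressed as a pure step function plus a fixpoint operator. The example computes connected components by label propagation; the names iterate and propagate and the sample edges are assumptions made for this sketch, and a real system would distribute both the state and the steps.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[int, int]


def iterate(step: Callable[[State], State], state: State, max_steps: int = 100) -> State:
    """Apply `step` repeatedly until the state stops changing (a fixpoint)."""
    for _ in range(max_steps):
        new_state = step(state)
        if new_state == state:
            return new_state
        state = new_state
    return state


# Hypothetical sample graph: two components, {1, 2, 3} and {4, 5}.
edges: List[Tuple[int, int]] = [(1, 2), (2, 3), (4, 5)]


def propagate(labels: State) -> State:
    """One step: every vertex adopts the smallest label seen across its edges.

    The step reads only the previous state and returns a new one, so the
    loop itself is left to `iterate` rather than written out by the user.
    """
    new = dict(labels)
    for u, v in edges:
        low = min(labels[u], labels[v])
        new[u] = min(new[u], low)
        new[v] = min(new[v], low)
    return new


# Initial state: every vertex is labeled with its own identifier.
labels = {v: v for edge in edges for v in edge}
components = iterate(propagate, labels)
```

Separating the step function from the loop is one possible design: since the step does not mutate the previous state, a runtime is in principle free to decide how to partition the state, when to checkpoint it, and how to recover a lost step.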