References
Chen Xu, Rudi Poepsel Lemaitre, Juan Soto, Volker Markl
Fault-Tolerance for Distributed Iterative Dataflows in Action
PVLDB 11(12): 1990-1993
2018

Abstract: Distributed dataflow systems (DDS) are widely employed in graph processing and machine learning (ML), where many of the algorithms are iterative in nature. Typically, DDS achieve fault tolerance using checkpointing mechanisms, or they exploit algorithmic properties to enable fault tolerance without the need for checkpoints. Recently, for graph processing, we proposed unblocking checkpointing, which parallelizes the execution pipeline and checkpoint writing, as well as confined recovery, which enables fast recovery upon partial node failures. Furthermore, for ML algorithms implemented using broadcast variables, we proposed replica recovery, which leverages broadcast variable replicas to enable checkpoint-free failure recovery. In this demonstration, we showcase these fault-tolerance techniques using Apache Flink. Attendees will be able to: (i) run representative iterative algorithms, including PageRank, Connected Components, and K-Means, (ii) explore the internal behavior of DDS under unblocking checkpointing, and (iii) trigger failures to observe the effects of confined recovery and replica recovery.
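
The unblocking-checkpointing idea can be pictured in a few lines of Python: snapshot the state synchronously, then persist it on a background thread so the iteration pipeline and the checkpoint write overlap. This is a minimal sketch with hypothetical names, not Flink's actual runtime mechanism.

```python
# Minimal sketch of unblocking checkpointing (hypothetical names;
# Flink's real mechanism is implemented inside the runtime).
import copy
import threading

def checkpoint_async(state, superstep, write_fn):
    """Snapshot the state cheaply, then persist it in the background
    so pipeline execution and checkpoint writing run in parallel."""
    snapshot = copy.deepcopy(state)  # fast relative to durable I/O
    writer = threading.Thread(target=write_fn, args=(superstep, snapshot))
    writer.start()
    return writer  # join() only if recovery must wait for this checkpoint
```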

Jonas Traub, Philipp Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing
34th IEEE International Conference on Data Engineering (ICDE), pages 1300-1303
2018

Abstract: Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates among windows. However, these techniques do not support out-of-order processing and session windows. Out-of-order processing is a key requirement to deal with delayed tuples in case of source failures such as temporary sensor outages. Session windows are widely used to separate different periods of user activity from each other. In this paper, we present Scotty, a high-throughput operator for window discretization and aggregation. Scotty splits streams into non-overlapping slices and computes partial aggregates per slice. These partial aggregates are shared among all concurrent queries with arbitrary combinations of tumbling, sliding, and session windows. Scotty introduces the first slicing technique which (1) enables stream slicing for session windows in addition to tumbling and sliding windows and (2) processes out-of-order tuples efficiently. Our technique is generally applicable to a broad group of dataflow systems which use a unified batch and stream processing model. Our experiments show that we achieve a throughput an order of magnitude higher than alternative state-of-the-art solutions.
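
The slicing idea behind Scotty can be sketched in a few lines of Python. This is an illustrative toy with an assumed slice size; it omits the session windows and the out-of-order handling that are the paper's actual contributions.

```python
# Toy sketch of stream slicing: tuples go into non-overlapping slices,
# and overlapping windows share the per-slice partial aggregates.
from collections import defaultdict

SLICE_SIZE = 10  # seconds (assumed slice granularity)
slices = defaultdict(int)  # slice start time -> partial SUM

def on_tuple(timestamp, value):
    """Each tuple updates exactly one slice, even if it arrives late."""
    slices[timestamp - timestamp % SLICE_SIZE] += value

def window_result(start, end):
    """Answer a window query by combining the covered slices."""
    return sum(v for s, v in slices.items() if start <= s < end)

for ts, val in [(3, 1), (12, 2), (25, 4), (31, 8)]:
    on_tuple(ts, val)
print(window_result(0, 30))   # first sliding window -> 7
print(window_result(10, 40))  # next window -> 14 (slices are reused)
```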

Quoc-Cuong To, Juan Soto, Volker Markl
A survey of state management in big data processing systems
VLDB J. 27(6): 847-872
2018

Abstract: The concept of state and its applications vary widely across big data processing systems. This is evident in both the research literature and existing systems, such as Apache Flink, Apache Heron, Apache Samza, Apache Spark, and Apache Storm. Given the pivotal role that state management plays, particularly, for iterative batch and stream processing, in this survey, we present examples of state as an enabler, discuss the alternative approaches used to handle and implement state, capture the many facets of state management, and highlight new research directions. Our aim is to provide insight into disparate state management techniques, motivate others to pursue research in this area, and draw attention to open problems.

Niklas Stoehr, Johannes Meyer, Volker Markl, Qiushi Bai, Taewoo Kim, De-Yu Chen, Chen Li
Heatflip: Temporal-Spatial Sampling for Progressive Heat Maps on Social Media Data
2018 IEEE International Conference on Big Data (IEEE BigData), pages 3723-3732
2018

Abstract: Keyword-based heat maps are a natural way to explore and analyze the spatial properties of social media data. In large datasets there may be many different keywords, which makes offline pre-computation very hard. Interactive frameworks that exploit database sampling can address this challenge. We present a novel middleware technique called Heatflip, which issues diametrically opposed samples into the temporal and spatial dimensions of the data stored in an external database. Spatial samples provide insights into the temporal distribution and vice versa. The progressive exploration approach benefits from adaptive indexing and combines the retrieval and visualization of the data in a middleware layer. Without any a priori knowledge of the underlying data, the middleware can generate accurate heat maps with 85% shorter processing times than conventional systems. In this paper, we discuss the analytical background of Heatflip, showcase its scalability, and validate its performance when visualizing large amounts of social media data.
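
The sampling idea can be sketched as follows; `sample_records` is a hypothetical stand-in for a sample query against the external database, and the grid granularity is assumed.

```python
# Illustrative sketch of building a progressive heat map from samples
# (hypothetical API; not the Heatflip middleware itself).
import random

def sample_records(db, keyword, n):
    """Stand-in for a database-side sample over the matching records."""
    matching = [r for r in db if keyword in r["text"]]
    return random.sample(matching, min(n, len(matching)))

def progressive_heatmap(db, keyword, n=1000, cell=1.0):
    """Aggregate a sample into grid cells; larger samples refine the
    heat map without any offline pre-computation."""
    counts = {}
    for r in sample_records(db, keyword, n):
        key = (int(r["lat"] // cell), int(r["lon"] // cell))
        counts[key] = counts.get(key, 0) + 1
    return counts
```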

Felix Seibert, Mathias Peters, Florian Schintke
Improving I/O Performance Through Colocating Interrelated Input Data and Near-Optimal Load Balancing
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW); Fourth IEEE International Workshop on High Performance Big Data, Deep Learning and Cloud Computing (HPBDC), pages 448-457
2018

Note: Best paper award

Thomas Renner, Johannes Müller, Odej Kao
Endolith: A Blockchain-Based Framework to Enhance Data Retention in Cloud Storages
26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), Cambridge, pages 627-634
2018

Abstract: Blockchains like Bitcoin and Ethereum have seen significant adoption in the past few years and show promise for designing applications without any centralized reliance on third parties. In this paper, we present Endolith, an auditing framework for verifying file integrity and tracking file history without third-party reliance, using a smart-contract-based blockchain. Annotated files are continuously monitored, and metadata about changes, including file hashes, is stored tamper-proof on the blockchain. Based on this, Endolith can prove that a file stored a long time ago has not been changed without authorization or, if it has, track when it was changed and by whom. The Endolith implementation is based on Ethereum and the Hadoop Distributed File System (HDFS). Our evaluation on a public blockchain network shows that Endolith is efficient for files that are infrequently modified but often accessed, which are common characteristics of data archives.
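
At its core, the audit compares a freshly computed file hash against the one recorded earlier. A minimal sketch, with a plain dict standing in for the on-chain metadata store that Endolith keeps in a smart contract:

```python
# Hash-based file auditing in miniature (the `ledger` dict is a
# stand-in for Endolith's on-chain record of hashes and timestamps).
import hashlib

def file_hash(path):
    """SHA-256 digest of a file, the tamper-evidence anchor."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

ledger = {}  # path -> list of registered hashes (append-only)

def register(path):
    ledger.setdefault(path, []).append(file_hash(path))

def verify(path):
    """True if the file still matches its latest registered hash."""
    return bool(ledger.get(path)) and file_hash(path) == ledger[path][-1]
```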

Clemens Lutz, Sebastian Breß, Tilmann Rabl, Steffen Zeuch, Volker Markl
Efficient and Scalable k-Means on GPUs
Datenbank-Spektrum 18(3): 157-169
2018

Abstract: k-Means is a versatile clustering algorithm widely used in practice. To cluster large data sets, state-of-the-art implementations use GPUs to shorten the data-to-knowledge time. These implementations commonly assign points on a GPU and update centroids on a CPU. We identify two main shortcomings of this approach. First, it requires expensive data exchange between processors when switching between the two processing steps, point assignment and centroid update. Second, even when processing both steps of k-Means on the same processor, points still need to be read twice within an iteration, leading to inefficient use of memory bandwidth. In this paper, we present a novel approach for the centroid update that allows us to efficiently process both phases of k-Means on GPUs. We fuse point assignment and centroid update to execute one iteration with a single pass over the points. Our evaluation shows that our k-Means approach scales to very large data sets. Overall, we achieve up to 20× higher throughput compared to the state-of-the-art approach.
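
One fused iteration is easy to sketch in NumPy: assignment and centroid update happen in a single pass, so each point is read once. This only illustrates the dataflow; the paper's contribution is executing it efficiently on GPUs.

```python
# Fused k-Means iteration: a single pass over the points
# (illustrative NumPy sketch; the paper targets GPUs).
import numpy as np

def fused_kmeans_iteration(points, centroids):
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=np.int64)
    for p in points:  # each point is read exactly once
        c = int(np.argmin(((centroids - p) ** 2).sum(axis=1)))  # assign
        sums[c] += p                                            # fused update
        counts[c] += 1
    new_centroids = centroids.copy()  # empty clusters keep their position
    nonempty = counts > 0
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_centroids
```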

Jeyhun Karimov, Tilmann Rabl, Volker Markl
PolyBench: The First Benchmark for Polystores
10th TPC Technology Conference on Performance Evaluation and Benchmarking (TPCTC), pages 24-41
2018

Abstract: Modern business intelligence requires data processing not only across a huge variety of domains but also across different paradigms, such as relational, stream, and graph models. This variety is a challenge for existing systems, which typically support only a single or a few different data models. Polystores were proposed as a solution to this challenge and have received wide attention both in academia and in industry. These are systems that integrate different specialized data processing engines to enable fast processing of a large variety of data models. Yet, there is no standard to assess the performance of polystores. The goal of this work is to develop the first benchmark for polystores. To capture the flexibility of polystores, we focus on high-level features in order to enable an execution of our benchmark suite on a large set of polystore solutions.

Lauritz Thamsen, Ilya Verbitskiy, Benjamin Rabier, Odej Kao
Learning Efficient Co-locations for Scheduling Distributed Dataflows in Shared Clusters
Services Transactions on Big Data 4(1). Services Society
2018

Abstract: Resource management systems like YARN or Mesos allow sharing cluster resources by running data-parallel processing jobs in temporarily reserved containers. Containers, in this context, are logical leases of resources as, for instance, a number of cores and main memory, allocated on a particular node. Typically, containers are used without resource isolation to achieve high degrees of overall resource utilization despite the often fluctuating resource usage of single analytic jobs. However, some combinations of jobs utilize the resources better and interfere less with each other when running on the same nodes than others. This paper presents an approach for improving the resource utilization and job throughput when scheduling recurring distributed data-parallel processing jobs in shared cluster environments. Using a reinforcement learning algorithm, the scheduler continuously learns which jobs are best executed simultaneously on the cluster. We evaluated a prototype implementation of our approach with Hadoop YARN, exemplary Flink jobs from different application domains, and a cluster of commodity nodes. Even though the measure we use to assess the goodness of schedules can still be improved, the results of our evaluation show that our approach increases resource utilization and job throughput.
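
The learning loop can be pictured as an epsilon-greedy bandit over job pairings. This toy sketch assumes a scalar reward (e.g., observed throughput); the paper's scheduler and reward signal are more elaborate.

```python
# Epsilon-greedy co-location choice (toy sketch of the learning loop).
import random

EPSILON = 0.1
q = {}  # (job_a, job_b) -> running mean reward
n = {}  # (job_a, job_b) -> number of observations

def choose_colocation(candidate_pairs):
    """Mostly exploit the best-known pairing, sometimes explore."""
    if random.random() < EPSILON or not q:
        return random.choice(candidate_pairs)
    return max(candidate_pairs, key=lambda pair: q.get(pair, 0.0))

def observe(pair, reward):
    """Update the running mean after the co-located jobs finish."""
    n[pair] = n.get(pair, 0) + 1
    q[pair] = q.get(pair, 0.0) + (reward - q.get(pair, 0.0)) / n[pair]
```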

Gerrit Janßen, Ilya Verbitskiy, Thomas Renner, Lauritz Thamsen
Scheduling Stream Processing Tasks on Geo-Distributed Heterogeneous Resources
2018 IEEE International Conference on Big Data (IEEE BigData); presented at the First International Workshop on the Internet of Things Data Analytics (IoTDA)
2018

Abstract: Low-latency processing of data streams from distributed sensors is becoming increasingly important for a growing number of IoT applications. In these environments, sensor data collected at the edge of the network is typically transmitted in a number of hops: from devices to intermediate resources to clusters of cloud resources. Scheduling processing tasks of dataflow jobs on all the resources of these environments can significantly reduce application latencies and network congestion. However, for this, schedulers need to take the heterogeneity of processing resources and network topologies into account. This paper examines multiple methods for scheduling distributed dataflow tasks on geo-distributed, heterogeneous resources. For this, we developed an optimization function that incorporates the latencies, bandwidths, and computational resources of heterogeneous topologies. We evaluated the different placement methods in a virtual geo-distributed and heterogeneous environment with an IoT application. Our results show that metaheuristic methods that take service quality metrics into account can find significantly better placements than methods that only take topologies into account, with latencies reduced by almost 50%.
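
Such an optimization function can be sketched as a network-cost objective over a task-to-node placement, which a metaheuristic (e.g., simulated annealing) then minimizes. The structure and weights below are hypothetical; the paper's formulation differs in detail.

```python
# Hypothetical placement objective over a heterogeneous topology.
def placement_cost(placement, edges, latency, bandwidth, capacity,
                   w_lat=1.0, w_bw=1.0):
    """Score a task -> node placement; lower is better."""
    cost = 0.0
    for (src_task, dst_task), volume in edges.items():  # dataflow edges
        a, b = placement[src_task], placement[dst_task]
        if a != b:  # co-located tasks incur no network cost
            cost += w_lat * latency[(a, b)]
            cost += w_bw * volume / bandwidth[(a, b)]
    load = {}  # reject placements exceeding a node's compute capacity
    for task, node in placement.items():
        load[node] = load.get(node, 0) + 1
    if any(load[node] > capacity[node] for node in load):
        return float("inf")
    return cost
```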

Robert Schmidtke, Florian Schintke, Thorsten Schütt
From Application to Disk: Tracing I/O Through the Big Data Stack
High Performance Computing: ISC High Performance 2018 International Workshops
2018

Sebastian Breß, Bastian Köcher, Henning Funke, Steffen Zeuch, Tilmann Rabl, Volker Markl
Generating custom code for efficient query execution on heterogeneous processors
VLDB J. 27(6): 797-822
2018

Abstract: Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall in order to deliver improved performance. Currently, database engines have to be manually optimized for each processor, which is a costly and error-prone process. In this paper, we propose concepts to adapt to and to exploit the performance enhancements of modern processors automatically. Our core idea is to create processor-specific code variants and to learn a well-performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic- and processor-specific code transformations. Our experimental results show that the performance of code variants may diverge by up to two orders of magnitude. In order to achieve peak performance, we generate custom code for each processor. We show that our approach finds an efficient custom code variant for multi-core CPUs, GPUs, and MICs.
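
Selecting a code variant per processor reduces to an empirical search over the generated variants. A simplified sketch (the paper generates and compiles real variants and learns across queries):

```python
# Simplified per-processor variant selection by measurement.
import time

def timed(run_query, variant):
    start = time.perf_counter()
    run_query(variant)  # execute the query using this code variant
    return time.perf_counter() - start

def pick_variant(variants, run_query, samples=3):
    """Empirically pick the fastest variant on the current processor."""
    best, best_time = None, float("inf")
    for variant in variants:
        t = min(timed(run_query, variant) for _ in range(samples))
        if t < best_time:
            best, best_time = variant, t
    return best
```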

Tobias Behrens, Viktor Rosenfeld, Jonas Traub, Sebastian Breß, Volker Markl
Efficient SIMD Vectorization for Hashing in OpenCL
21st International Conference on Extending Database Technology (EDBT), pages 489-492
2018

Abstract: Hashing is at the core of many efficient database operators such as hash-based joins and aggregations. Vectorization is a technique that uses Single Instruction Multiple Data (SIMD) instructions to process multiple data elements at once. Applying vectorization to hash tables results in promising speedups for build and probe operations. However, vectorization typically requires intrinsics – low-level APIs in which functions map to processor-specific SIMD instructions. Intrinsics are specific to a processor architecture and result in complex and difficult-to-maintain code. OpenCL is a parallel programming framework which provides a higher abstraction level than intrinsics and is portable to different processors. Thus, OpenCL avoids processor dependencies, which results in improved code maintainability. In this paper, we add efficient, vectorized hashing primitives to OpenCL. Our results show that OpenCL-based vectorization is competitive to intrinsics on CPUs but not on Xeon Phi coprocessors.
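
The flavor of data-parallel hashing can be shown with NumPy array operations standing in for SIMD lanes: an illustrative multiply-shift hash with arbitrary constants, not the paper's OpenCL primitives.

```python
# Multiply-shift hashing over a whole batch of keys at once,
# with NumPy's array ops standing in for SIMD lanes (illustrative).
import numpy as np

A = np.uint64(0x9E3779B97F4A7C15)  # odd 64-bit multiplier (arbitrary)
BITS = 10                          # table of 2**10 buckets (assumed)

def hash_batch(keys):
    """Hash many keys in one data-parallel expression."""
    keys = np.asarray(keys, dtype=np.uint64)
    return (keys * A) >> np.uint64(64 - BITS)  # wraps mod 2**64

buckets = hash_batch(np.arange(8))  # eight keys hashed in one call
```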

Ilya Verbitskiy, Lauritz Thamsen, Thomas Renner, Odej Kao
CoBell: Runtime Prediction for Distributed Dataflow Jobs in Shared Clusters
10th IEEE International Conference on Cloud Computing Technology and Science (CloudCom)
2018

Lauritz Thamsen, Thomas Renner, Ilya Verbitskiy, Odej Kao
Adaptive Resource Management for Distributed Data Analytics
In Lucio Grandinetti, Seyedeh Leili Mirtaheri, Reza Shahbazian, Thomas Sterling, Vladimir Voevodin (eds.), Advances in Parallel Computing – Big Data and HPC: Ecosystem and Convergence. IOS Press
2018

Abstract: Increasingly large datasets make scalable and distributed data analytics necessary. Frameworks such as Spark and Flink help users in efficiently utilizing cluster resources for their data analytics jobs. It is, however, usually difficult to anticipate the runtime behavior and resource demands of these distributed data analytics jobs. Yet, many resource management decisions would benefit from such information. Addressing this general problem, this chapter presents our vision of adaptive resource management and reviews recent work in this area. The key idea is that workloads should be monitored for trends, patterns, and recurring jobs. These monitoring statistics should be analyzed and used for a cluster resource management calibrated to the actual workload. In this chapter, we motivate and present the idea of adaptive resource management. We also introduce a general system architecture and we review specific adaptive techniques for data placement, resource allocation, and job scheduling in the context of our architecture.