One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eye-balling random subsets of the data or formulating aggregation queries, to systematic inference of structural information and statistics of a dataset using dedicated profiling tools. In this course, we will discuss the importance of data profiling as part of any data-related use-case, and shed light on the area of data profiling by classifying data profiling tasks and reviewing the state-of-the-art data profiling systems and techniques. In particular, we discuss hard problems in data profiling, such as algorithms for dependency discovery and their application in new data discovery and data analytics system. We conclude with directions for future research in the area of data profiling.
Ziawasch Abedjan is an assistant professor and the head of the "Big Data Management" (BigDaMa) Group at the TU Berlin in Germany and a Principal Investigator in the Berlin Big Data Center. Prior to that, Ziawasch was a postdoctoral associate at MIT CSAIL where he worked on various data integation topics. He received his PhD from the Hasso Plattner Institute in Potsdam, Germany, where he worked on methods for mining Linked Open Data. His current research focuses on data integration and data profiling. He is the recipient of the 2014 CIKM Best Student Paper Award, the 2015 SIGMOD Best Demonstration Award, and the 2014 Best Dissertation Award from the University of Potsdam.