Talk "Open Data Integration"
Open Data plays a major role in open government initiatives. Governments around the world are adopting Open Data Principles, promising to make their Open Data complete, primary, and timely. These properties make this data tremendously valuable to data scientists. However, scientists generally do not have a priori knowledge of what data is available (its schema or content), yet they will want to use Open Data and integrate it with other public or private data they are studying. Traditionally, data integration is done using a framework called “query discovery”, where the main task is to discover a query (or transformation script) that transforms data from one form into another. The goal is to find the right operators to join, nest, group, link, and twist data into a desired form. In this talk, I introduce a new paradigm for thinking about Open Data Integration in which the focus is on “data discovery”: highly efficient, internet-scale discovery that is heavily query-aware. As an example, a join-aware discovery algorithm finds datasets, within a massive data lake, that join (in a precise sense of having high containment) with a known dataset. I describe a research agenda and recent progress in developing scalable query-aware data discovery algorithms.
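To make “high containment” concrete, here is a minimal Python sketch. It assumes the usual set-containment definition, the fraction of the known column's distinct values that also appear in a candidate column. The function names, the dictionary-based `lake`, and the threshold value are hypothetical and chosen only for illustration. This is a brute-force scan, not the scalable algorithms the talk describes.

```python
def containment(query_values, candidate_values):
    """Fraction of distinct query values that also occur in the candidate column."""
    query = set(query_values)
    if not query:
        return 0.0
    return len(query & set(candidate_values)) / len(query)


def joinable_columns(query_values, lake, threshold=0.5):
    """Return (column name, containment) pairs at or above the threshold, best first.

    `lake` is a hypothetical stand-in for a data lake: a dict that maps a
    column name to the list of values in that column.
    """
    scored = ((name, containment(query_values, column)) for name, column in lake.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Containment is asymmetric: it is normalized by the size of the known column only, so a very large candidate column is not penalized for its extra values, as it would be under Jaccard similarity.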
Renée J. Miller is a University Distinguished Professor of Computer Science at Northeastern University. She is a Fellow of the Royal Society of Canada, Canada’s National Academy of Science, Engineering and the Humanities. She received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received an NSF CAREER Award, the Ontario Premier’s Research Excellence Award, and an IBM Faculty Award. She formerly held the Bell Canada Chair of Information Systems at the University of Toronto and is a Fellow of the ACM. Her work has focused on the long-standing open problem of data integration and has achieved the goal of building practical data integration systems. She and her co-authors (Fagin, Kolaitis and Popa) received the (10 Year) ICDT Test-of-Time Award for their influential 2003 paper establishing the foundations of data exchange. Professor Miller has led the NSERC Business Intelligence Strategic Network and was elected president of the non-profit Very Large Data Base Foundation. She received her PhD in Computer Science from the University of Wisconsin, Madison and bachelor of science degrees in Mathematics and Cognitive Science from MIT.