
End-to-End Entity Resolution for Structured and Semi-Structured Data

Prof. Themis Palpanas, Senior Member of the French University Institute (IUF), France

11-Apr-2018, 11:30 to 13:00

Location:  DIMA, TU Berlin, E-N 719


Entity Resolution (ER) lies at the core of data integration, with a large body of research focusing on both its effectiveness and its time efficiency. Initially, most relevant works were crafted for structured (relational) data described by a schema of well-known quality and meaning. With the advent of Big Data, though, these early schema-based approaches became inapplicable, as the scope of ER moved to semi-structured data collections, which abound in noisy, voluminous and highly heterogeneous information.

In this talk, we take a close look at the entire ER workflow (from schema matching to entity clustering), covering both the schema-based and schema-agnostic cases. We will highlight recent works that significantly boost the efficiency of the overall workflow, especially meta-blocking, which cuts down on the computational cost by discarding comparisons that are repeated or lack sufficient evidence for producing duplicates. We will conclude with a brief demonstration of JedAI, our open-source reference toolbox for ER, which incorporates most of the state-of-the-art techniques in the area.
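
As a rough illustration of the meta-blocking idea mentioned above, the following Python sketch first builds schema-agnostic blocks (one block per token, regardless of attribute name) and then prunes the resulting comparisons. It is not taken from the talk or from JedAI: the function names, the sample profiles, the choice of "number of shared blocks" as the evidence for a pair, and the mean-weight pruning threshold are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Schema-agnostic blocking: one block per token found in any value.

    Attribute names are ignored, so profiles with different schemas
    can still end up in the same block.
    """
    blocks = defaultdict(set)
    for pid, attributes in profiles.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(pid)
    # A block holding a single profile yields no comparisons.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def meta_blocking(blocks):
    """Keep each distinct pair once and drop weakly supported pairs.

    Assumed weighting: a pair's weight is the number of blocks it shares.
    Assumed pruning rule: keep pairs whose weight is at least the mean.
    """
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            # A pair repeated across blocks collapses into one weighted edge.
            weights[(a, b)] += 1
    if not weights:
        return set()
    mean_weight = sum(weights.values()) / len(weights)
    return {pair for pair, w in weights.items() if w >= mean_weight}


# Hypothetical profiles with heterogeneous attribute names.
profiles = {
    "p1": {"name": "John Smith", "city": "Berlin"},
    "p2": {"fullName": "John A. Smith", "location": "Berlin"},
    "p3": {"name": "Mary Jones", "city": "Berlin"},
    "p4": {"title": "Mary Jones"},
}

blocks = token_blocking(profiles)
candidates = meta_blocking(blocks)
print(sorted(candidates))
```

In this toy input the blocks contain seven comparisons in total but only four distinct pairs. The two pairs that share nothing except the common token "berlin" fall below the mean weight and are discarded, leaving only the two plausible matches to be compared in detail.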

Short Bio:

Themis Palpanas is a Senior Member of the Institut Universitaire de France (IUF), a distinction that recognizes excellence across all academic disciplines, and a professor of computer science at the Paris Descartes University (France), where he is director of diNo, the data management group. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. He has previously held positions at the University of Trento and at the IBM T.J. Watson Research Center, and has visited Microsoft Research and the IBM Almaden Research Center.

His interests include problems related to data science (big data analytics and machine learning applications). He is the author of nine US patents, three of which have been implemented in world-leading commercial data management products. He is the recipient of three Best Paper awards, and the IBM Shared University Research (SUR) Award.

He is currently serving on the VLDB Endowment Board of Trustees, as an Editor in Chief for the BDR Journal, as Associate Editor for VLDB 2019, as Associate Editor for the TKDE and IDA journals, on the Editorial Advisory Board of the IS journal, and on the Editorial Board of the TLDKS Journal. He has served as General Chair for VLDB 2013, Associate Editor for VLDB 2017, Workshop Chair for EDBT 2016, ADBIS 2013, and ADBIS 2014, General Chair for the PDA@IOT International Workshop (in conjunction with VLDB 2014), and General Chair for the Event Processing Symposium 2009.