In this seminar, we describe our approach to querying genomic datasets. The talk is divided into two parts: first we present our approach and vision, and then we focus on the technology that is currently deployed.
In the first part, we define our approach to genomic data management, focusing on tertiary data management, i.e. the need to integrate region-based information describing heterogeneous experimental datasets in order to support biological and clinical discoveries. We introduce the GenoMetric Query Language (GMQL), a high-level algebraic language for manipulating genomic datasets consisting of regions and metadata. We also explain our plans for building an integrated repository of open data and for supporting ontological search on metadata and pattern-based search on regions, thereby moving beyond the current state of the art.
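To make the "regions and metadata" view concrete, the following is a minimal Python sketch of a dataset as a list of samples, each pairing genomic regions with attribute-value metadata, together with a metadata-based selection. The class names, field names, and the select_by_metadata helper are hypothetical and chosen only for illustration; they are not GMQL syntax or part of the GMQL implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """A genomic region: coordinates plus experiment-specific values."""
    chrom: str
    start: int
    stop: int
    strand: str = "*"
    values: tuple = ()


@dataclass
class Sample:
    """One sample: its regions and the metadata describing it (hypothetical layout)."""
    sample_id: int
    regions: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def select_by_metadata(dataset, **conditions):
    """Keep the samples whose metadata match every attribute/value pair."""
    return [
        s for s in dataset
        if all(s.metadata.get(k) == v for k, v in conditions.items())
    ]


# Two toy samples with invented values, for illustration only.
dataset = [
    Sample(1, [Region("chr1", 100, 200, "+", (0.8,))],
           {"assay": "ChIP-seq", "cell": "K562"}),
    Sample(2, [Region("chr1", 150, 300, "-", (0.3,))],
           {"assay": "RNA-seq", "cell": "K562"}),
]

chipseq = select_by_metadata(dataset, assay="ChIP-seq")
print([s.sample_id for s in chipseq])  # prints [1]
```

In this sketch a selection on metadata returns whole samples, so the regions of a matching sample travel with it.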
In the second part, we describe how GMQL is currently implemented on a cluster of nodes, using Spark and Flink as underlying parallel dataflow engines and SciDB as an underlying scientific database. We illustrate how a query is translated into a DAG representing operations over metadata and regions, and then, taking Spark as an example, how some of these operations are translated into engine code. We then describe the architecture of our framework, the scalable Genomic Data Management System (GDMS), as deployed on a cluster of nodes at Cineca (https://www.cineca.it/).
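As a rough illustration of a query plan expressed as a DAG with separate metadata and region operations, the sketch below orders a small, made-up plan so that every operation runs after the operations it depends on. The operator names and the dependency of the region selection on the metadata selection are assumptions made for this example; the actual GMQL compiler and its Spark translation are not reproduced here. It uses only the Python standard library (graphlib, available from Python 3.9).

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each operation maps to the set of operations it depends on.
plan = {
    "read_metadata": set(),
    "read_regions": set(),
    "select_metadata": {"read_metadata"},
    # Assumption: the region branch needs to know which samples survived
    # the metadata selection.
    "select_regions": {"read_regions", "select_metadata"},
    "store": {"select_metadata", "select_regions"},
}


def execution_order(dag):
    """Return one valid order in which the operations of the DAG can run."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(plan)
print(order)
```

Several orders may be valid for the same plan; on a parallel engine, independent operations such as the two reads could run concurrently.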