The methods used to develop new materials for basic research and new technological applications have recently expanded in a promising direction. In the summer of 2011, the U.S. President announced the "Materials Genome Initiative for Global Competitiveness". It aims at developing urgently needed new materials, for example for better batteries, improved or novel catalysts, and solar cells.
Theoretical and experimental data exist for tens of thousands of materials, and this number doubles every year. So far, these materials databases have essentially been used only to look up the stored data when needed (e.g., in "high-throughput screening"). It is now essential to take the next step and exploit the availability of digital data to identify and understand correlations and hidden trends, to detect potentially interesting outliers, and to uncover previously undetected causalities and special material properties. The intellectual challenge in this area is immense and requires, above all, sound modeling for the identification of physical descriptors.
Materials science data obtained from ab initio multiscale simulations (ab initio = fully based on parameter-free quantum-mechanical calculations of the electronic structure) are extremely heterogeneous (for example, electronic densities and wavefunctions, vibrational spectra, electron-phonon interactions, temperature dependencies, etc.). Correspondingly, the structure of these data is complex, and it is still unclear how exactly the hierarchical levels can be found and combined algorithmically; this may prove to be a major scientific and technical challenge.
These issues will be examined and demonstrated on the concrete problem of searching for better thermoelectric materials. An extensive data set has been, and continues to be, generated in the context of a project funded by the Einstein Foundation (ETERNAL) and stored in the NOMAD (Novel Materials Discovery) repository.
Due to the particular complexity and non-linear dependencies of the different mechanisms that lead to the thermoelectric effect, the problem of finding better thermoelectric materials is a particularly suitable playground for developing and testing novel algorithms that can access different levels of the hierarchy adaptively and still remain highly scalable.
The first goal of the research in this application domain was to find descriptors and features that can be used for materials science problems, i.e., to find materials with a desired property in a huge set of materials data. For this, a model problem was defined, namely the classification of different materials as zinc blende vs. rocksalt structures. This system allowed the development and successful test of a method for the automatic identification of a descriptor, based on a technique widely known from compressed sensing: the "least absolute shrinkage and selection operator" (LASSO). Finally, the determined descriptor was characterized. In particular, the causality between descriptor and property and the extrapolation power of the resulting model were confirmed.
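The feature-selection idea behind LASSO can be sketched in a few lines. The following is a minimal illustration and not the implementation used in the project: it minimizes (1/(2n))·||y − Xw||² + λ·||w||₁ by coordinate descent on toy data. The four candidate features, the target, and the value of `lam` are assumptions made purely for illustration; the features with non-zero weight play the role of the selected descriptor.

```python
from itertools import product


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100, tol=1e-10):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    for _ in range(n_sweeps):
        largest_change = 0.0
        for j in range(p):
            column = [row[j] for row in X]
            norm_sq = sum(v * v for v in column) / n
            if norm_sq == 0.0:
                continue  # an all-zero feature carries no information
            # Correlation of feature j with the residual, with the
            # current contribution of w[j] added back in.
            rho = sum(c * r for c, r in zip(column, residual)) / n + norm_sq * w[j]
            new_wj = soft_threshold(rho, lam) / norm_sq
            delta = new_wj - w[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, column)]
                w[j] = new_wj
            largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return w


# Toy data (hypothetical): four candidate features on a +/-1 factorial design;
# the target depends only on features 0 and 2.
X = [list(row) for row in product((-1.0, 1.0), repeat=4)]
y = [2.0 * row[0] - 1.5 * row[2] for row in X]

weights = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(selected, [round(wj, 3) for wj in weights])
# prints [0, 2] [1.9, 0.0, -1.4, 0.0]
```

The L1 penalty sets the weights of the two irrelevant features exactly to zero and shrinks the two relevant ones by `lam`. This sparsity is what makes the approach usable for picking a small descriptor out of many candidates.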
NOMAD META INFO
The metadata structure defined for the NOMAD Laboratory (called NOMAD Meta Info) provides a conceptual model for storing the values connected to atomistic and ab initio calculations (THE NOMAD LABORATORY, A European Centre of Excellence).
In parallel, the NOMAD Archive was built up. It presently contains millions of data points consisting of code-independent results of electronic-structure calculations, made possible by the metadata infrastructure and a conversion layer that builds on it.
This is the perfect basis for testing BBDC-related methods on actual materials big data, as opposed to the relatively small test sets used previously.
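The idea of a conversion layer that turns code-specific output into code-independent records can be pictured with a small sketch. Everything named here is hypothetical: the code names (`code_a`, `code_b`), the native keys, and the shared names such as `total_energy` are invented for illustration and are not taken from the NOMAD Meta Info or the NOMAD Archive. Only the unit conversion factors are physical constants.

```python
HARTREE_IN_JOULE = 4.3597447222071e-18  # CODATA 2018 value
EV_IN_JOULE = 1.602176634e-19           # exact by SI definition

# Hypothetical per-code mappings: native key -> (shared name, factor to SI units).
KEY_MAPS = {
    "code_a": {
        "etot_ha": ("total_energy", HARTREE_IN_JOULE),
        "natoms": ("number_of_atoms", 1),
    },
    "code_b": {
        "E_total_eV": ("total_energy", EV_IN_JOULE),
        "n_at": ("number_of_atoms", 1),
    },
}


def normalize(code_name, raw_output):
    """Translate one code's raw output into a record with shared names and SI units."""
    key_map = KEY_MAPS[code_name]
    record = {"source_code": code_name}
    for key, value in raw_output.items():
        if key in key_map:  # keys without a shared meaning are skipped
            shared_name, factor = key_map[key]
            record[shared_name] = value * factor
    return record


# The same physical result reported by two (hypothetical) codes in different units.
record_a = normalize("code_a", {"etot_ha": -1.0, "natoms": 2})
record_b = normalize("code_b", {"E_total_eV": -27.211386245988, "n_at": 2, "debug": "x"})
print(record_a)
print(record_b)
```

After normalization, both records use the same keys and units, so analysis methods can run on them without knowing which code produced the data.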