Friday, October 30, 2020

Identifying domains of applicability of machine learning models for materials science

Christopher Sutton, Mario Boley, Luca M. Ghiringhelli, Matthias Rupp, Jilles Vreeken, Matthias Scheffler (2020)
Highlighted by Jan Jensen

Figure 3 from the paper (c) The authors 2020. Reproduced under the CC-BY license

This paper applies subgroup discovery (SGD) to detect the domain of applicability (DA) of three ML models for predicting formation energies of certain solid-state materials. The authors define several DA features such as unit cell dimensions, composition, and interatomic distances. These features are different from the (much more complex) representations used as input to the ML models. The SGD algorithm then uses the DA features together with the ML-model errors to determine a selector (σf) by finding the largest possible subgroup of systems (coverage) with the lowest possible error.
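To make the idea of such a search concrete, here is a minimal sketch. It is not the authors' implementation: the objective (coverage times the relative reduction in MAE), the exhaustive search over threshold conditions, the function name `find_selector`, and the feature names and values are all assumptions chosen for illustration.

```python
from itertools import combinations

def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)

def find_selector(rows, errors, max_terms=2):
    """Brute-force search for a conjunction of threshold conditions.

    rows   : list of dicts mapping (hypothetical) DA feature names to values
    errors : per-sample ML-model errors, aligned with rows
    Score (an assumption for this sketch, not the paper's code):
        coverage * (1 - MAE_subgroup / MAE_all)
    """
    overall = mae(errors)
    # Candidate conditions: feature <= t and feature >= t at each observed value.
    props = []
    for name in rows[0]:
        for t in sorted({r[name] for r in rows}):
            props.append((name, "<=", t))
            props.append((name, ">=", t))

    def holds(row, prop):
        name, op, t = prop
        return row[name] <= t if op == "<=" else row[name] >= t

    best = (0.0, (), 1.0, overall)  # score, selector, coverage, subgroup MAE
    for k in range(1, max_terms + 1):
        for selector in combinations(props, k):
            inside = [e for r, e in zip(rows, errors)
                      if all(holds(r, p) for p in selector)]
            if not inside:
                continue  # empty subgroup: nothing to score
            coverage = len(inside) / len(rows)
            sub_mae = mae(inside)
            score = coverage * (1 - sub_mae / overall)
            if score > best[0]:  # strict ">" keeps the first of tied selectors
                best = (score, selector, coverage, sub_mae)
    return best

# Toy data: hypothetical DA features and made-up model errors.
rows = [
    {"lattice_a": 3.0, "frac_In": 0.1},
    {"lattice_a": 3.2, "frac_In": 0.2},
    {"lattice_a": 3.4, "frac_In": 0.3},
    {"lattice_a": 3.6, "frac_In": 0.7},
    {"lattice_a": 3.8, "frac_In": 0.8},
    {"lattice_a": 4.0, "frac_In": 0.9},
]
errors = [2.0, -3.0, 2.5, 20.0, -25.0, 30.0]

score, selector, coverage, sub_mae = find_selector(rows, errors)
print(selector, coverage, sub_mae)
```

On this toy data the search selects the half of the samples with small errors. An exhaustive search like this only scales to a handful of features and thresholds; it is meant to show the trade-off between coverage and error, not an efficient algorithm.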

The selector is a definition of this subgroup in terms of some of the DA features, which are automatically chosen by the SGD algorithm. For example, the DA of one of the models is defined by three DA features:


where "^" means "and". The MAE for this DA is 7.6 meV/cation, compared to 14.2 meV/cation for the entire test set.

Interestingly, the three ML models this analysis was applied to had virtually the same overall MAEs but different DAs and quite different MAEs within each domain. The coverage of each DA also varied considerably.
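A small worked example shows how two models with the same overall MAE can still have different domains. This is a sketch with made-up numbers: the feature names `f1` and `f2`, the error values, the two selectors, and the helper `evaluate` are hypothetical and not taken from the paper.

```python
def evaluate(selector, rows, errors):
    """Return (coverage, MAE inside the selector) for one model's errors.

    selector is a list of conditions joined by "and"; it is assumed to
    match at least one sample, otherwise the MAE would be undefined.
    """
    inside = [abs(e) for r, e in zip(rows, errors)
              if all(cond(r) for cond in selector)]
    return len(inside) / len(rows), sum(inside) / len(inside)

# Hypothetical DA features for four samples (toy values).
rows = [{"f1": 0.1, "f2": 5.0}, {"f1": 0.2, "f2": 6.0},
        {"f1": 0.8, "f2": 5.5}, {"f1": 0.9, "f2": 7.0}]

# Errors of two hypothetical models; both have an overall MAE of 10.
errors_a = [4.0, 4.0, 16.0, 16.0]
errors_b = [2.0, 2.0, 2.0, 34.0]

# One selector per model, each a conjunction of conditions on DA features.
selector_a = [lambda r: r["f1"] <= 0.5]
selector_b = [lambda r: r["f1"] <= 0.85, lambda r: r["f2"] <= 6.5]

result_a = evaluate(selector_a, rows, errors_a)
result_b = evaluate(selector_b, rows, errors_b)
print(result_a, result_b)
```

Here model A is accurate on half of the samples, while model B is more accurate on three quarters of them, even though a single overall MAE would not distinguish the two.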

The SGD method appears to be a very useful and generally applicable tool for ML. The SGD algorithm used for this study is freely available here.