Friday, October 30, 2020

Identifying domains of applicability of machine learning models for materials science

Christopher Sutton, Mario Boley, Luca M. Ghiringhelli, Matthias Rupp, Jilles Vreeken, Matthias Scheffler (2020)
Highlighted by Jan Jensen

Figure 3 from the paper (c) The authors 2020. Reproduced under the CC-BY license

This paper applies subgroup discovery (SGD) to detect the domain of applicability (DA) of three ML models for predicting formation energies of certain solid state materials. The authors define several DA features such as unit cell dimensions, composition, and interatomic distances. These features are different from the (much more complex) representations used as input to the ML models. The SGD algorithm then uses the DA features together with the ML-model errors to determine a selector (σf) by finding the largest possible subgroup of molecular systems (coverage) with the lowest possible error.

The selector is a definition of this subgroup in terms of some of the DA features, which are automatically chosen by the SGD algorithm. For example, the DA of one of the models is defined by three DA features:

where "^" means "and". The MAE for this DA is 7.6 meV/cation, compared to 14.2 meV/cation for the test set used to train the ML model.
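The trade-off between coverage and error can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `find_selector`, the median-threshold propositions, the brute-force search, and the objective (coverage times the relative reduction in MAE) are all assumptions made for illustration, and the data are synthetic.

```python
from itertools import combinations

import numpy as np


def find_selector(features, errors, names, max_terms=2):
    """Brute-force search for a conjunction of threshold propositions.

    Assumed objective: coverage * max(0, 1 - MAE_subgroup / MAE_global).
    """
    n = len(errors)
    global_mae = np.mean(np.abs(errors))

    # Candidate propositions: each feature is split at its median (an assumption).
    props = []
    for j, name in enumerate(names):
        cut = np.median(features[:, j])
        props.append((f"{name} <= {cut:.2f}", features[:, j] <= cut))
        props.append((f"{name} > {cut:.2f}", features[:, j] > cut))

    best = (0.0, "", np.ones(n, dtype=bool))
    for k in range(1, max_terms + 1):
        for combo in combinations(props, k):
            # "^" (and) of all propositions in this candidate selector.
            mask = np.logical_and.reduce([m for _, m in combo])
            if not mask.any():
                continue
            coverage = mask.mean()
            sub_mae = np.mean(np.abs(errors[mask]))
            score = coverage * max(0.0, 1.0 - sub_mae / global_mae)
            if score > best[0]:
                best = (score, " ^ ".join(label for label, _ in combo), mask)
    return best


# Synthetic example: model errors are small when feature "a" is low.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 3))
err = np.where(X[:, 0] <= 0.5, 0.01, 0.05) * rng.standard_normal(400)

score, selector, mask = find_selector(X, err, ["a", "b", "c"])
print("selector:", selector)
print("coverage:", mask.mean())
print("MAE inside / overall:", np.mean(np.abs(err[mask])), np.mean(np.abs(err)))
```

In this toy case the search recovers a one-term selector on feature "a", because adding further conditions reduces coverage without lowering the error enough to compensate.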

Interestingly, the three ML models this analysis was applied to had virtually the same overall MAEs but different DAs and quite different MAEs within each domain. Also, the coverage of each DA varied considerably.

The SGD method appears to be a very useful and generally applicable tool for ML. The SGD algorithm used for this study is freely available here.

Monday, September 28, 2020

Regio-selectivity prediction with a machine-learned reaction representation and on-the-fly quantum mechanical descriptors

Yanfei Guan, Connor W. Coley, Haoyang Wu, Duminda Ranasinghe, Esther Heid, Thomas J. Struble, Lagnajit Pattanaik, William H. Green, and Klavs F. Jensen (2020)

Highlighted by Jan Jensen

Figure 3 from the paper. (c) The authors 2020. Reproduced under the CC BY-NC-ND 4.0 license

The red and green columns show the accuracy of regioselectivity prediction as a function of training set size (N) for two ML-models: one based on QM descriptors and the other based on a graph NN (GNN). For N = 200 QM outperforms GNN by 9%, but the performance of QM doesn't improve by more than 1.5% for larger training sets. GNN does improve and ends up outperforming QM by 2.5% for large training sets.

Combining QM and GNN (QM-GNN) gives roughly the same accuracy as QM and GNN for small and large training sets, respectively. To remove the cost of the QM, a separate GNN model for the QM descriptors is developed and combined with the GNN model of regioselectivity (ml-QM-GNN), which gives roughly the same results at much faster speed. Note that this GNN descriptor model is trained on a different, and much larger, data set (since no experimental data is needed) and can be used to augment other types of predictions.

The fact that ml-QM-GNN outperforms QM-GNN for N = 200 indicates the accuracies are good to no more than +/- 1%, so the slightly better performance for ml-QM-GNN compared to GNN for N = 2000 is not real. So ml-QM only enhances the accuracy for ca. N < 1000 for this particular property, but is definitely worth doing for problems with only a few hundred data points, especially now that the ml-QM model has already been developed.

Monday, August 31, 2020

A community-powered search of machine learning strategy space to find NMR property prediction models

Lars A. Bratholm, Will Gerrard, Brandon Anderson, Shaojie Bai, Sunghwan Choi, Lam Dang, Pavel Hanchar, Addison Howard, Guillaume Huard, Sanghoon Kim, Zico Kolter, Risi Kondor, Mordechai Kornbluth, Youhan Lee, Youngsoo Lee, Jonathan P. Mailoa, Thanh Tu Nguyen, Milos Popovic, Goran Rakocevic, Walter Reade, Wonho Song, Luka Stojanovic, Erik H. Thiede, Nebojsa Tijanic, Andres Torrubia, Devin Willmott, Craig P. Butts, David R. Glowacki, & Kaggle participants (2020)

Highlighted by Jan Jensen


Figure 1a and 1b from the paper (c) The authors. Reproduced under the CC-BY licence

Disclaimer: I was Lars Bratholm's PhD advisor.

This paper describes the results of a Kaggle competition called Champs for developing ML models that predict NMR coupling constants with DFT accuracy. 

In a Kaggle competition the host of the competition provides a public training and test set. Participants use these datasets to develop ML models, which the site then evaluates on a private test set. The accuracy of each model is posted and the object of the competition is to submit the most accurate model before the end of the competition. Competitors can submit as often as they want during the competition, which in this case lasted 3 months. The winners receive cash prizes: in this case the top 5 models received $12.5K, $7.5K, $5K, $3K, and $2K, respectively.
"[Champs] received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published ‘in-house’ efforts. A meta-ensemble model constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art."
Two of the top 5 teams had no domain-specific expertise.
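Why a linear combination of predictions can beat every individual model is easy to show with a small Python sketch. This is only an illustration on synthetic data: the three "models" are the true values plus independent noise, and fitting the weights by ordinary least squares on a held-in split is an assumption, not necessarily how the paper's meta-ensemble was built.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for reference coupling constants (synthetic, arbitrary units).
y_true = rng.normal(size=500)

# Three hypothetical models: truth plus independent noise of different size.
preds = np.column_stack(
    [y_true + rng.normal(scale=s, size=500) for s in (0.3, 0.4, 0.5)]
)

# Fit the combination weights on one half, evaluate on the other half.
fit, held = slice(0, 250), slice(250, 500)
weights, *_ = np.linalg.lstsq(preds[fit], y_true[fit], rcond=None)
ensemble = preds[held] @ weights


def mae(p):
    """Mean absolute error against the held-out reference values."""
    return np.mean(np.abs(p - y_true[held]))


individual = [mae(preds[held][:, i]) for i in range(3)]
print("weights:", weights)
print("individual MAEs:", individual)
print("ensemble MAE:", mae(ensemble))
```

The gain comes from the errors of the individual models being partly uncorrelated, so they average out in the weighted sum. Models with strongly correlated errors would benefit much less.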

Is this the way of the future? Should any chemistry ML proposal include Kaggle prize money in the budget? I don't see any scientific reasons why not.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Wednesday, July 29, 2020

OrbNet: Deep Learning for Quantum Chemistry Using Symmetry-Adapted Atomic-Orbital Features

Figure 4 from the paper. (c) the authors 2020.

This method takes information from a GFN1-xTB calculation as input to a graph-convolution (GC) NN to predict the difference between DFT and GFN1-xTB total energies. In conventional GC the molecule is typically represented by an adjacency matrix (a binary matrix where 1 indicates a bond) and a list of atomic and bond features, such as nuclear charges and bond orders, associated with each node and edge. This approach uses the diagonal and off-diagonal elements of matrices such as Fock, overlap, and density matrices from a GFN1-xTB calculation as node and edge features, respectively. 
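The mapping from matrices to graph features described above can be sketched in a few lines of Python. This is a simplified illustration rather than the OrbNet code: the function name `graph_features` is hypothetical, the matrices are random symmetric stand-ins for the Fock, overlap, and density matrices, and the symmetry adaptation mentioned in the paper's title is not included.

```python
import numpy as np


def graph_features(matrices):
    """Stack diagonals as node features and off-diagonals as edge features."""
    stacked = np.stack(matrices, axis=-1)        # shape (n, n, n_matrices)
    n = stacked.shape[0]
    idx = np.arange(n)
    node_feats = stacked[idx, idx]               # shape (n, n_matrices)
    i, j = np.nonzero(~np.eye(n, dtype=bool))    # all ordered pairs with i != j
    edge_index = np.stack([i, j])                # shape (2, n * (n - 1))
    edge_feats = stacked[i, j]                   # shape (n * (n - 1), n_matrices)
    return node_feats, edge_index, edge_feats


def random_symmetric(n, rng):
    """Random symmetric matrix used as a placeholder for a real QM matrix."""
    a = rng.normal(size=(n, n))
    return 0.5 * (a + a.T)


rng = np.random.default_rng(2)
n = 4
F, S, P = (random_symmetric(n, rng) for _ in range(3))

node_feats, edge_index, edge_feats = graph_features([F, S, P])
print(node_feats.shape, edge_index.shape, edge_feats.shape)
```

A graph network would then take these node and edge features as input and predict the correction that is added to the GFN1-xTB energy to approximate the DFT energy.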

The model gets state-of-the-art accuracies for QM9 total energies and the same model also gets excellent results for conformational energies from a different data set. Basically DFT-level accuracy at semiempirical cost (it's not clear to me how it can be faster than the underlying GFN1-xTB calculation, but that might be down to a different implementation of the GFN1-xTB method).

It's not clear to me whether the method can be used to optimise geometries, and thereby correct any deficiency in GFN1-xTB structures, and it's also not clear whether the code will be made available.