## Monday, September 28, 2020

### Regio-selectivity prediction with a machine-learned reaction representation and on-the-fly quantum mechanical descriptors

Yanfei Guan, Connor W. Coley, Haoyang Wu, Duminda Ranasinghe, EstherHeid, Thomas J. Struble, Lagnajit Pattanaik, William H. Green, and Klavs F. Jensen (2020)

Highlighted by Jan Jensen

Figure 3 from the paper. (c) The authors 2020 reproduced under the CC BY-NC-ND 4.0 license

The red and green columns show the accuracy of regioselectivity prediction as a function of training set size (N) for two ML-models: one based on QM descriptors and the other based on a graph NN (GNN). For N = 200 QM outperforms GNN by 9%, but the performance of QM doesn't improve by more than 1.5% for larger training sets. GNN does improve and ends up outperforming QM by 2.5% for large training sets.

Combining QM and GNN (QM-GNN) gives roughly the same accuracy as QM and GNN for small and large training sets, respectively. To remove the cost of the QM, a separate GNN model for the QM descriptors is developed and combined with the GNN model of regioselectivity (ml-QM-GNN), which gives roughly the same results at much faster speed. Note that this GNN descriptor model is trained on a different, and much larger, data set (since no experimental data is needed) and can be used to augment other types of predictions.

The fact that ml-QM-GNN outperforms QM-GNN for N = 200 indicates the accuracies are good to no more than +/- 1%, so the slightly better performance for ml-QM-GNN compared to GNN for N = 2000 is not real. So ml-QM only enhances the accuracy for ca N < 1000 for this particular property, but is definitely worth doing for problems with only a few hundred data points. Especially now that the ml-QM model is already has been developed.

## Monday, August 31, 2020

### A community-powered search of machine learning strategy space to find NMR property prediction models

Lars A. Bratholm, Will Gerrard, Brandon Anderson, Shaojie Bai, Sunghwan Choi, Lam Dang, Pavel Hanchar, AddisonHoward, Guillaume Huard, Sanghoon Kim, Zico Kolter, Risi Kondor, Mordechai Kornbluth, YouhanLee, Youngsoo Lee, Jonathan P. Mailoa, Thanh Tu Nguyen, Milos Popovic, Goran Rakocevic, Walter Reade, Wonho Song, Luka Stojanovic, Erik H. Thiede, Nebojsa Tijanic, Andres Torrubia, Devin Willmott, Craig P. Butts, David R. Glowacki, & Kaggle participants (2020)

Highlighted by Jan Jensen

Figure 1a and 1b from the paper (c) The authors. Reproduced under the CC-BY licence

Disclaimer: I was Lars Bratholms PhD advisor

This paper describes the results of a Kaggle competition called Champs for developing ML models that predict NMR coupling constants with DFT accuracy.

In a Kaggle competition the host of the competition provides a public training and test set. Participants use these datasets to develop ML models, which the site then evaluates on a private test set. The accuracy of each model is posted and the object of the competition is submit the most accurate model before the end of the competition. Competitors can submit as often as they want during the competition, which in this case lasted 3 months. The winners receive cash prices: in this case the top 5 models received \$12.5K, \$7.5K, \$5K, \$3K, and \\$2K, respectively.
"[Champs] received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published ‘in-house’ efforts. A meta-ensemblemodel constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art."
Two of the top 5 teams had no domain specific expertise.

Is this the way of the future? Should any chemistry ML proposal include Kaggle prize money in the budget? I don't see any scientific reasons why not.