Monday, August 31, 2020

A community-powered search of machine learning strategy space to find NMR property prediction models

Lars A. Bratholm, Will Gerrard, Brandon Anderson, Shaojie Bai, Sunghwan Choi, Lam Dang, Pavel Hanchar, AddisonHoward, Guillaume Huard, Sanghoon Kim, Zico Kolter, Risi Kondor, Mordechai Kornbluth, YouhanLee, Youngsoo Lee, Jonathan P. Mailoa, Thanh Tu Nguyen, Milos Popovic, Goran Rakocevic, Walter Reade, Wonho Song, Luka Stojanovic, Erik H. Thiede, Nebojsa Tijanic, Andres Torrubia, Devin Willmott, Craig P. Butts, David R. Glowacki, & Kaggle participants (2020)

Highlighted by Jan Jensen


Figure 1a and 1b from the paper (c) The authors. Reproduced under the CC-BY licence

Disclaimer: I was Lars Bratholms PhD advisor

This paper describes the results of a Kaggle competition called Champs for developing ML models that predict NMR coupling constants with DFT accuracy. 

In a Kaggle competition the host of the competition provides a public training and test set. Participants use these datasets to develop ML models, which the site then evaluates on a private test set. The accuracy of each model is posted and the object of the competition is submit the most accurate model before the end of the competition. Competitors can submit as often as they want during the competition, which in this case lasted 3 months. The winners receive cash prices: in this case the top 5 models received \$12.5K, \$7.5K, \$5K, \$3K, and \$2K, respectively.
"[Champs] received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published ‘in-house’ efforts. A meta-ensemblemodel constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art."
Two of the top 5 teams had no domain specific expertise.

Is this the way of the future? Should any chemistry ML proposal include Kaggle prize money in the budget? I don't see any scientific reasons why not.

This work is licensed under a Creative Commons Attribution 4.0 International License.