Monday, September 28, 2020

Regio-selectivity prediction with a machine-learned reaction representation and on-the-fly quantum mechanical descriptors

Yanfei Guan, Connor W. Coley, Haoyang Wu, Duminda Ranasinghe, Esther Heid, Thomas J. Struble, Lagnajit Pattanaik, William H. Green, and Klavs F. Jensen (2020)

Highlighted by Jan Jensen

Figure 3 from the paper. (c) The authors 2020. Reproduced under the CC BY-NC-ND 4.0 license.

The red and green columns show the accuracy of regioselectivity prediction as a function of training set size (N) for two ML models: one based on QM descriptors and the other based on a graph NN (GNN). For N = 200 QM outperforms GNN by 9%, but the accuracy of QM improves by no more than 1.5% for larger training sets. GNN does improve with more data and ends up outperforming QM by 2.5% for large training sets.

Combining QM and GNN (QM-GNN) gives roughly the same accuracy as QM for small training sets and as GNN for large ones. To remove the cost of the QM calculations, a separate GNN model that predicts the QM descriptors is developed and combined with the GNN model of regioselectivity (ml-QM-GNN), which gives roughly the same results much faster. Note that this GNN descriptor model is trained on a different, and much larger, data set (since no experimental data is needed) and can be used to augment other types of predictions.

The fact that ml-QM-GNN outperforms QM-GNN for N = 200, even though it only approximates the same QM descriptors, indicates that the accuracies are good to no more than +/- 1%, so the slightly better performance of ml-QM-GNN compared to GNN for N = 2000 is not real. So ml-QM only enhances the accuracy for ca. N < 1000 for this particular property, but it is definitely worth doing for problems with only a few hundred data points, especially now that the ml-QM model has already been developed.
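
As a rough illustration of why a difference of about 1% can be within the noise, the sketch below estimates the binomial standard error of an accuracy measured on a test set and checks whether the gap between two accuracies exceeds it. This is not the analysis used in the paper: the function names, the accuracies (0.90 and 0.91) and the test set sizes are all hypothetical, and the two estimates are treated as independent, which ignores the correlation that arises when both models are scored on the same test reactions.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test independent cases."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


def difference_is_resolved(acc_a: float, acc_b: float, n_test: int, z: float = 1.96) -> bool:
    """Return True if |acc_a - acc_b| exceeds z standard errors of the difference.

    Assumption: the two accuracy estimates are treated as independent.
    """
    se_a = accuracy_standard_error(acc_a, n_test)
    se_b = accuracy_standard_error(acc_b, n_test)
    se_diff = math.sqrt(se_a**2 + se_b**2)
    return abs(acc_a - acc_b) > z * se_diff


# Hypothetical accuracies and test set sizes, chosen only for illustration.
for n_test in (100, 1000, 100000):
    se = accuracy_standard_error(0.90, n_test)
    resolved = difference_is_resolved(0.90, 0.91, n_test)
    print(f"n_test={n_test}: standard error={se:.4f}, 1% gap resolved={resolved}")
```

With these made-up numbers, a 1% gap is only resolved for very large test sets, which is consistent with reading small differences between the models with caution.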