Lior Hirschfeld, Kyle Swanson, Kevin Yang, Regina Barzilay, and Connor W. Coley 2020

Highlighted by Jan Jensen

This work is licensed under a Creative Commons Attribution 4.0 International License.

Important recent papers in computational and theoretical chemistry

A free resource for scientists run by scientists


Figure 3 from the paper. (c) American Chemical Society 2020

Given the black-box nature of ML models, it is very important to have some measure of how much to trust their predictions. There are many ways to do this, but this paper shows that "none of the methods we tested is unequivocally superior to all others, and none produces a particularly reliable ranking of errors across multiple data sets."

This conclusion is neatly summarised in the figure shown above for 5 common datasets, 2 different ML methods, and 4 different methods for uncertainty quantification. For each combination, the plot shows the RMSE for the 100, 50, 25, 10, and 5% of the hold-out test set for which the uncertainty quantification method calculated the lowest uncertainty.
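As a rough illustration of how such a curve can be computed, here is a minimal sketch. The function names and the toy data are my own assumptions and are not taken from the paper; the toy errors are constructed to grow with the reported uncertainty, which is the ideal case.

```python
import math

def rmse(errors):
    """Root-mean-square of a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def rmse_by_confidence(y_true, y_pred, uncertainty,
                       fractions=(1.0, 0.5, 0.25, 0.10, 0.05)):
    """RMSE on the most confident fraction of the test set, for each fraction."""
    # Rank test points from lowest to highest predicted uncertainty.
    order = sorted(range(len(y_true)), key=lambda i: uncertainty[i])
    curve = {}
    for frac in fractions:
        n_keep = max(1, round(frac * len(order)))
        kept = order[:n_keep]
        curve[frac] = rmse([y_pred[i] - y_true[i] for i in kept])
    return curve

# Toy data (made up): the absolute error equals the reported uncertainty,
# so the RMSE must fall as the confidence cutoff tightens.
y_true = [0.0] * 20
uncertainty = [0.1 * (i + 1) for i in range(20)]
y_pred = [u if i % 2 == 0 else -u for i, u in enumerate(uncertainty)]

curve = rmse_by_confidence(y_true, y_pred, uncertainty)
for frac, value in curve.items():
    print(f"{frac:>5.0%} most confident: RMSE = {value:.3f}")
```

With a useful uncertainty estimate the printed RMSE should fall steadily from the 100% row to the 5% row; a flat or rising curve is the failure mode discussed in the post.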

Generally, the RMSE drops as expected, but in many cases the drops are decidedly modest past 50%, and in some cases the RMSE even increases. In most cases there is very little difference between the uncertainty quantification methods, but sometimes there is, and it's hard to predict when.

One thing that struck me when reading this paper is that many studies that include uncertainty quantification, e.g. using the ensemble approach, just take it for granted that it works and don't present tests like this.


David E. Graff, Eugene I. Shakhnovich, and Connor W. Coley (2020)

Highlighted by Jan Jensen

This paper shows how to find the highest-scoring molecules in a very large library of molecules by scoring only a very small percentage of the library. The focus of the paper is docking scores, but the approach can in principle be used for any molecular property.


Figure 1 and part of Figure 2 from the paper. (c) The authors 2021.


The general approach is simple:

1. Start by picking a random sample of the library (say 100 molecules out of a library of 10,000 molecules) and evaluate their scores.

2. Use these 100 points to train a machine-learning (ML) model to predict the scores.

3. Screen all 10,000 molecules using the ML model. The assumption is that training/using the ML model is much cheaper than evaluating the score.

4. Select the 100 best molecules according to the ML model, compute the scores, and use them to retrain the ML model.

5. Repeat steps 3 and 4.
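The five steps above can be sketched in a few lines. This is only a toy illustration under my own assumptions: each "molecule" is a single made-up number, `true_score` stands in for the expensive scoring function (higher is better here), the surrogate is a tiny nearest-neighbour model rather than any of the models used in the paper, and the library and batch sizes are scaled down so the example runs quickly.

```python
import random

def true_score(x):
    """Stand-in for an expensive scoring function (higher is better)."""
    return -(x - 0.7) ** 2

def knn_predict(x, train, k=3):
    """Tiny surrogate model: mean score of the k nearest evaluated points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)

def greedy_screen(library, n_init=20, batch=20, n_rounds=4, seed=0):
    rng = random.Random(seed)
    # Step 1: score a random initial sample of the library.
    initial = rng.sample(range(len(library)), n_init)
    evaluated = {i: true_score(library[i]) for i in initial}
    for _ in range(n_rounds):
        # Step 2: "train" the surrogate on everything scored so far.
        train = [(library[i], s) for i, s in evaluated.items()]
        # Step 3: predict the score of every molecule not yet evaluated.
        preds = {i: knn_predict(x, train)
                 for i, x in enumerate(library) if i not in evaluated}
        # Step 4: score the predicted top batch and add it to the training set.
        for i in sorted(preds, key=preds.get, reverse=True)[:batch]:
            evaluated[i] = true_score(library[i])
        # Step 5: the loop repeats steps 2-4 with the enlarged training set.
    return evaluated

lib_rng = random.Random(1)
library = [lib_rng.random() for _ in range(1000)]  # 1,000 toy "molecules"
evaluated = greedy_screen(library)

# How many of the true top 50 were found after scoring 10% of the library?
by_score = sorted(range(len(library)),
                  key=lambda i: true_score(library[i]), reverse=True)
top_true = set(by_score[:50])
found = len(top_true & set(evaluated))
print(f"scored {len(evaluated)} of {len(library)}, found {found} of the top 50")
```

The key design point is that the surrogate is queried for the whole library in every round, while the expensive function is only called on the selected batch.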

The best molecules could be the best-scoring molecules (this is known as "greedy" optimisation). However, if the uncertainty of the ML prediction for each molecule can be quantified, there are several other options for what "best" means (use of these approaches is referred to as Bayesian optimisation). The study investigates four selection functions involving standard deviations but finds that the greedy approach works best.
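To make the distinction concrete, here is a minimal sketch contrasting greedy selection with one uncertainty-aware selection function. The upper confidence bound is a commonly used choice that I picked for illustration; the post does not name the four functions tested, and the candidate names, numbers, and the `beta` parameter are made up.

```python
def greedy(mean, std):
    """Rank by the predicted score alone."""
    return mean

def upper_confidence_bound(mean, std, beta=2.0):
    """Reward uncertain predictions: predicted score plus beta standard deviations."""
    return mean + beta * std

def select_batch(candidates, acquisition, batch_size):
    """Return the names of the batch_size candidates with the highest acquisition value."""
    ranked = sorted(candidates,
                    key=lambda name: acquisition(*candidates[name]),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical (predicted mean, predicted standard deviation) per molecule.
candidates = {
    "mol_a": (8.0, 0.1),   # best prediction, very certain
    "mol_b": (7.5, 1.0),   # slightly worse prediction, very uncertain
    "mol_c": (6.0, 0.2),
}

greedy_pick = select_batch(candidates, greedy, 1)
ucb_pick = select_batch(candidates, upper_confidence_bound, 1)
print("greedy picks:", greedy_pick)
print("UCB picks:   ", ucb_pick)
```

Greedy selection ignores the standard deviation entirely, whereas the uncertainty-aware function can prefer a molecule whose prediction is worse but less certain.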


The approach is tested on datasets of varying sizes (10K, 50K, 2M, and 99M molecules) with known docking scores. The study tests three different machine-learning models: RF and NN models using fingerprints, as well as a graph convolutional model (which works best), and various choices of batch size.

In the case of the 99M dataset more than half of the top-50,000 scoring molecules can be found by docking only 600K molecules using this approach.

However, let's turn that last sentence around: if you're developing an ML model to find high-scoring molecules, your training set size needs to be 600K. Furthermore, the study shows that if you just pick 500K random molecules for your training set, your ML model won't identify any of the top-50,000 molecules. You *have* to build this very large training set in this iterative fashion to get an ML model that can reliably identify the top-scoring molecules.


Labels: docking, drug design, jjensen, machine learning

Cynthia Shen, Mario Krenn, Sagi Eppel, Alan Aspuru-Guzik (2020)

Highlighted by Jan Jensen


Figure 2 from the paper. (c) the authors 2020. Reproduced under the CC-BY license

This paper presents an interesting approach for obtaining molecules with particular properties. A 4-layer NN is trained to predict logP values based on a one-hot encoded string representation (SELFIES) of molecules. The NN is trained in the usual way: a molecule is input, the predicted logP value is compared to the true value, and the NN weights are then adjusted to minimise the difference - a process that is repeated for a certain number of epochs.

Once the NN is trained, the process is reversed. A target logP value is chosen together with an arbitrary molecule. The difference between the predicted and target logP values is then minimised by adjusting the one-hot encoded representation of the molecule - a process that is repeated for a certain number of epochs.

In both cases the adjustments are based on the gradient of the error with respect to the weights (in the first case) and the one-hot encoded vectors (in the second case). Since the start vector is binary but becomes a real-valued vector once the optimisation starts, there are some convergence problems. The authors show that this can be addressed by randomly changing the 0's in the one-hot encoding to some number between 0 and a maximum value.
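The reversed optimisation can be illustrated with a deliberately tiny stand-in. Everything below is an assumption made for illustration: the three-symbol alphabet is not real SELFIES, the fixed per-symbol `WEIGHTS` replace the trained 4-layer NN with a linear model, and the learning rate and noise level are arbitrary. What the sketch does show is gradient descent on the input while the weights stay frozen, starting from a one-hot encoding whose zeros have been replaced by random values between 0 and a maximum.

```python
import random

ALPHABET = ["C", "O", "N"]  # toy symbol set, not real SELFIES tokens
# Hypothetical per-symbol weights standing in for a trained network.
WEIGHTS = {"C": 0.5, "O": -0.5, "N": -0.3}

def encode(string, max_noise=0.0, rng=None):
    """One-hot encode; optionally replace each 0 by a random value in [0, max_noise]."""
    rows = []
    for symbol in string:
        row = []
        for a in ALPHABET:
            if a == symbol:
                row.append(1.0)
            elif rng is not None and max_noise > 0.0:
                row.append(rng.uniform(0.0, max_noise))
            else:
                row.append(0.0)
        rows.append(row)
    return rows

def predict(x):
    """Linear stand-in for the property predictor."""
    return sum(WEIGHTS[a] * v for row in x for a, v in zip(ALPHABET, row))

def decode(x):
    """Map each position back to its highest-valued symbol."""
    return "".join(ALPHABET[row.index(max(row))] for row in x)

def optimise_input(x, target, lr=0.05, steps=50):
    """Gradient descent on the input entries; the weights are never changed."""
    for _ in range(steps):
        error = predict(x) - target
        # For this linear model, d(error^2)/dx[p][a] = 2 * error * WEIGHTS[a].
        for row in x:
            for j, a in enumerate(ALPHABET):
                row[j] -= lr * 2.0 * error * WEIGHTS[a]
    return x

rng = random.Random(0)
x = encode("CCCC", max_noise=0.3, rng=rng)
start_string, start_pred = decode(x), predict(x)
target = -1.0
x = optimise_input(x, target)
print(f"start: {start_string} (predicted {start_pred:.2f})")
print(f"end:   {decode(x)} (predicted {predict(x):.4f}, target {target})")
```

Decoding the intermediate vectors after each step, rather than only at the end, gives the kind of optimisation path mentioned in the post.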

Since SELFIES are being used, every vector representation can be resolved to a molecule, which means that one can also analyse the optimisation path to gain insight into how the NN translates molecules into a property prediction.


Highlighted by Jan Jensen

Figure 4 from the paper. (c) 2020 The authors. Reproduced under the CC BY-NC-ND 4.0 license.

While there are many ML-based design studies in the literature, it is quite rare to see one with experimental verification. Open Source Malaria (OSM) made two rounds of antimalarial activity data available and invited researchers to use the data to develop predictive models and identify molecules with high potency for synthesis. Here I'll focus on the second round, which started in 2019 and in which the participants worked with a dataset of ~400 compounds.

Ten teams from both industry and academia submitted models (classifiers) that were judged by a panel of experts using a held-back dataset. The four teams with the highest-scoring models (with a precision between 81% and 91%) were then asked to submit two new molecules each for experimental verification: one possessing a triazolopyrazine core and one without. However, the latter compounds all proved synthetically inaccessible, as did two with the triazolopyrazine core. Thus, a total of six molecules were synthesised and tested and ...

Three of the six compounds were found to be active (<1 μM) or moderately active (1–2.5 μM) in in vitro growth assays with asexual blood-stage P. falciparum (3D7) parasites, representing a hit rate of 50% on a small sample size. Up to this point a total of 398 compounds had been made and evaluated for in vitro activity in OSM Series 4, with the design of these compounds driven entirely by the intuition of medicinal chemists. By setting a potency cut-off of 2.5 μM (the upper limit of reasonable activity), the tally of active compounds discovered in this series stands at 165, representing a comparable human intuition-derived hit rate of 41% on a larger sample size.
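The two hit rates quoted above are easy to reproduce, and a confidence interval makes the "small sample size" caveat concrete. The Wilson score interval used below is my own choice of illustration and is not something the paper reports.

```python
import math

def hit_rate(hits, total):
    """Fraction of tested compounds that were active."""
    return hits / total

def wilson_interval(hits, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = hits / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total
                         + z * z / (4.0 * total * total)) / denom
    return centre - half, centre + half

# Counts taken from the quoted passage.
ml_hits, ml_total = 3, 6
human_hits, human_total = 165, 398

for label, hits, total in [("ML-designed", ml_hits, ml_total),
                           ("intuition-designed", human_hits, human_total)]:
    low, high = wilson_interval(hits, total)
    print(f"{label}: {hit_rate(hits, total):.0%} "
          f"(approx. 95% interval {low:.0%} to {high:.0%})")
```

The interval for six compounds is far wider than the one for 398, so the two hit rates are best read as comparable rather than as evidence that one approach beats the other.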

Interestingly, the Optibrium/Intellegens candidate was initially met with a great deal of scepticism by the expert panel but turned out to be the best overall candidate.

Christopher Sutton, Mario Boley, Luca M. Ghiringhelli, Matthias Rupp, Jilles Vreeken, Matthias Scheffler (2020)

This paper applies subgroup discovery (SGD) to detect the domain of applicability (DA) of three ML models for predicting formation energies of certain solid-state materials. The authors define several DA features such as unit cell dimensions, composition, and interatomic distances. These features are different from the (much more complex) representations used as input to the ML models. The SGD algorithm then uses the DA features together with the ML-model errors to determine a selector (σ*f*) by finding the largest possible subgroup of molecular systems (coverage) with the lowest possible error.

Figure 3 from the paper (c) The authors 2020. Reproduced under the CC-BY license


The selector is a definition of this subgroup in terms of some of the DA features, which are automatically chosen by the SGD algorithm. For example, the DA of one of the models is defined by three DA features:

where "^" means "and". The MAE for this DA is 7.6 meV/cation, compared to 14.2 meV/cation for the test set used to train the ML model.
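To show what a selector does in practice, here is a minimal sketch that evaluates a conjunction of feature conditions on a handful of made-up samples and reports the coverage and the MAE inside the subgroup. The feature names, thresholds, and errors are hypothetical and are not the ones found in the paper, and the sketch only evaluates a given selector; it does not implement the SGD search itself.

```python
def evaluate_selector(samples, selector):
    """Return (coverage, MAE inside the subgroup) for a boolean selector."""
    inside = [s for s in samples if selector(s)]
    if not inside:
        return 0.0, float("nan")
    coverage = len(inside) / len(samples)
    mae = sum(abs(s["error"]) for s in inside) / len(inside)
    return coverage, mae

def example_selector(sample):
    """Hypothetical selector: a conjunction ("and") of two feature conditions."""
    return sample["lattice_a"] <= 3.5 and sample["frac_x"] <= 0.5

# Made-up samples: two DA features plus the signed ML-model error.
samples = [
    {"lattice_a": 3.0, "frac_x": 0.2, "error": 5.0},
    {"lattice_a": 3.2, "frac_x": 0.3, "error": -7.0},
    {"lattice_a": 4.5, "frac_x": 0.1, "error": 20.0},
    {"lattice_a": 3.1, "frac_x": 0.8, "error": -18.0},
]

coverage, subgroup_mae = evaluate_selector(samples, example_selector)
overall_mae = sum(abs(s["error"]) for s in samples) / len(samples)
print(f"coverage {coverage:.0%}, subgroup MAE {subgroup_mae:.1f}, "
      f"overall MAE {overall_mae:.1f}")
```

A search procedure would score many candidate selectors this way and keep the one with the best trade-off between high coverage and low subgroup error.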

Interestingly, the three ML models this analysis was applied to had virtually the same overall MAEs but different DAs and quite different MAEs within each domain. Also, the coverage of each DA varied considerably.

The SGD method appears to be a very useful and generally applicable tool for ML. The SGD algorithm used for this study is freely available here.

Highlighted by Jan Jensen

Figure 3 from the paper. (c) The authors 2020 reproduced under the CC BY-NC-ND 4.0 license

The red and green columns show the accuracy of regioselectivity prediction as a function of training set size (N) for two ML-models: one based on QM descriptors and the other based on a graph NN (GNN). For N = 200 QM outperforms GNN by 9%, but the performance of QM doesn't improve by more than 1.5% for larger training sets. GNN does improve and ends up outperforming QM by 2.5% for large training sets.

Combining QM and GNN (QM-GNN) gives roughly the same accuracy as QM and GNN for small and large training sets, respectively. To remove the cost of the QM, a separate GNN model for the QM descriptors is developed and combined with the GNN model of regioselectivity (ml-QM-GNN), which gives roughly the same results at much faster speed. Note that this GNN descriptor model is trained on a different, and much larger, data set (since no experimental data is needed) and can be used to augment other types of predictions.

The fact that ml-QM-GNN outperforms QM-GNN for N = 200 indicates that the accuracies are good to no more than +/- 1%, so the slightly better performance of ml-QM-GNN compared to GNN for N = 2000 is not real. So ml-QM only enhances the accuracy for ca. N < 1000 for this particular property, but it is definitely worth doing for problems with only a few hundred data points, especially now that the ml-QM model has already been developed.

Highlighted by Jan Jensen

Figure 1a and 1b from the paper (c) The authors. Reproduced under the CC-BY licence

Disclaimer: I was Lars Bratholm's PhD advisor.

This paper describes the results of a Kaggle competition called Champs for developing ML models that predict NMR coupling constants with DFT accuracy.

In a Kaggle competition the host of the competition provides a public training and test set. Participants use these datasets to develop ML models, which the site then evaluates on a private test set. The accuracy of each model is posted, and the object of the competition is to submit the most accurate model before the end of the competition. Competitors can submit as often as they want during the competition, which in this case lasted 3 months. The winners receive cash prizes: in this case the top 5 models received \$12.5K, \$7.5K, \$5K, \$3K, and \$2K, respectively.

"[Champs] received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published ‘in-house’ efforts. A meta-ensemble model constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art."

Is this the way of the future? Should any chemistry ML proposal include Kaggle prize money in the budget? I don't see any scientific reasons why not.
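The meta-ensemble mentioned in the quote is a linear combination of model predictions. As a minimal sketch of the idea, the code below blends just two sets of made-up predictions and picks the mixing weight by a simple grid search; how the weights were actually fitted in the paper is not described in this post, so the procedure and all numbers here are assumptions.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def blend(pred_a, pred_b, w):
    """Linear combination w * A + (1 - w) * B of two prediction lists."""
    return [w * a + (1.0 - w) * b for a, b in zip(pred_a, pred_b)]

def fit_blend_weight(y_true, pred_a, pred_b, steps=100):
    """Grid search over w in [0, 1] for the lowest MAE on the given data."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda w: mae(y_true, blend(pred_a, pred_b, w)))

# Made-up reference values and two models with opposite systematic errors.
y_true = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.2, 2.2, 3.2, 4.2]   # overestimates by 0.2
pred_b = [0.8, 1.8, 2.8, 3.8]   # underestimates by 0.2

w = fit_blend_weight(y_true, pred_a, pred_b)
ensemble = blend(pred_a, pred_b, w)
print(f"weight on model A: {w:.2f}")
print(f"MAE A = {mae(y_true, pred_a):.3f}, MAE B = {mae(y_true, pred_b):.3f}, "
      f"MAE blend = {mae(y_true, ensemble):.3f}")
```

Because the grid includes w = 0 and w = 1, the blend can never be worse than the better of the two models on the data used for fitting; in practice the weight should be fitted on data that is not used for the final evaluation.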
