Saturday, August 28, 2021

Evidential Deep Learning for Guided Molecular Property Prediction and Discovery

Ava P. Soleimany, Alexander Amini, Samuel Goldman, Daniela Rus, Sangeeta N. Bhatia, and Connor W. Coley 2021
Highlighted by Jan Jensen

TOC figure from the paper. (c) 2021 The authors. Reproduced under the CC BY NC ND license

While knowing the uncertainty of a ML-predicted value is valuable, it is really only the Gaussian process method that delivers a rigorous estimate of this. If you want to use other ML methods such as NN you have to use more ad hoc methods like the ensemble or dropout methods and these only report of the uncertainty in the model parameters (if you retrain your model you'll get slightly different answers) and not on the uncertainty in the data (if you remeasure your data you'll get slightly different answers).

This paper presents a way to quantify both types of uncertainty for NN models (evidential learning). To apply it you change your output layer to output 4 values instead of 1 and you use a special loss function. One of the four output values is your prediction while the remaining 3 output values are plugged into a formula that gives you the uncertainty.

The paper compares this approach to the ensemble and dropout methods and shows that the evidential learning approach usually works better, i.e. there's a better correlation between the predicted uncertainty and the deviation from the ground truth. Note that it's a little tricky to quantify this correlation: if the error is random (which is the basic assumption behind all this) then the error can, by chance, be very small for a point with large uncertainty; it's just less likely compared to a point with low uncertainty. 

The code is available here (note the link in the paper is wrong)

This work is licensed under a Creative Commons Attribution 4.0 International License.

Thursday, July 29, 2021

Interactions between large molecules pose a puzzle for reference quantum mechanical methods

Yasmine S. Al-Hamdani, Péter R. Nagy, Andrea Zen, Dennis Barton, Mihály Kállay, Jan Gerit Brandenburg and  Alexandre Tkatchenko (2021)
Highlighted by Jan Jensen

Figure 1 from the paper (c) The authors. Reproduced under the CC-BY licence

CCSD(T) and DMC are two gold-standard methods that should give the same results, and usually do. However, this study finds three systems for which the disagreement is unexpected large, up to 7.6 kcal/mol. It's not clear why and and it's not clear which method is correct. Since we use these methods to develop and benchmark other methods this is a real problem. 

Now, there could be many reasons for the discrepancy and the authors have considered all of them and discounted most of them. The remaining reasons, such as higher order terms in the CC expansion, are practically impossible to check at presents. It also hard to believe that they would make such a large contributions to the interaction energy of two closed shell systems. 

But there must be some reason for the discrepancy and when it is found we will most likely have learned something new about these methods.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Monday, June 28, 2021

Bayesian optimization of nanoporous materials

Aryan Deshwal, Cory M. Simon, and Janardhan Rao Doppa (2021)
Highlighted by Jan Jensen

Figure 5 from the paper. (c) the authors. Reproduced under the CC-BY license.

This is another example of searching chemical space for systems with extreme property values by continuously updating a surrogate ML model of the property. I wrote about another such example, by Graff et al., here, but the main new thing here (IMO) is the low number of property evaluations needed to train the surrogate model.

The property of interest is the methane deliverable capacity (y) of covalent organic frameworks (COFs) which has been predicted by expensive MD calculations for ca 70,000 COFs. Ten randomly selected datapoints are used to train a Gaussian Process (GP) surrogate model. Bayesian optimisation (BO) is then used to identify the COF that is most likely to improve the surrogate model (based on the predicted y-value and the uncertainty of of the prediction), which is re-evaluated using MD. The MD value then added to the training set and the process is repeated for up to 500 steps. 

Already after 100 steps (110 MD evaluations including the initial training set), the best COF is identified as are 25% of the top-100 COFs, which is quite impressive. For comparison, the smallest training set in the previous study by Graff et al. is 100 and they need a training set of 300 to get to 25%. On the other hand, Graff et al. get up to ca 70% of the top 100 with a training set of 500, compared to ca 50% in this study (but the chemical space of Graff et al. is only 10,000 so it's a bit hard to compare).

The main lesson (IMO) is that's it's worth trying to start with very small training sets for these approaches.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Friday, May 28, 2021

Using attribution to decode binding mechanism in neural network models for chemistry

Kevin McCloskey, Ankur Talya, Federico Montia, Michael P. Brennera, and Lucy J. Colwella (2019)
Highlighted by Jan Jensen

Part of Figure 3. Red indicates atoms that make positive contributions to the predicted values.
Copyright (2019) National Academy of Sciences.

This paper shows that state-of-the-art ML models can easily be fooled even for relatively trivial classification problems. 

The authors generate several synthetic classification sets using simple rules, such as the presence of a phenyl group, and train both a graph-convolutional and message passing NN. Not surprisingly, the hold-out performance is near perfect with AUCs near 1.000. 

Then they use a technique called integrated gradients to compute atomic contributions to the predictions and check whether these contributions match the rules used to create the data sets. For example, if the ground truth rule is the presence of a benzene ring, then only benzene ring atoms should make significant positive contributions. For some ground truth rules, this is often not the case!

Figure 3A above shows a case where the ground truth rule is the presence of three groups: a phenyl, a primary amine, and an ether. While this model is correctly classified there are significant atomic contributions from some of the fused ring atoms. So either the atomic contributions are mis-assigned by the integrated gradients method or the prediction is correct for the wrong reasons. The authors argue that it is the latter because three atomic changes in and near the fused ring (Figure 3B) results in a molecule that the model mis-classifies.

The authors note:
It is dangerous to trust a model whose predictions one does not understand. A serious issue with neural networks is that, although a held-out test set may suggest that the model has learned to predict perfectly, there is no guarantee that the predictions are made for the right reason. Biases in the training set can easily cause errors in the model’s logic. The solution to this conundrum is to take the model seriously: Analyze it, ask it why it makes the predictions that it does, and avoid relying solely on aggregate accuracy metrics.
The integrated gradient (IG) method is interesting in and of itself, so a few more words on that:

Jiménez-Luna et al. have since shown that the IG approach can be used to extract pharmacophores from models trained on experimental data sets. 

IG can only be applied to fully differentiable models such as NNs but Riniker and Landrum and Sheridan have developed fingerprint-based approaches that can be applied to any ML model but are theoretically more ad hoc. The Riniker-Landrum approach is available in RDKit while Jiménez-Luna et al. provide an implementation of the Sheridan approach, and also identify several examples where IG and the Sheridan approach gives different interpretations.

Friday, April 30, 2021

ChemDyME: Kinetically Steered, Automated Mechanism Generation Through Combined Molecular Dynamics and Master Equation Calculations

Robin J. Shannon, Emilio Martínez-Núñez, Dmitrii V. Shalashilin, David R. Glowacki (2021)
Highlighted by Jan Jensen

Figure 1 from the paper (c) The authors 2021. Reproduced under the CC-BY license

ChemDyME couples metadynamics statistical  rate  theory to automatically find kinetically important reactions and then solve the time evolution of the species in the evolving network. 

There are three steps as shown in the figure above:

1. Molecular Dynamics (MD) where semiempirical metadynamics simulations are used to identify products that are likely to be connected to the reactant by low barriers. Specifically the boxed MD (BXD) metadynamics method where an extra term to the atomic velocities to steer the MD away from previously explored regions of configuration space. The MD stops when changes in atomic connectivity is detected.

2. Optimisation and Refinement (OR) where the products structures are optimised an the TSs to the reactant are located at a higher level of theory. The initial guess for the TS geometry is the first structure in the trajectory where the atomic connectivity changes. If that approach fails a spline-based reaction path method is used. The TSs are verified by IRCs.

3. Master Equation (ME) where the set of coupled kinetic equations are solved numerically. As the reaction network grows this can become computationally demanding, which is a problem when it is done on-the-fly. The authors therefore employ the Boxed Molecular Kinetics approach to speed things up.

These steps are then repeated using the kinetically most accessible product (identified by the ME step) as the reactant. The entire procedure is then repeated until a desired maximum reaction time is reached.

The authors test the procedure on two well studies systems and show that the procedure indeed identifies the most important reactions in the reaction network.

Disclaimer: My group has developed a similar approach for the first two steps.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Tuesday, March 30, 2021

Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design

Brian Hie, Bryan D. Bryson, Bonnie Berger (2021)
Highlighted by Jan Jensen

Part of the TOC figure. (c) The authors. Reproduced under the CC-BY-NC-ND license

It is rare to see successful ML studies with training sets of 72 molecules, but this is one such study. 

The data set is 72 compounds with measured binding affinities to 442 different kinases, i.e. 32K datapoints. The Kds span a range of 0.1 nm to 10 μm. This data is used to train several ML models, some of which include uncertainty estimation and some which do not. The main finding is that for points with low uncertainties the ML model is better at separating active from inactive compounds. Interestingly, compounds with low uncertainty have extreme Kds.

The models (retrained on the whole dataset) is then used to screen a set of 10,833 purchasable compounds. The top five candidates for each model where purchased and checked against four different kinases (i.e. 20 ligand-kinase pairs per model). For the uncertainty based ML models the top candidates are molecules with both low Kd and low uncertainty, while for the other models the decision was made solely based on Kd.

None of the molecules picked solely based on Kd showed at Kd less than 10 μm, whereas 18, 10, and 2 ligand-kinase pairs had Kds lower than 10 μm, 100 nm, and 1 nm, respectively. 

Some ML details: The molecules where featurised using a graph convolutional junction tree approach (JTNN-VAE), which was found to work better han fingerprints (data not shown). The uncertainty is predicted using four different approaches: GP regression (GP), GP regression of the MLP error (MLP+GP), Bayesian multilayer perceptron (BMLP) and an ensemble of MLPs that each emits a Gaussian distribution (GMLPE). Nigam et al. recently published a very nice overview of such methods.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Sunday, February 28, 2021

Uncertainty Quantification Using Neural Networks for Molecular Property Prediction

Lior Hirschfeld, Kyle Swanson, Kevin Yang, Regina Barzilay, and Connor W. Coley 2020
Highlighted by Jan Jensen

Figure 3 from the paper. (c) American Chemical Society 2020

Given the blackbox nature of ML models it is very important to have some measure of how much to trust their predictions. There are many ways to do this paper shows "none of the methods we tested is unequivocally superior to all others, and none produces a particularly reliable ranking of errors across multiple data sets."

This conclusion is neatly summarised in the figure shown above for 5 common datasets, 2 different ML methods, and 4 different methods for uncertainty quantification. For each combination of these the plot shows the RMSE for for the 100, 50, 25, 10, and 5% of the test set on which the uncertainty quantification method calculated the lowest uncertainty for the hold-out set.

Generally, the RMSE drops as expected but the drops are in many cases decidedly modest past 50% and it can even increase in some cases. In most cases there is very little difference between the different uncertainty quantification methods, but sometimes there is and it's hard to predict when.

One thing that struck me when reading this paper is that many studies who include uncertainty quantification, e.g. using the ensemble approach, often just take it for granted that it works and don't present tests like this.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Saturday, January 30, 2021

Accelerating High-Throughput Virtual Screening Through Molecular Pool-Based Active Learning

David E. Graff, Eugene I. Shakhnovich, and Connor W. Coley (2020)
Highlighted by Jan Jensen

Figure 1 and part of Figure 2 from the paper. (c) The authors 2021. 

This paper shows how to find the highest scoring molecules in a very large library of molecules by scoring only a very small percentage of the library. The focus of the paper is docking scores but it can in principle to be used for any molecular property.

The general approach is simple: 

1. Start by picking a random sample of the library (say 100 molecules out of a library of 10.000 molecules) and evaluate their scores. 

2. Use these 100 points to train a machine-learning (ML) model to predict the scores.

3. Screen all 10,000 molecules using the ML model. The assumption is that training/using the ML model is much cheaper than evaluating the score.

4. Select the 100 best molecules according to the ML model, compute the scores, and use them to retrain the ML model.

5. Repeat steps 3 and 4.

The best molecules could be the best-scoring molecules (this a known as "greedy" optimisation). However, if the uncertainty of the ML prediction for each molecule can be quantified, there are several other options for what best is (use of these approaches are referred to as Bayesian optimisation).  The study investigates four selection functions involving standard deviations but finds the greedy approach works best.

The approach is tested on three different datasets with known docking scores of varying sizes (10K, 50K, 2M, and 99M). The study tests three different machine learning models: RF and NN using fingerprints as well as a graph convolutional model (which works best) and various choices batch sizes. 

In the case of the 99M dataset more than half of the top-50,000 scoring molecules can be found by docking only 600K molecules using this approach. 

However, let's turn that last sentence around: if you're developing an ML model to find high-scoring molecules your training set size needs to be 600K. Furthermore, the study shows that if you just pick 500K random molecules for your training set, your ML model won't identify any of the top-50,0000 molecules. You have to build this very large training set in this iterative fashion to get an ML model that can reliably identify the top-scoring molecules.