Friday, November 26, 2021

Quantum harmonic free energies for biomolecules and nanomaterials

Alec F. White,  Chenghan Li, and Garnet Kin-Lic Chan (2021)
Highlighted by Jan Jensen

Figure 2 from the paper. (c) The authors. Reproduced under the CC-BY license.

This paper describes a method by which the harmonic vibrational free energy contributions can be accurately approximated at roughly 10% of the cost of a conventional Hessian calculation.

The equations for the vibrational free energy contributions are recast in terms of the trace of a matrix function (remember that the trace of a matrix is equal to the sum of its eigenvalues). This removes the need for matrix diagonalisation, which is costly for large matrices. Then they use a stochastic estimator of the trace where the trace is rewritten in terms of displacements along $n$ random vectors. The accuracy of free energy differences can be further increased by using the same random vectors for both reactants and products.

The accuracy of this approximation increases with the number of displacement vectors (and, hence, gradient evaluations) used. The authors tested in one several large systems, such as protein-ligand binding,  and found that sub-kcal/mol accuracy can be obtained at about 10% of the cost of a conventional Hessian calculation plus diagonalisation.

It is now quite common to scale the entropy contributions from small (<100 cm$^{-1}$) frequencies to get better numerical stability. I am not sure whether this is possible in the current approach since individual frequencies are not computed explicitly.

The code and data is "available upon reasonable request" 😕

This work is licensed under a Creative Commons Attribution 4.0 International License.

Sunday, October 31, 2021

Explaining and avoiding failures modes in goal-directed generation

Maxime Langevin, Rodolphe Vuilleumier, and Marc Bianciotto (2021) 
Highlighted by Jan Jensen

Figure 1 from the paper. (c) the authors 2021. Reproduced in the CC-BY-NC license

When you use search algorithms to optimise molecular properties predicted by ML-models, there is always the danger of going into regions of chemical space where the ML model no longer makes accurate predictions. Last year Renz et al. tried to quantify this phenomenon and basically concluded that it is a big problem. The current paper does not agree.

Renz et al. develop three different RF models as shown in the figure above for classifying bioactivity. In principle, all three models should give the same predictions. A search algorithm is then used to find molecules for which one of the models (the optimisation model) predict high scores, and these molecules are rescored using the other two control models. As the search proceed, these scores begin to diverge, leading Renz et al. to conclude that the search algorithms exploit biases particular to the optimisation model and does not, in fact, predict molecules that are truly active.

I almost highlighted this paper when it first appeared but was concerned by the relatively small sizes of the data sets used: 842, 667, and 842 molecules with 40, 140, and 59 active molecules, respectively. The paper by Langevin et al. suggests that this concern was justified.  

First they created a holdout set of 10% of the molecules, and repeated the procedure by Renz et al. on the remaining 90%. They showed that the difference in performance for the holdout set are the same as those observed by Renz et al, i.e. these differences have to do with the models/training sets themselves and not necessarily with the search algorithms. 

To show that it, in fact, has nothing to do with the search algorithms, they then demonstrated that the difference in model performance can be significantly reduced using two different approaches. One is to split the two data sets such that they are as similar as possible. Another is to use a better RF model: 200 trees and at least 3 samples per leaf, instead of 100 trees and 1 sample per leaf originally used by Renz et al.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Thursday, September 30, 2021

Benchmarking molecular feature attribution methods with activity cliffs

José Jiménez-Luna, Miha Skalic, and Nils Weskamp (2021)
Highlighted by Jan Jensen

Figure 1 from the paper. (c) The authors 2021. Reproduced under the CC-BY-NC license.

This is a follow-up of sorts on a previous post on trying to explain ML models using feature attribution.  While the idea is very attractive it is not obvious how to best benchmark such methods for chemical applications since it's rarely clear what the right answer is. Most benchmarking so far has therefore been done on toy problems that basically amount to substructure identification. 

This paper suggests that a solution to this is trying to identify activity cliffs in protein-ligand binding data, i.e. small structural changes that lead to large changes in binding affinity. The idea is that the atom attribution algorithms should identify these structural differences as illustrated in the figure above. The paper goes on to test this premise for an impressive number of feature attribution algorithms on an impressive number of datasets. 

The main conclusion is that none of the methods work unless the molecule pairs are included in the training set! Thus the authors ...
"... discourage the overall use of modern feature attribution methods in prospective lead optimization applications, and particularly those that work in combination with message-passing neural networks."
However, this paper by Cruz-Monteagudo et al. argues that ML models in general should fail to predict activity cliffs. One way to view activity cliffs is as exceptions that test the rules and ML models are supposed to learn the rules. The only way to predict the exceptions is to memorise them (i.e. overfit). 

On the other hand the examples shown above are, in my opinion, pretty drastic changes in structure that may not fit the conventional definition of activity cliffs and could conceivably be explained with learned rules. Clearly the feature attribution methods tested by JimĂ©nez-Luna et al. are not up to the task. Or perhaps such methods require a larger training set to work. One key questions the authors didn't discuss is whether the ML models also fail to predict the change in binding affinity in addition to failing to correctly attribute the change.  

This work is licensed under a Creative Commons Attribution 4.0 International License.

Saturday, August 28, 2021

Evidential Deep Learning for Guided Molecular Property Prediction and Discovery

Ava P. Soleimany, Alexander Amini, Samuel Goldman, Daniela Rus, Sangeeta N. Bhatia, and Connor W. Coley 2021
Highlighted by Jan Jensen

TOC figure from the paper. (c) 2021 The authors. Reproduced under the CC BY NC ND license

While knowing the uncertainty of a ML-predicted value is valuable, it is really only the Gaussian process method that delivers a rigorous estimate of this. If you want to use other ML methods such as NN you have to use more ad hoc methods like the ensemble or dropout methods and these only report of the uncertainty in the model parameters (if you retrain your model you'll get slightly different answers) and not on the uncertainty in the data (if you remeasure your data you'll get slightly different answers).

This paper presents a way to quantify both types of uncertainty for NN models (evidential learning). To apply it you change your output layer to output 4 values instead of 1 and you use a special loss function. One of the four output values is your prediction while the remaining 3 output values are plugged into a formula that gives you the uncertainty.

The paper compares this approach to the ensemble and dropout methods and shows that the evidential learning approach usually works better, i.e. there's a better correlation between the predicted uncertainty and the deviation from the ground truth. Note that it's a little tricky to quantify this correlation: if the error is random (which is the basic assumption behind all this) then the error can, by chance, be very small for a point with large uncertainty; it's just less likely compared to a point with low uncertainty. 

The code is available here (note the link in the paper is wrong)

This work is licensed under a Creative Commons Attribution 4.0 International License.

Thursday, July 29, 2021

Interactions between large molecules pose a puzzle for reference quantum mechanical methods

Yasmine S. Al-Hamdani, PĂ©ter R. Nagy, Andrea Zen, Dennis Barton, MihĂĄly KĂĄllay, Jan Gerit Brandenburg and  Alexandre Tkatchenko (2021)
Highlighted by Jan Jensen

Figure 1 from the paper (c) The authors. Reproduced under the CC-BY licence

CCSD(T) and DMC are two gold-standard methods that should give the same results, and usually do. However, this study finds three systems for which the disagreement is unexpected large, up to 7.6 kcal/mol. It's not clear why and and it's not clear which method is correct. Since we use these methods to develop and benchmark other methods this is a real problem. 

Now, there could be many reasons for the discrepancy and the authors have considered all of them and discounted most of them. The remaining reasons, such as higher order terms in the CC expansion, are practically impossible to check at presents. It also hard to believe that they would make such a large contributions to the interaction energy of two closed shell systems. 

But there must be some reason for the discrepancy and when it is found we will most likely have learned something new about these methods.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Monday, June 28, 2021

Bayesian optimization of nanoporous materials

Aryan Deshwal, Cory M. Simon, and Janardhan Rao Doppa (2021)
Highlighted by Jan Jensen

Figure 5 from the paper. (c) the authors. Reproduced under the CC-BY license.

This is another example of searching chemical space for systems with extreme property values by continuously updating a surrogate ML model of the property. I wrote about another such example, by Graff et al., here, but the main new thing here (IMO) is the low number of property evaluations needed to train the surrogate model.

The property of interest is the methane deliverable capacity (y) of covalent organic frameworks (COFs) which has been predicted by expensive MD calculations for ca 70,000 COFs. Ten randomly selected datapoints are used to train a Gaussian Process (GP) surrogate model. Bayesian optimisation (BO) is then used to identify the COF that is most likely to improve the surrogate model (based on the predicted y-value and the uncertainty of of the prediction), which is re-evaluated using MD. The MD value then added to the training set and the process is repeated for up to 500 steps. 

Already after 100 steps (110 MD evaluations including the initial training set), the best COF is identified as are 25% of the top-100 COFs, which is quite impressive. For comparison, the smallest training set in the previous study by Graff et al. is 100 and they need a training set of 300 to get to 25%. On the other hand, Graff et al. get up to ca 70% of the top 100 with a training set of 500, compared to ca 50% in this study (but the chemical space of Graff et al. is only 10,000 so it's a bit hard to compare).

The main lesson (IMO) is that's it's worth trying to start with very small training sets for these approaches.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Friday, May 28, 2021

Using attribution to decode binding mechanism in neural network models for chemistry

Kevin McCloskey, Ankur Talya, Federico Montia, Michael P. Brennera, and Lucy J. Colwella (2019)
Highlighted by Jan Jensen

Part of Figure 3. Red indicates atoms that make positive contributions to the predicted values.
Copyright (2019) National Academy of Sciences.

This paper shows that state-of-the-art ML models can easily be fooled even for relatively trivial classification problems. 

The authors generate several synthetic classification sets using simple rules, such as the presence of a phenyl group, and train both a graph-convolutional and message passing NN. Not surprisingly, the hold-out performance is near perfect with AUCs near 1.000. 

Then they use a technique called integrated gradients to compute atomic contributions to the predictions and check whether these contributions match the rules used to create the data sets. For example, if the ground truth rule is the presence of a benzene ring, then only benzene ring atoms should make significant positive contributions. For some ground truth rules, this is often not the case!

Figure 3A above shows a case where the ground truth rule is the presence of three groups: a phenyl, a primary amine, and an ether. While this model is correctly classified there are significant atomic contributions from some of the fused ring atoms. So either the atomic contributions are mis-assigned by the integrated gradients method or the prediction is correct for the wrong reasons. The authors argue that it is the latter because three atomic changes in and near the fused ring (Figure 3B) results in a molecule that the model mis-classifies.

The authors note:
It is dangerous to trust a model whose predictions one does not understand. A serious issue with neural networks is that, although a held-out test set may suggest that the model has learned to predict perfectly, there is no guarantee that the predictions are made for the right reason. Biases in the training set can easily cause errors in the model’s logic. The solution to this conundrum is to take the model seriously: Analyze it, ask it why it makes the predictions that it does, and avoid relying solely on aggregate accuracy metrics.
The integrated gradient (IG) method is interesting in and of itself, so a few more words on that:

JimĂ©nez-Luna et al. have since shown that the IG approach can be used to extract pharmacophores from models trained on experimental data sets. 

IG can only be applied to fully differentiable models such as NNs but Riniker and Landrum and Sheridan have developed fingerprint-based approaches that can be applied to any ML model but are theoretically more ad hoc. The Riniker-Landrum approach is available in RDKit while Jiménez-Luna et al. provide an implementation of the Sheridan approach, and also identify several examples where IG and the Sheridan approach gives different interpretations.