## Friday, January 28, 2022

### Machine learning potentials always extrapolate, it does not matter.

The Convex Hull (blue line) is the smallest convex region that encloses the blue points; equivalently, its boundary is the shortest closed curve around them.

ML models are generally thought to only interpolate, but this paper suggests that this is not the case. At first sight this seems counterintuitive, but on reflection it may not be so strange at all.

First of all, the authors define an extrapolation as a prediction for a point (red point) outside the Convex Hull (blue line) defined by the training set points (blue points). They perform this analysis for three train/test sets related to solid-state chemistry and show that between 80% and 100% of the test set data points lie outside the Convex Hull defined by the training set data points, yet ML models trained on the training set perform satisfactorily on the test set (hence the title).
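
Hull membership in a high-dimensional representation space can be tested without constructing the hull explicitly, by checking whether the point can be written as a convex combination of the training points. Here is a minimal sketch using a feasibility linear program (my own illustration, not necessarily the authors' implementation):

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, train_points):
    """Return True if x lies inside the convex hull of train_points.

    Feasibility LP: find weights lam >= 0 with sum(lam) = 1 and
    train_points.T @ lam = x. If such weights exist, x is a convex
    combination of the training points, i.e. an interpolation.
    """
    n = train_points.shape[0]
    A_eq = np.vstack([train_points.T, np.ones(n)])
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.success
```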

While this might seem counterintuitive at first, is it really so strange that a model trained on the blue points performs better for the red point than for the green point? The red point is closer to the blue points and there is really only extrapolation in the x direction.

The representation vectors used in this study all have at least 100 dimensions, and a point counts as an extrapolation if it lies outside the Convex Hull in just a single one of these dimensions. Using PCA, the authors show that in some cases extrapolation occurs for all test points when considering only the 10 most important dimensions, while 20 dimensions are needed for truly accurate results. However, in most cases reasonable accuracy can be obtained with 4 dimensions, where more than 90% of the test set is contained in the Convex Hull of the training set. So IMO the picture is not as clear cut as the title suggests.
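
In sklearn terms the dimensionality-reduction step might look something like this, reusing the `in_convex_hull` sketch from above (`X_train` and `x_test` are hypothetical arrays of representation vectors):

```python
from sklearn.decomposition import PCA

# Project the >=100-dimensional representations onto the first k principal
# components before testing hull membership (k = 4, 10, 20, ...).
pca = PCA(n_components=4).fit(X_train)
inside = in_convex_hull(pca.transform(x_test.reshape(1, -1))[0],
                        pca.transform(X_train))
```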

The authors show that the best predictor of accuracy is the density of training set points in the region of the test set molecule.
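
I can't reproduce the paper's exact density measure from memory, but a simple stand-in is the mean distance to the k nearest training points (again with hypothetical variable names):

```python
from sklearn.neighbors import NearestNeighbors

# Mean distance to the 5 nearest training points; a smaller value means a
# denser neighbourhood and, per the paper, a more trustworthy prediction.
nn = NearestNeighbors(n_neighbors=5).fit(X_train)
dist, _ = nn.kneighbors(x_test.reshape(1, -1))
density_proxy = 1.0 / dist.mean()
```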

## Thursday, December 30, 2021

### Pushing the frontiers of density functionals by solving the fractional electron problem

Part of Figure 1 from the paper. (c) 2021 The authors

This paper presents a new ML exchange-correlation functional that gives improved results compared to state-of-the-art functionals, especially for barriers. Most importantly, it demonstrates the importance of including fractional charge and spin in the training set when developing new functionals. Fractional-charge systems help reduce the self-interaction error, while fractional-spin systems supply information about static correlation. For example, the current functional gives reasonable bond dissociation curves, and future functionals of this kind may work considerably better on transition metal-containing systems with significant multi-reference character.
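
As a reminder, the exact conditions being encoded here are, as I understand it, the piecewise-linearity condition for fractional charge (Perdew et al.) and the constancy condition for fractional spin (Cohen, Mori-Sánchez, and Yang):

$$E(N+\delta) = (1-\delta)\,E(N) + \delta\,E(N+1), \qquad 0 \le \delta \le 1$$

$$E\!\left[\tfrac{1}{2}\rho_\uparrow + \tfrac{1}{2}\rho_\downarrow\right] = E[\rho_\uparrow] = E[\rho_\downarrow]$$

Approximate functionals that violate the first condition suffer from self-interaction (delocalisation) error, while those that violate the second fail for static correlation, e.g. stretched H$_2$.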

## Friday, November 26, 2021

### Quantum harmonic free energies for biomolecules and nanomaterials

Figure 2 from the paper. (c) The authors. Reproduced under the CC-BY license.

This paper describes a method by which the harmonic vibrational free energy contributions can be accurately approximated at roughly 10% of the cost of a conventional Hessian calculation.

The equations for the vibrational free energy contributions are recast in terms of the trace of a matrix function (remember that the trace of a matrix is equal to the sum of its eigenvalues). This removes the need for matrix diagonalisation, which is costly for large matrices. The authors then use a stochastic estimator of the trace, where the trace is rewritten in terms of displacements along $n$ random vectors. The accuracy of free energy differences can be further increased by using the same random vectors for both reactants and products.
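
The estimator is, as far as I can tell, of the Hutchinson type; here is a minimal sketch of the core idea (in the paper the matrix-vector products with the matrix function of the Hessian are themselves approximated from gradient evaluations at displaced geometries):

```python
import numpy as np

def hutchinson_trace(matvec, dim, n_samples=100, seed=None):
    """Estimate tr(A) as the average of z^T A z over random Rademacher
    vectors z, since E[z^T A z] = tr(A). A is only accessed through
    matvec, so it never has to be built or diagonalised explicitly."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # random +/-1 probe vector
        total += z @ matvec(z)
    return total / n_samples
```

Reusing the same random vectors for reactants and products is the classic correlated-sampling trick: the two individual estimates fluctuate, but much of the noise cancels in their difference.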

The accuracy of this approximation increases with the number of displacement vectors (and, hence, gradient evaluations) used. The authors tested it on several large systems, such as protein-ligand binding, and found that sub-kcal/mol accuracy can be obtained at about 10% of the cost of a conventional Hessian calculation plus diagonalisation.

It is now quite common to modify the entropy contributions from small (<100 cm$^{-1}$) frequencies, since the harmonic entropy diverges as the frequency approaches zero. I am not sure whether this is possible in the current approach, since individual frequencies are not computed explicitly.

The code and data are "available upon reasonable request" 😕

## Sunday, October 31, 2021

### Explaining and avoiding failure modes in goal-directed generation

Figure 1 from the paper. (c) The authors 2021. Reproduced under the CC-BY-NC license.

When you use search algorithms to optimise molecular properties predicted by ML models, there is always the danger of wandering into regions of chemical space where the ML model no longer makes accurate predictions. Last year Renz et al. tried to quantify this phenomenon and basically concluded that it is a big problem. The current paper does not agree.

Renz et al. developed three different RF models, as shown in the figure above, for classifying bioactivity. In principle, all three models should give the same predictions. A search algorithm is then used to find molecules for which one of the models (the optimisation model) predicts high scores, and these molecules are rescored using the other two control models. As the search proceeds, these scores begin to diverge, leading Renz et al. to conclude that the search algorithms exploit biases particular to the optimisation model and do not, in fact, find molecules that are truly active.

I almost highlighted this paper when it first appeared but was concerned by the relatively small sizes of the data sets used: 842, 667, and 842 molecules with 40, 140, and 59 active molecules, respectively. The paper by Langevin et al. suggests that this concern was justified.

First they created a holdout set of 10% of the molecules and repeated the procedure of Renz et al. on the remaining 90%. They showed that the differences in performance for the holdout set are the same as those observed by Renz et al., i.e. these differences stem from the models/training sets themselves and not necessarily from the search algorithms.

To show that the divergence, in fact, has nothing to do with the search algorithms, they then demonstrated that the difference in model performance can be significantly reduced using two different approaches. One is to split the two data sets such that they are as similar as possible. Another is to use a better-regularised RF model: 200 trees and at least 3 samples per leaf, instead of the 100 trees and 1 sample per leaf originally used by Renz et al. (see the sketch below).
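
In sklearn terms (assuming `RandomForestClassifier` is the model class in question), the two settings would look like this:

```python
from sklearn.ensemble import RandomForestClassifier

# Settings reportedly used by Renz et al.
rf_original = RandomForestClassifier(n_estimators=100, min_samples_leaf=1)

# Better-regularised settings suggested by Langevin et al.; requiring at
# least 3 samples per leaf smooths the decision boundaries and leaves
# fewer model-specific biases for the search algorithm to exploit.
rf_improved = RandomForestClassifier(n_estimators=200, min_samples_leaf=3)
```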

## Thursday, September 30, 2021

### Benchmarking molecular feature attribution methods with activity cliffs

Figure 1 from the paper. (c) The authors 2021. Reproduced under the CC-BY-NC license.

This is a follow-up of sorts to a previous post on trying to explain ML models using feature attribution. While the idea is very attractive, it is not obvious how best to benchmark such methods for chemical applications, since it's rarely clear what the right answer is. Most benchmarking so far has therefore been done on toy problems that basically amount to substructure identification.

This paper suggests that a solution is to try to identify activity cliffs in protein-ligand binding data, i.e. small structural changes that lead to large changes in binding affinity. The idea is that the atom attribution algorithms should identify these structural differences, as illustrated in the figure above. The paper goes on to test this premise for an impressive number of feature attribution algorithms on an impressive number of datasets.

The main conclusion is that none of the methods work unless the molecule pairs are included in the training set! Thus the authors

> "... discourage the overall use of modern feature attribution methods in prospective lead optimization applications, and particularly those that work in combination with message-passing neural networks."

However, this paper by Cruz-Monteagudo et al. argues that ML models in general should fail to predict activity cliffs. One way to view activity cliffs is as exceptions that test the rules, and ML models are supposed to learn the rules. The only way to predict the exceptions is to memorise them (i.e. overfit).

On the other hand, the examples shown above are, in my opinion, pretty drastic changes in structure that may not fit the conventional definition of activity cliffs and could conceivably be explained with learned rules. Clearly, the feature attribution methods tested by Jiménez-Luna et al. are not up to the task. Or perhaps such methods require a larger training set to work. One key question the authors didn't discuss is whether the ML models also fail to predict the change in binding affinity, in addition to failing to correctly attribute the change.

## Saturday, August 28, 2021

### Evidential Deep Learning for Guided Molecular Property Prediction and Discovery

TOC figure from the paper. (c) 2021 The authors. Reproduced under the CC-BY-NC-ND license.

While knowing the uncertainty of an ML-predicted value is valuable, really only the Gaussian process method delivers a rigorous estimate of it. If you want to use other ML methods, such as NNs, you have to use more ad hoc methods like ensembles or dropout, and these only report on the uncertainty in the model parameters (if you retrain your model you'll get slightly different answers), not on the uncertainty in the data (if you remeasure your data you'll get slightly different answers).

This paper presents a way to quantify both types of uncertainty for NN models (evidential learning). To apply it, you change your output layer to output four values instead of one and use a special loss function. One of the four outputs is your prediction, while the remaining three are plugged into a formula that gives you the uncertainty.
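
If I read the underlying deep evidential regression framework (Amini et al.) correctly, the four outputs are the parameters $(\gamma, \nu, \alpha, \beta)$ of a Normal-Inverse-Gamma distribution, with

$$\text{prediction} = \gamma, \qquad \underbrace{\mathbb{E}[\sigma^2] = \frac{\beta}{\alpha-1}}_{\text{data (aleatoric) uncertainty}}, \qquad \underbrace{\operatorname{Var}[\mu] = \frac{\beta}{\nu(\alpha-1)}}_{\text{model (epistemic) uncertainty}}$$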

The paper compares this approach to the ensemble and dropout methods and shows that the evidential learning approach usually works better, i.e. there's a better correlation between the predicted uncertainty and the deviation from the ground truth. Note that it's a little tricky to quantify this correlation: if the error is random (which is the basic assumption behind all this), then the error can, by chance, be very small for a point with large uncertainty; it's just less likely than for a point with low uncertainty.
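
One common way to quantify it, for what it's worth, is the rank correlation between predicted uncertainty and absolute error (`sigma_pred`, `y_pred`, and `y_true` are hypothetical arrays):

```python
import numpy as np
from scipy.stats import spearmanr

# Rank correlation between predicted uncertainty and absolute error.
# Even a perfectly calibrated model won't give rho = 1, because
# high-uncertainty points can still have small errors by chance.
rho, _ = spearmanr(sigma_pred, np.abs(y_pred - y_true))
```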

The code is available here (note that the link in the paper is wrong).

## Thursday, July 29, 2021

### Interactions between large molecules pose a puzzle for reference quantum mechanical methods

Figure 1 from the paper. (c) The authors. Reproduced under the CC-BY license.

CCSD(T) and DMC are two gold-standard methods that should give the same results, and usually do. However, this study finds three systems for which the disagreement is unexpectedly large, up to 7.6 kcal/mol. It's not clear why, and it's not clear which method is correct. Since we use these methods to develop and benchmark other methods, this is a real problem.

Now, there could be many reasons for the discrepancy, and the authors have considered all of them and discounted most of them. The remaining ones, such as higher-order terms in the CC expansion, are practically impossible to check at present. It is also hard to believe that they would make such large contributions to the interaction energy of two closed-shell systems.

But there must be some reason for the discrepancy and when it is found we will most likely have learned something new about these methods.