## Wednesday, April 27, 2022

### Pairwise Difference Regression: A Machine Learning Meta-algorithm for Improved Prediction and Uncertainty Quantification in Chemical Search

TOC picture from the paper (c) 2021 ACS

This paper tries to solve two problems at once: data augmentation for small data sets and a method-independent uncertainty quantification (UQ).

Data augmentation is quite common in areas like image classification where images can be perturbed (e.g. rotated by a few degrees) and still be recognisable. However, this is difficult in chemistry where small perturbations in structure can have a non-negligible effect on properties. For text-based molecular representations one can use non-canonical SMILES for augmentation, but there is no generally applicable method.

Similarly, most UQ methods are specific to the machine learning model type, with the exception of ensemble methods, which require the training and deployment of several models and can therefore be expensive.

The paper offers a simple solution to both. The method is trained to reproduce the ground truth difference for all $n^2$ molecule pairs, thereby increasing the training set size significantly. When making a prediction for a new molecule, the model predicts the differences relative to all training set molecules, with the standard deviation of the resulting estimates serving as a measure of prediction uncertainty. Pretty neat idea and easy to implement! The main change is to construct molecular representations for the molecule pairs, but the authors outline one easy-to-implement approach.
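To make the mechanics concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes a pair representation made by concatenating the two feature vectors and their difference, and it swaps in a plain linear least-squares model for whatever ML model one would really use. The function names and the toy data are hypothetical.

```python
import numpy as np

def pair_features(xa, xb):
    """Pair representation [xa, xb, xa - xb]; one simple choice, assumed here."""
    return np.concatenate([xa, xb, xa - xb], axis=-1)

def fit_pairwise(X, y):
    """Fit a linear least-squares model to all n^2 pairwise differences y_i - y_j."""
    n = len(X)
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    i, j = i.ravel(), j.ravel()
    P = pair_features(X[i], X[j])   # shape (n*n, 3*d)
    d = y[i] - y[j]                 # difference targets
    w, *_ = np.linalg.lstsq(P, d, rcond=None)
    return w

def predict_pairwise(w, X_train, y_train, x_new):
    """Predict y_new from every training molecule; return mean and spread."""
    X_new = np.broadcast_to(x_new, X_train.shape)
    diffs = pair_features(X_new, X_train) @ w   # predicted y_new - y_i
    estimates = y_train + diffs                 # one estimate per training point
    return estimates.mean(), estimates.std()

# Hypothetical noise-free toy data with a linear ground truth.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_train = X_train @ true_w

w = fit_pairwise(X_train, y_train)
x_new = np.array([0.2, -0.1, 0.4])
mean, std = predict_pairwise(w, X_train, y_train, x_new)
```

Because the toy target is exactly linear and noise-free, all 20 estimates agree and the spread is essentially zero; with real data and a non-linear model the estimates would scatter, and that scatter is what gets used as the uncertainty.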

Depending on the task and training set size the data augmentation decreases the MAE by 3-40%. UQ quality is notoriously difficult to quantify, but the method appears to give uncertainties similar to those obtained by a random forest method.

## Tuesday, March 29, 2022

### Machine Learning May Sometimes Simply Capture Literature Popularity Trends: A Case Study of Heterocyclic Suzuki−Miyaura Coupling

What do you infer from this quote from the paper (emphasis added)?

Another important problem, tackled herein, deals with the prediction of optimal conditions for a particular reaction in which there are generally multiple viable choices of solvents or reagents. Several works[21−24] have attempted to use ML for the prediction of reaction conditions, and the overall message they seem to convey is that ML can, in fact, offer accurate predictions provided adequate numbers of literature examples on which to build the models (but see also critical ref 6). However, here, we demonstrate with a case study that this may have been an overoptimistic interpretation, and that even with large quantities of carefully curated literature data, ML approaches may not perform considerably better than estimates based on the popularity of reaction conditions reported in the literature. In other words, these ML models do not provide significantly more insights than just suggesting the most popular conditions which could be obtained by simple statistics over literature examples[25,26] and no “machine intelligence.”
I can tell you what I inferred. References 21-24 used ML models to predict optimal reaction conditions, but failed to check whether they "provide significantly more insights than just suggesting the most popular conditions". I also inferred that the results from this study suggest that, had the authors checked, they would have found that not to be the case.

However, the four references refer to two papers (21 and 23) by Doyle and co-workers on the prediction of reaction yields (not conditions) and two papers, one by Coley and co-workers and one by Reisman and co-workers (22 and 24, respectively), on the prediction of reaction conditions with comparison to popularity baselines.

The paper looks at the prediction of solvent and base (and not catalysts and temperature as implied by the TOC graphic above) for ca 10,000 Suzuki coupling reactions from Reaxys. The best top-1 ML accuracies for base and solvent are 80.6% and 51.7%, compared to popularity baseline values of 76.8% and 29.8%. The authors use the term "significantly" (and related terms) without ever quantifying what they deem significant, but to me the ML solvent predictions seem significantly better than the popularity baseline.

Furthermore, as Coley and co-workers point out, the true metric is the accuracy of the combined prediction, e.g. correct solvent and base. For example, for the combined prediction of correct catalyst, solvent, and reagent, Coley and co-workers found an accuracy of 57.3% compared to a popularity baseline of only 5.7%. However, I am not even certain whether Grzybowski and co-workers would deem that a significant improvement.
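For readers unfamiliar with the term, a popularity baseline simply predicts the most frequent choice in the training data every time. The sketch below shows one way to compute its top-1 accuracy per component and for the combined (base, solvent) prediction. The function name and the reaction records are hypothetical toy data, and taking "most popular combination" as the joint baseline is my assumption, not necessarily what either paper did.

```python
from collections import Counter

def popularity_top1_accuracy(train_labels, test_labels):
    """Always predict the most frequent training label; return test accuracy."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == most_common_label for label in test_labels)
    return hits / len(test_labels)

# Hypothetical (base, solvent) records, for illustration only.
train = [
    ("K2CO3", "dioxane"), ("K2CO3", "dioxane"), ("K2CO3", "DMF"),
    ("Na2CO3", "dioxane"), ("K2CO3", "toluene"),
]
test = [
    ("K2CO3", "dioxane"), ("K2CO3", "DMF"),
    ("Cs2CO3", "dioxane"), ("K2CO3", "THF"),
]

base_acc = popularity_top1_accuracy([b for b, _ in train], [b for b, _ in test])
solvent_acc = popularity_top1_accuracy([s for _, s in train], [s for _, s in test])
joint_acc = popularity_top1_accuracy(train, test)  # both must be right at once

print(base_acc, solvent_acc, joint_acc)  # 0.75 0.5 0.25
```

Even in this toy case the joint accuracy is lower than either single-component accuracy, which is why the combined prediction is the harder and more informative test.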

On a more constructive note, the topic of the paper does relate to an interesting fundamental question in ML on how to deal with imbalanced data, i.e. where there is a very popular single choice. One would perhaps naively suspect that this would be easier for a machine to learn, i.e. you just have to learn a few exceptions. But how do you typically learn exceptions? By memorising them, and we tend to employ many ML techniques to avoid just this.

## Monday, February 28, 2022

### Finding hits among billions of molecules

Assaf Alon, Jiankun Lyu, Joao M. Braz, Tia A. Tummino, Veronica Craik, Matthew J. O’Meara, Chase M. Webb, Dmytro S. Radchenko, Yurii S. Moroz, Xi-Ping Huang, Yongfeng Liu, Bryan L. Roth, John J. Irwin, Allan I. Basbaum, Brian K. Shoichet & Andrew C. Kruse. Structures of the σ2 receptor enable docking for bioactive ligand discovery (2021)

Arman A. Sadybekov, Anastasiia V. Sadybekov, Yongfeng Liu, Christos Iliopoulos-Tsoutsouvas, Xi-Ping Huang, Julie Pickett, Blake Houser, Nilkanth Patel, Ngan K. Tran, Fei Tong, Nikolai Zvonok, Manish K. Jain, Olena Savych, Dmytro S. Radchenko, Spyros P. Nikas, Nicos A. Petasis, Yurii S. Moroz, Bryan L. Roth, Alexandros Makriyannis & Vsevolod Katritch Synthon-based ligand discovery in virtual libraries of over 11 billion compounds (2021)

Highlighted by Jan Jensen

Figure 2a and b from Alon et al. (c) 2021 Nature

The recent developments in make-on-demand molecular libraries present an interesting methodological challenge to virtual screening. Not too long ago, such a library would have hundreds of millions and even 1 billion molecules and there was still a chance to dock a significant portion of these libraries. However, the sizes of the libraries have grown to well beyond 20 billion and show no sign of stopping. There is no way wholesale docking can keep up with this growth so new approaches are needed.

One computational approach that has kept up with the growth of make-on-demand libraries is similarity searching. It is still possible to search these enormous libraries for similar molecules in just a few minutes.
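As a reminder of what a similarity search boils down to, here is a brute-force sketch using the Tanimoto coefficient on fingerprints stored as sets of "on" bits. The choice of Tanimoto, the function names, and the tiny library are my assumptions for illustration; searching billions of molecules in minutes relies on specialised indexing that this sketch does not attempt.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def most_similar(query_fp, library, top_k=2):
    """Rank library molecules by similarity to the query (brute force)."""
    scored = [(tanimoto(query_fp, fp), name) for name, fp in library.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

# Hypothetical fingerprints; real ones would come from a cheminformatics toolkit.
library = {
    "mol_a": {1, 2, 3, 5},
    "mol_b": {1, 2, 3, 4, 6},
    "mol_c": {7, 8},
}
query = {1, 2, 3, 4}

hits = most_similar(query, library)
print(hits)  # [(0.8, 'mol_b'), (0.6, 'mol_a')]
```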

Alon et al. use this general idea to select and dock 490 million molecules with properties that are similar to known binders to the target. Based on the docking scores they prioritised 577 molecules, of which 484 were successfully made and 127 showed good activity against the target. 20,000 analogues of the four best candidates were then extracted from among 28 billion molecules in the Enamine REAL Space make-on-demand library, and docked. The 105 best candidates were made and tested, leading to further improvement in the measured affinities.

Sadybekov et al. essentially dock the individual building blocks used in the make-on-demand library and then combine the best-scoring fragments into about 1 million molecules for a second round of docking. Using this approach they identified 80 promising candidates, of which 60 could be synthesised. Of these 60 molecules, 21 proved active. 920 analogues of the three best candidates were then extracted from among 11 billion molecules in the Enamine REAL Space make-on-demand library, and docked. The 121 best candidates were made and tested, leading to further improvement in the measured affinities.

There are several take home messages here.

The percentage of active compounds against a particular target in a library is very small, so you don't get a lot of useful hits until you work with these enormous libraries.

Docking does help in identifying active compounds. Docking has a bad rep in certain circles and I have seen several people refer to docking programs as "random number generators", but studies like these show that this is not the case. Sure, if one expects an excellent, or even respectable, correlation coefficient between docking scores and binding affinities, one will be sorely disappointed. However, as these studies show, molecules with good docking scores have a much higher chance of being active than molecules with bad docking scores.

The success rate seems to be about 30-50% depending on the target. So if you are at the lower end and only able to make and test a handful of candidates (which is often the case for academic studies), there's a reasonable chance you won't find any actives and conclude that docking is useless. It's only when you are able to make and test dozens of molecules that you see that docking is working for you. The make-on-demand libraries now make such numbers feasible for academics.
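A back-of-the-envelope check of that argument: if each tested molecule is active with probability $p$, independently of the others (a simplifying assumption of mine), the chance of finding no actives among $n$ molecules is $(1-p)^n$. The function name below is hypothetical.

```python
def prob_no_actives(hit_rate, n_tested):
    """Chance that none of n_tested molecules is active, assuming independence."""
    return (1.0 - hit_rate) ** n_tested

handful = prob_no_actives(0.3, 5)    # lower-end hit rate, five molecules
dozens = prob_no_actives(0.3, 24)    # same hit rate, two dozen molecules

print(f"{handful:.3f}")  # 0.168
print(dozens < 0.001)    # True
```

So at a 30% hit rate, testing five molecules leaves roughly a one-in-six chance of coming up empty, while testing two dozen makes that outcome very unlikely.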

Finally, several of the co-authors on the two papers I highlight are Ukrainian and are, along with their families and friends, likely in grave danger right now as their country is being attacked by Putin and his ilk.

## Friday, January 28, 2022

### Machine learning potentials always extrapolate, it does not matter.

The Convex Hull (blue line) encloses the blue points. It maximises the area while minimising the circumference.

ML models are generally thought to only interpolate, but this paper suggests that this is not the case. On first sight this seems counterintuitive but on some reflection this may not be so strange at all.

First of all, the authors define an extrapolation as a prediction for a point outside (red point) the Convex Hull (blue line) defined by the training set points (blue points). They perform this analysis for three train/test sets related to solid state chemistry and show that between 80% and 100% of the test set data points lie outside the Convex Hull defined by the training set data points, but ML models trained on the training set perform satisfactorily for the test set (hence the title).

While this might seem counterintuitive at first, is it really so strange that a model trained on the blue points performs better for the red point than the green point? The red point is closer to the blue points and there is really only extrapolation in the x direction.
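The definition is easy to play with in two dimensions. The sketch below builds the Convex Hull with the standard monotone chain algorithm and labels a point as an interpolation if it lies inside or on the hull. The points and function names are hypothetical, and this is only a 2-D illustration; for representations with 100+ dimensions one would need a different membership test (e.g. a linear programming feasibility check).

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def is_interpolation(point, hull):
    """True if point is inside or on a non-degenerate hull (3+ vertices)."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))

# Hypothetical training points and two test points.
train = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
hull = convex_hull(train)

inside = is_interpolation((1.0, 0.5), hull)    # within the hull
outside = is_interpolation((2.5, 1.0), hull)   # outside in the x direction only

print(inside, outside)  # True False
```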

The representation vectors used in this study all have at least 100 dimensions and a point is said to correspond to an extrapolation if it lies outside the Convex Hull in only one of these dimensions. By using PCA the authors show that in some cases extrapolation occurs for all test points when considering only the 10 most important dimensions, while 20 dimensions are needed for truly accurate results. However, for most cases reasonable accuracy can be obtained with 4 dimensions, where more than 90% of the test set is contained in the Convex Hull of the training set. So IMO the picture is not as clear cut as the title suggests.

The authors show that the best predictor of accuracy is the density of training set points in the region of the test set molecule.

## Thursday, December 30, 2021

### Pushing the frontiers of density functionals by solving the fractional electron problem

Part of Figure 1 from the paper. (c) 2021 The authors

This paper presents a new ML exchange-correlation potential that gives improved results compared to state-of-the-art functionals, especially for barriers. Most importantly, it demonstrates the importance of including fractional charge and spin in the training set when developing new functionals. Fractional-charge systems help reduce the self-interaction error while fractional-spin systems supply information about static correlation. For example, the current functional gives reasonable bond dissociation curves and future functionals of this kind may work considerably better on transition metal-containing systems with significant multi-reference character.

## Friday, November 26, 2021

### Quantum harmonic free energies for biomolecules and nanomaterials

Figure 2 from the paper. (c) The authors. Reproduced under the CC-BY license.

This paper describes a method by which the harmonic vibrational free energy contributions can be accurately approximated at roughly 10% of the cost of a conventional Hessian calculation.

The equations for the vibrational free energy contributions are recast in terms of the trace of a matrix function (remember that the trace of a matrix is equal to the sum of its eigenvalues). This removes the need for matrix diagonalisation, which is costly for large matrices. Then they use a stochastic estimator of the trace where the trace is rewritten in terms of displacements along $n$ random vectors. The accuracy of free energy differences can be further increased by using the same random vectors for both reactants and products.
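The stochastic trace idea can be sketched with a Hutchinson-type estimator, $\mathrm{tr}(A) \approx \frac{1}{n}\sum_k z_k^T A z_k$ with random $\pm 1$ vectors $z_k$. Whether this matches the paper's exact estimator is an assumption on my part. The small matrices below are hypothetical stand-ins for the matrix function of the Hessian, and the function names are mine.

```python
import random

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def rademacher_vectors(dim, n_vectors, seed=0):
    """Random vectors with independent +1/-1 entries."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(n_vectors)]

def stochastic_trace(A, vectors):
    """Average of z^T A z over the supplied random vectors."""
    total = 0.0
    for z in vectors:
        Az = matvec(A, z)
        total += sum(zi * azi for zi, azi in zip(z, Az))
    return total / len(vectors)

# Hypothetical stand-ins for "reactant" and "product" matrices.
A = [[2.0, 0.1, 0.0], [0.1, 3.0, 0.1], [0.0, 0.1, 4.0]]   # trace 9.0
B = [[2.5, 0.1, 0.0], [0.1, 3.0, 0.1], [0.0, 0.1, 4.0]]   # trace 9.5

many = rademacher_vectors(3, 500)
trace_A = stochastic_trace(A, many)          # close to 9.0, not exact

few = rademacher_vectors(3, 10, seed=1)
diff = stochastic_trace(B, few) - stochastic_trace(A, few)   # shared vectors
```

In this toy case the two matrices differ only on the diagonal, so the noise from the off-diagonal elements cancels completely when the same vectors are reused and the difference comes out as 0.5 with just 10 vectors. Real systems will not cancel this cleanly, but it illustrates why sharing random vectors between reactants and products helps.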

The accuracy of this approximation increases with the number of displacement vectors (and, hence, gradient evaluations) used. The authors tested it on several large systems, such as protein-ligand binding, and found that sub-kcal/mol accuracy can be obtained at about 10% of the cost of a conventional Hessian calculation plus diagonalisation.

It is now quite common to scale the entropy contributions from small (<100 cm$^{-1}$) frequencies to get better numerical stability. I am not sure whether this is possible in the current approach since individual frequencies are not computed explicitly.

The code and data are "available upon reasonable request" 😕

## Sunday, October 31, 2021

### Explaining and avoiding failure modes in goal-directed generation

Figure 1 from the paper. (c) the authors 2021. Reproduced under the CC-BY-NC license

When you use search algorithms to optimise molecular properties predicted by ML-models, there is always the danger of going into regions of chemical space where the ML model no longer makes accurate predictions. Last year Renz et al. tried to quantify this phenomenon and basically concluded that it is a big problem. The current paper does not agree.

Renz et al. develop three different RF models as shown in the figure above for classifying bioactivity. In principle, all three models should give the same predictions. A search algorithm is then used to find molecules for which one of the models (the optimisation model) predicts high scores, and these molecules are rescored using the other two control models. As the search proceeds, these scores begin to diverge, leading Renz et al. to conclude that the search algorithms exploit biases particular to the optimisation model and do not, in fact, predict molecules that are truly active.

I almost highlighted this paper when it first appeared but was concerned by the relatively small sizes of the data sets used: 842, 667, and 842 molecules with 40, 140, and 59 active molecules, respectively. The paper by Langevin et al. suggests that this concern was justified.

First they created a holdout set of 10% of the molecules, and repeated the procedure by Renz et al. on the remaining 90%. They showed that the differences in performance for the holdout set are the same as those observed by Renz et al., i.e. these differences have to do with the models/training sets themselves and not necessarily with the search algorithms.

To show that it, in fact, has nothing to do with the search algorithms, they then demonstrated that the difference in model performance can be significantly reduced using two different approaches. One is to split the two data sets such that they are as similar as possible. Another is to use a better RF model: 200 trees and at least 3 samples per leaf, instead of 100 trees and 1 sample per leaf originally used by Renz et al.