## Wednesday, April 27, 2022

### Pairwise Difference Regression: A Machine Learning Meta-algorithm for Improved Prediction and Uncertainty Quantification in Chemical Search

TOC picture from the paper (c) 2021 ACS

This paper tries to solve two problems at once: data augmentation for small data sets and a method-independent uncertainty quantification (UQ).

Data augmentation is quite common in areas like image classification, where images can be perturbed (e.g. rotated by a few degrees) and still be recognisable. However, this is difficult in chemistry, where small perturbations in structure can have a non-negligible effect on properties. For text-based molecular representations one can use non-canonical SMILES for augmentation, but there is no generally applicable method.

Similarly, most UQ methods are specific to the type of machine learning model, with the exception of ensemble methods, which require training and deploying several models and can therefore be expensive.

The paper offers a simple solution to both. The method is trained to reproduce the ground truth difference for all $n^2$ molecule pairs, thereby increasing the training set size significantly. When making a prediction for a new molecule, the model predicts the differences relative to all training set molecules. Adding each predicted difference to the known value of the corresponding training molecule gives one estimate per training molecule, and the standard deviation of these estimates serves as a measure of prediction uncertainty. Pretty neat idea and easy to implement! The main change is to construct molecular representations for the molecule pairs, but the authors outline one easy-to-implement approach.
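To make the idea concrete, here is a minimal sketch using only the Python standard library. It is not the authors' implementation: the pair representation (both vectors plus their difference), the tiny nearest-neighbour base learner, the function names, and the toy data are all assumptions chosen for illustration. Any regressor could take the place of the base learner.

```python
import math
import statistics


def pair_features(xa, xb):
    """Assumed pair representation: both vectors plus their element-wise difference."""
    return list(xa) + list(xb) + [a - b for a, b in zip(xa, xb)]


class NearestNeighbourRegressor:
    """Tiny stand-in base learner (k nearest neighbours); any regressor could be used."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, features, targets):
        self.features = [list(f) for f in features]
        self.targets = list(targets)
        return self

    def predict_one(self, x):
        ranked = sorted(
            range(len(self.features)),
            key=lambda i: math.dist(self.features[i], x),
        )
        nearest = ranked[: self.k]
        return sum(self.targets[i] for i in nearest) / len(nearest)


def fit_pairwise(X, y, base):
    """Train the base learner on all n^2 ordered pairs to predict y_i - y_j."""
    n = len(X)
    pair_X = [pair_features(X[i], X[j]) for i in range(n) for j in range(n)]
    pair_y = [y[i] - y[j] for i in range(n) for j in range(n)]
    return base.fit(pair_X, pair_y)


def predict_with_uncertainty(model, X, y, x_new):
    """One estimate of y_new per training molecule; return their mean and spread."""
    estimates = [
        model.predict_one(pair_features(x_new, xj)) + yj for xj, yj in zip(X, y)
    ]
    return statistics.mean(estimates), statistics.pstdev(estimates)


# Toy data: two made-up descriptors per "molecule" and a made-up property.
X_train = [(0.0, 1.0), (1.0, 0.5), (2.0, 2.0), (3.0, 0.0)]
y_train = [2.0 * a - b for a, b in X_train]
model = fit_pairwise(X_train, y_train, NearestNeighbourRegressor(k=1))
mean, spread = predict_with_uncertainty(model, X_train, y_train, (1.5, 1.0))
print(f"prediction = {mean:.2f} +/- {spread:.2f}")
```

Note that four training molecules already give 16 training pairs, and that a single trained model yields both the prediction and its spread.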

Depending on the task and training set size the data augmentation decreases the MAE by 3-40%. UQ quality is notoriously difficult to quantify, but the method appears to give uncertainties similar to those obtained by a random forest method.

## Tuesday, March 29, 2022

### Machine Learning May Sometimes Simply Capture Literature Popularity Trends: A Case Study of Heterocyclic Suzuki−Miyaura Coupling

What do you infer from this quote from the paper (emphasis added)?

Another important problem, tackled herein, deals with the prediction of optimal conditions for a particular reaction in which there are generally multiple viable choices of solvents or reagents. Several works[21−24] have attempted to use ML for the prediction of reaction conditions, and the overall message they seem to convey is that ML can, in fact, offer accurate predictions provided adequate numbers of literature examples on which to build the models (but see also critical ref 6). However, here, we demonstrate with a case study that this may have been an overoptimistic interpretation, and that even with large quantities of carefully curated literature data, ML approaches may not perform considerably better than estimates based on the popularity of reaction conditions reported in the literature. In other words, these ML models do not provide significantly more insights than just suggesting the most popular conditions which could be obtained by simple statistics over literature examples[25,26] and no “machine intelligence.”
I can tell you what I inferred. References 21-24 used ML models to predict optimal reaction conditions, but failed to check whether they "provide significantly more insights than just suggesting the most popular conditions". I also inferred that the results from this study suggest that, had the authors checked, they would have found that not to be the case.

However, the four references refer to two papers (21 and 23) by Doyle and co-workers on the prediction of reaction yields (not conditions) and two papers, one by Coley and co-workers and one by Reisman and co-workers (22 and 24, respectively), on the prediction of reaction conditions with comparison to popularity baselines.

The paper looks at the prediction of solvent and base (and not catalysts and temperature as implied by the TOC graphic above) for ca. 10,000 Suzuki coupling reactions from Reaxys. The best top-1 accuracies for base and solvent with ML are 80.6% and 51.7%, respectively, compared to popularity baseline values of 76.8% and 29.8%. The authors use the term "significantly" (and related terms) without ever quantifying what they deem significant, but to me the ML solvent predictions seem significantly better than the popularity baseline.
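For readers who want to see what a popularity baseline amounts to, here is a small sketch. The function name and the toy labels are hypothetical and not taken from the paper; only the four percentages in `reported` come from the numbers quoted above. The second half simply restates those numbers as a gain in percentage points and as the fraction of baseline errors that the ML model removes.

```python
from collections import Counter


def popularity_top1_accuracy(train_labels, test_labels):
    """Always predict the most frequent training label; return top-1 accuracy."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


# Hypothetical toy labels, not data from the paper.
train = ["K2CO3"] * 7 + ["Cs2CO3"] * 2 + ["K3PO4"]
test = ["K2CO3", "K2CO3", "Cs2CO3", "K2CO3"]
baseline = popularity_top1_accuracy(train, test)
print(f"toy popularity baseline: {baseline:.0%}")

# Top-1 accuracies quoted above, in percent: (ML, popularity baseline).
reported = {"base": (80.6, 76.8), "solvent": (51.7, 29.8)}
gains = {}
for task, (ml, pop) in reported.items():
    point_gain = ml - pop
    error_reduction = (ml - pop) / (100.0 - pop)
    gains[task] = (point_gain, error_reduction)
    print(f"{task}: +{point_gain:.1f} points, {error_reduction:.0%} of baseline errors removed")
```

Whether either gain is statistically significant would also depend on the test set size, which is not addressed here.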

Furthermore, as Coley and co-workers point out, the true metric is the accuracy of the combined prediction, e.g. correct solvent and base. For example, in the case of correct catalyst, solvent, and reagent, Coley and co-workers found an accuracy of 57.3% compared to a popularity baseline of only 5.7%. However, I am not even certain whether Grzybowski and co-workers would deem that a significant improvement.
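The difference between per-field and combined accuracy is easy to see on a toy example. The sketch below uses made-up predictions for four reactions; the function names and the data are hypothetical.

```python
def top1_accuracy(predicted, actual):
    """Fraction of reactions where the top prediction matches the literature value."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def joint_top1_accuracy(predicted_by_field, actual_by_field):
    """Fraction of reactions where every field (e.g. solvent and base) is correct."""
    fields = list(actual_by_field)
    n = len(actual_by_field[fields[0]])
    hits = sum(
        all(predicted_by_field[f][i] == actual_by_field[f][i] for f in fields)
        for i in range(n)
    )
    return hits / n


# Hypothetical literature conditions and predictions for four reactions.
actual = {
    "solvent": ["dioxane", "THF", "DMF", "toluene"],
    "base": ["K2CO3", "K2CO3", "Cs2CO3", "K3PO4"],
}
predicted = {
    "solvent": ["dioxane", "THF", "dioxane", "toluene"],
    "base": ["K2CO3", "Cs2CO3", "Cs2CO3", "K3PO4"],
}

solvent_acc = top1_accuracy(predicted["solvent"], actual["solvent"])
base_acc = top1_accuracy(predicted["base"], actual["base"])
joint_acc = joint_top1_accuracy(predicted, actual)
print(solvent_acc, base_acc, joint_acc)
```

Here each field is right for three reactions out of four, but only two reactions have both fields right, so the combined accuracy can never exceed the weakest individual one.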

On a more constructive note, the topic of the paper does relate to an interesting fundamental question in ML on how to deal with imbalanced data, i.e. where there is a single very popular choice. One would perhaps naively suspect that this would be easier for a machine to learn, i.e. you just have to learn a few exceptions. But how do you typically learn exceptions? By memorising them, and we tend to employ many ML techniques to avoid just this.

## Monday, February 28, 2022

### Finding hits among billions of molecules

Assaf Alon, Jiankun Lyu, Joao M. Braz, Tia A. Tummino, Veronica Craik, Matthew J. O’Meara, Chase M. Webb, Dmytro S. Radchenko, Yurii S. Moroz, Xi-Ping Huang, Yongfeng Liu, Bryan L. Roth, John J. Irwin, Allan I. Basbaum, Brian K. Shoichet & Andrew C. Kruse. Structures of the σ2 receptor enable docking for bioactive ligand discovery (2021)

Arman A. Sadybekov, Anastasiia V. Sadybekov, Yongfeng Liu, Christos Iliopoulos-Tsoutsouvas, Xi-Ping Huang, Julie Pickett, Blake Houser, Nilkanth Patel, Ngan K. Tran, Fei Tong, Nikolai Zvonok, Manish K. Jain, Olena Savych, Dmytro S. Radchenko, Spyros P. Nikas, Nicos A. Petasis, Yurii S. Moroz, Bryan L. Roth, Alexandros Makriyannis & Vsevolod Katritch. Synthon-based ligand discovery in virtual libraries of over 11 billion compounds (2021)

Highlighted by Jan Jensen

Figure 2a and b from Alon et al. (c) 2021 Nature

The recent developments in make-on-demand molecular libraries present an interesting methodological challenge to virtual screening. Not too long ago, such a library would contain hundreds of millions, or even a billion, molecules, and there was still a chance to dock a significant portion of it. However, the sizes of the libraries have grown to well beyond 20 billion and show no sign of stopping. There is no way wholesale docking can keep up with this growth, so new approaches are needed.

One computational approach that has kept up with the growth of make-on-demand libraries is similarity searching. It is still possible to search these enormous libraries for similar molecules in just a few minutes.
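As a reminder of what a similarity search computes, here is a brute-force sketch using the Tanimoto coefficient on sets of fingerprint "on" bits, which is one common similarity measure. The fingerprints and molecule names are made up, and searches over billions of molecules rely on specialised tooling rather than a plain loop like this.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def most_similar(query, library, top_n=2):
    """Rank library entries by similarity to the query, best first."""
    scored = [(tanimoto(query, fp), name) for name, fp in library.items()]
    return sorted(scored, reverse=True)[:top_n]


# Hypothetical fingerprints: each molecule is the set of bit positions that are set.
query = {1, 4, 7, 9}
library = {
    "mol_a": {1, 4, 7, 9, 12},
    "mol_b": {2, 3, 5},
    "mol_c": {1, 4, 8},
}
hits = most_similar(query, library)
print(hits)
```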

Alon et al. use this general idea to select and dock 490 million molecules with properties that are similar to those of known binders to the target. Based on the docking scores they prioritised 577 molecules, of which 484 were successfully made and 127 showed good activity against the target. They then extracted 20,000 analogues of the four best candidates from among 28 billion molecules in the Enamine REAL Space make-on-demand library and docked them. The 105 best candidates were made and tested, leading to further improvement in the measured affinities.

Sadybekov et al. essentially dock the individual building blocks used in the make-on-demand library and then combine the best-scoring fragments into about 1 million molecules for a second round of docking. Using this approach they identified 80 promising candidates, of which 60 could be synthesised. Of these 60 molecules, 21 proved active. They then extracted 920 analogues of the three best candidates from among 11 billion molecules in the Enamine REAL Space make-on-demand library and docked them. The 121 best candidates were made and tested, leading to further improvement in the measured affinities.

There are several take-home messages here.

The percentage of active compounds against a particular target in a library is very small, so you don't get a lot of useful hits until you work with these enormous libraries.

Docking does help in identifying active compounds. Docking has a bad rep in certain circles and I have seen several people refer to docking programs as "random number generators", but studies like these show that this is not the case. Sure, if one expects an excellent, or even respectable, correlation coefficient between docking scores and binding affinities, one will be sorely disappointed. However, as these studies show, molecules with good docking scores have a much higher chance of being active than molecules with bad docking scores.

The success rate seems to be about 30-50% depending on the target. So if you are at the lower end and only able to make and test a handful of candidates (which is often the case for academic studies), there's a reasonable chance you won't find any actives and conclude that docking is useless. It's only when you are able to make and test dozens of molecules that you see that docking is working for you. The make-on-demand libraries now make such numbers feasible for academics.
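A quick back-of-the-envelope calculation shows how large that chance is. The sketch below assumes each tested molecule is independently active with the same probability, which is a simplification, and uses a 30% hit rate from the lower end of the range above. The function name is made up.

```python
def prob_no_actives(hit_rate, n_tested):
    """Chance that none of n_tested molecules is active, assuming independent hits."""
    return (1.0 - hit_rate) ** n_tested


# 30% hit rate (the lower end of the range quoted above) for a few sample sizes.
for n in (3, 5, 12, 24):
    print(f"{n:2d} molecules tested: P(no actives) = {prob_no_actives(0.3, n):.4f}")
```

Under these assumptions, testing three molecules leaves roughly a one-in-three chance of finding nothing, while testing two dozen makes that outcome very unlikely.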

Finally, several of the co-authors on the two papers I highlight are Ukrainian and are, along with their families and friends, likely in grave danger right now as their country is being attacked by Putin and his ilk.

## Friday, January 28, 2022

### Machine learning potentials always extrapolate, it does not matter.

The Convex Hull (blue line) encloses the blue points. It maximises the area while minimising the circumference.

ML models are generally thought to only interpolate, but this paper suggests that this is not the case. At first sight this seems counterintuitive, but on some reflection it may not be so strange at all.

First of all, the authors define an extrapolation as a prediction for a point outside (red point) the Convex Hull (blue line) defined by the training set points (blue points). They perform this analysis for three train/test sets related to solid state chemistry and show that between 80% and 100% of the test set data points lie outside the Convex Hull defined by the training set data points, but ML models trained on the training set perform satisfactorily for the test set (hence the title).

While this might seem counterintuitive at first, is it really so strange that a model trained on the blue points performs better for the red point than the green point? The red point is closer to the blue points and there is really only extrapolation in the x direction.
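The definition is easy to try out in two dimensions. The sketch below builds the Convex Hull of a few made-up training points with Andrew's monotone chain algorithm and then checks on which side of every hull edge a query point lies. The points and function names are hypothetical, the check needs at least three non-collinear training points, and in the 100 or more dimensions used in the paper one would need a different approach (membership is typically posed as a linear-programming problem).

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def is_interpolation(point, hull):
    """True if the point lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


# Made-up training points (the "blue points") and two query points.
train = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)]
hull = convex_hull(train)
print(hull)
print(is_interpolation((1, 1), hull))  # query surrounded by training points
print(is_interpolation((5, 1), hull))  # query just beyond the hull in x only
```

The second query is an extrapolation by this definition even though it sits right next to the training data, which is the situation described for the red point.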

The representation vectors used in this study all have at least 100 dimensions, and a point is said to correspond to an extrapolation if it lies outside the Convex Hull in even one of these dimensions. By using PCA the authors show that in some cases extrapolation occurs for all test points when considering only the 10 most important dimensions, while 20 dimensions are needed for truly accurate results. However, for most cases reasonable accuracy can be obtained with 4 dimensions, where more than 90% of the test set is contained in the Convex Hull of the training set. So IMO the picture is not as clear-cut as the title suggests.

The authors show that the best predictor of accuracy is the density of training set points in the region of the test set molecule.
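One simple way to put a number on "density of training set points" is the mean distance from a query point to its k nearest training points, where a smaller value means a denser neighbourhood. This is only an illustrative proxy with made-up points and a made-up function name, and it is not necessarily the density measure used in the paper.

```python
import math


def knn_mean_distance(point, train_points, k=3):
    """Mean Euclidean distance to the k nearest training points (lower = denser)."""
    distances = sorted(math.dist(point, t) for t in train_points)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Made-up training points: a tight cluster plus one isolated point.
train_points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
dense_query = knn_mean_distance((0.5, 0.5), train_points)
sparse_query = knn_mean_distance((4, 4), train_points)
print(f"dense region: {dense_query:.2f}, sparse region: {sparse_query:.2f}")
```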