## Wednesday, December 30, 2020

### Deep Molecular Dreaming: Inverse machine learning for de-novo molecular design and interpretability with surjective representations

Cynthia Shen, Mario Krenn, Sagi Eppel, Alan Aspuru-Guzik (2020)
Highlighted by Jan Jensen

Figure 2 from the paper. (c) the authors 2020. Reproduced under the CC-BY license

This paper presents an interesting approach for obtaining molecules with particular properties. A 4-layer NN is trained to predict logP values based on a one-hot encoded string representation (SELFIES) of molecules. The NN is trained in the usual way: a molecule is input, the predicted logP value is compared to the true value, and the NN weights are adjusted to minimise the difference - a process that is repeated for a certain number of epochs.

Once trained, the process is then reversed. A target logP value is chosen together with an arbitrary molecule. The difference in predicted and target logP value is then minimised by adjusting the one-hot encoded representation of the molecule - a process that is repeated for a certain number of epochs.

In both cases the adjustments are based on the gradient of the error with respect to the weights (in the first case) or the one-hot encoded vectors (in the second case). Since the start vector is binary but becomes a real-valued vector once the optimisation starts, there are some convergence problems. The authors show that this can be addressed by randomly changing the 0's in the one-hot encoding to some number between 0 and a maximum value.
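A minimal sketch of this inverse step, with a frozen one-layer linear model standing in for the paper's 4-layer NN (the model, the alphabet size, the learning rate, and the 0.2 noise cap are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the trained NN: a frozen linear map from a
# flattened one-hot SELFIES encoding to a property value.
n_positions, alphabet_size = 5, 4
W = rng.normal(size=n_positions * alphabet_size)

def predict(x):
    return float(W @ x)

# Start from an arbitrary molecule's one-hot encoding ...
x = np.zeros((n_positions, alphabet_size))
x[np.arange(n_positions), rng.integers(0, alphabet_size, n_positions)] = 1.0
x = x.ravel()

# ... and, as in the paper, replace the exact zeros with small random values
# to ease convergence (the cap of 0.2 is an arbitrary choice here).
x[x == 0.0] = rng.uniform(0.0, 0.2, size=int((x == 0.0).sum()))

target = 3.0  # chosen target logP value
lr = 0.001
for epoch in range(2000):
    error = predict(x) - target
    grad = 2.0 * error * W   # d(error^2)/dx for this linear stand-in model
    x -= lr * grad           # adjust the input, not the weights

print(round(predict(x), 2))  # close to the chosen target
```

The only difference from ordinary training is which gradient is used: the weights stay frozen and the molecule's vector representation is what gets updated.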

Since SELFIES are being used, every vector representation can be resolved to a molecule, which means that one can also analyse the optimisation path to gain insight into how the NN translates molecules into a property prediction.
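A toy illustration of that point: snapshots along the optimisation path can be decoded by taking the argmax at each string position. The four-symbol alphabet below is made up, and real SELFIES-to-SMILES decoding would use the selfies package; this only shows why every intermediate vector still resolves to a valid string.

```python
# Hypothetical mini-alphabet; the real SELFIES alphabet is much larger.
alphabet = ["[C]", "[O]", "[N]", "[=C]"]

def to_selfies(x):
    """Greedy argmax decode of a (positions x alphabet) score matrix."""
    return "".join(alphabet[max(range(len(alphabet)), key=row.__getitem__)]
                   for row in x)

# Two snapshots along a hypothetical optimisation path: even though the
# entries are real-valued, each snapshot decodes to a valid SELFIES string,
# i.e. an actual molecule.
path = [
    [[0.9, 0.1, 0.0, 0.0], [0.2, 0.7, 0.1, 0.0]],
    [[0.4, 0.5, 0.1, 0.0], [0.1, 0.2, 0.6, 0.1]],
]
for snapshot in path:
    print(to_selfies(snapshot))
```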

## Monday, November 30, 2020

### An Open Drug Discovery Competition: Experimental Validation of Predictive Models in a Series of Novel Antimalarials

Highlighted by Jan Jensen

Figure 4 from the paper. (c) 2020 The authors. Reproduced under the CC BY-NC-ND 4.0 license.

While there are many ML-based design studies in the literature, it is quite rare to see one with experimental verification. The Open Source Malaria (OSM) project made two rounds of antimalarial activity data available and invited researchers to use this data to develop predictive models and identify molecules with high potency for synthesis. Here I'll focus on the second round, which started in 2019, where the participants worked with a ~400-compound dataset.

Here 10 teams from both industry and academia submitted models (classifiers) that were judged by a panel of experts using a held-back dataset. The four teams with the highest-scoring models (with precisions between 81% and 91%) were then asked to submit two new molecules each for experimental verification: one possessing a triazolopyrazine core and one without. However, the latter compounds all proved synthetically inaccessible, as did two with the triazolopyrazine core. Thus, a total of six molecules were synthesised and tested:

"Three of the six compounds were found to be active (<1 μM) or moderately active (1–2.5 μM) in in vitro growth assays with asexual blood-stage P. falciparum (3D7) parasites, representing a hit rate of 50% on a small sample size. Up to this point a total of 398 compounds had been made and evaluated for in vitro activity in OSM Series 4, with the design of these compounds driven entirely by the intuition of medicinal chemists. By setting a potency cut-off of 2.5 μM (the upper limit of reasonable activity), the tally of active compounds discovered in this series stands at 165, representing a comparable human intuition-derived hit rate of 41% on a larger sample size."
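The two hit rates quoted above follow directly from the reported counts:

```python
# Counts as reported in the paper.
ml_hit_rate = 3 / 6        # ML-guided round: 3 active of 6 synthesised
human_hit_rate = 165 / 398  # intuition-driven OSM Series 4: 165 active of 398

print(f"ML-guided: {ml_hit_rate:.0%}, intuition-driven: {human_hit_rate:.0%}")
```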
Interestingly, the Optibrium/Intelligens candidate was initially met with a great deal of scepticism by the expert panel but turned out to be the best overall candidate.

## Friday, October 30, 2020

### Identifying domains of applicability of machine learning models for materials science

Christopher Sutton, Mario Boley, Luca M. Ghiringhelli, Matthias Rupp, Jilles Vreeken, Matthias Scheffler (2020)
Highlighted by Jan Jensen

Figure 3 from the paper (c) The authors 2020. Reproduced under the CC-BY license

This paper applies subgroup discovery (SGD) to detect the domain of applicability (DA) of three ML models for predicting formation energies of certain solid-state materials. The authors define several DA features such as unit cell dimensions, composition, and interatomic distances. These features are different from the (much more complex) representations used as input to the ML models. The SGD algorithm then uses the DA features together with the ML-model errors to determine a selector (σf) by finding the largest possible subgroup of systems (the coverage) with the lowest possible error.

The selector is a definition of this subgroup in terms of some of the DA features, which are automatically chosen by the SGD algorithm. For example, the DA of one of the models is defined by three DA features:

where "^" means "and". The MAE for this DA is 7.6 meV/cation, compared to 14.2 meV/cation for the test set used to train the ML model.
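The idea behind a selector can be sketched with made-up DA features and errors. The real SGD algorithm searches for the selector automatically; here a single hand-written conjunction is simply evaluated, and the feature names and thresholds are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up DA features and per-material model errors (meV/cation).
n = 1000
min_dist = rng.uniform(1.5, 3.5, n)        # hypothetical interatomic-distance feature
cell_volume = rng.uniform(50.0, 200.0, n)  # hypothetical unit-cell feature
errors = rng.exponential(14.0, n)

# A selector is a conjunction ("^") of conditions on the DA features.
selector = (min_dist > 2.5) & (cell_volume < 120.0)

# Pretend the model really is more reliable inside this subgroup.
errors[selector] *= 0.5

coverage = selector.mean()            # fraction of systems in the subgroup
mae_inside = errors[selector].mean()  # MAE within the domain of applicability
mae_overall = errors.mean()

print(f"coverage {coverage:.2f}, "
      f"MAE inside {mae_inside:.1f}, overall {mae_overall:.1f}")
```

The SGD algorithm's job is to find the conjunction that makes the coverage as large, and the subgroup error as small, as possible.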

Interestingly, the three ML models this analysis was applied to had virtually the same overall MAEs but different DAs and quite different MAEs within each domain. Also, the coverage of each DA varied considerably.

The SGD method appears to be a very useful and generally applicable tool for ML. The SGD algorithm used for this study is freely available here.

## Monday, September 28, 2020

### Regio-selectivity prediction with a machine-learned reaction representation and on-the-fly quantum mechanical descriptors

Yanfei Guan, Connor W. Coley, Haoyang Wu, Duminda Ranasinghe, Esther Heid, Thomas J. Struble, Lagnajit Pattanaik, William H. Green, and Klavs F. Jensen (2020)

Highlighted by Jan Jensen

Figure 3 from the paper. (c) The authors 2020. Reproduced under the CC BY-NC-ND 4.0 license

The red and green columns show the accuracy of regioselectivity prediction as a function of training set size (N) for two ML models: one based on QM descriptors and the other based on a graph NN (GNN). For N = 200, QM outperforms GNN by 9%, but the performance of QM improves by no more than 1.5% for larger training sets. GNN does improve and ends up outperforming QM by 2.5% for large training sets.

Combining QM and GNN (QM-GNN) gives roughly the same accuracy as QM and GNN for small and large training sets, respectively. To remove the cost of the QM, a separate GNN model for the QM descriptors is developed and combined with the GNN model of regioselectivity (ml-QM-GNN), which gives roughly the same results at much faster speed. Note that this GNN descriptor model is trained on a different, and much larger, data set (since no experimental data is needed) and can be used to augment other types of predictions.
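The two-stage ml-QM-GNN idea can be caricatured with plain linear models standing in for the GNNs. Everything below (data, features, the "descriptor", the "property") is synthetic; the point is only the workflow: a descriptor model trained on a large cheap dataset feeds predicted descriptors into a property model trained on a small experimental dataset, so no QM calculation is needed at inference time.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_linear(X, y):
    """Least-squares stand-in for training a GNN."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

n_large, n_small, d = 5000, 200, 8

# Stage 1: learn a "QM descriptor" from structural features on a large,
# cheaply generated dataset (no experimental data needed).
X_large = rng.normal(size=(n_large, d))
true_desc = X_large @ rng.normal(size=d)
desc_model = fit_linear(X_large, true_desc)

# Stage 2: predict the target property from structural features plus the
# *predicted* descriptor, trained on the small experimental dataset.
X_small = rng.normal(size=(n_small, d))
desc_small = X_small @ desc_model
y_small = X_small.sum(axis=1) + desc_small   # synthetic "regioselectivity"
prop_model = fit_linear(np.column_stack([X_small, desc_small]), y_small)

# Inference: predicted descriptors replace the expensive QM step entirely.
X_new = rng.normal(size=(5, d))
y_pred = np.column_stack([X_new, X_new @ desc_model]) @ prop_model
print(y_pred.shape)
```

Because stage 1 never needs experimental labels, its training set can be made as large as one can afford to compute, which is exactly what makes the descriptor model reusable for other prediction tasks.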

The fact that ml-QM-GNN outperforms QM-GNN for N = 200 indicates that the accuracies are only good to about +/- 1%, so the slightly better performance of ml-QM-GNN compared to GNN for N = 2000 is not significant. So ml-QM only enhances the accuracy for ca N < 1000 for this particular property, but is definitely worth doing for problems with only a few hundred data points - especially now that the ml-QM model has already been developed.

## Monday, August 31, 2020

### A community-powered search of machine learning strategy space to find NMR property prediction models

Lars A. Bratholm, Will Gerrard, Brandon Anderson, Shaojie Bai, Sunghwan Choi, Lam Dang, Pavel Hanchar, Addison Howard, Guillaume Huard, Sanghoon Kim, Zico Kolter, Risi Kondor, Mordechai Kornbluth, Youhan Lee, Youngsoo Lee, Jonathan P. Mailoa, Thanh Tu Nguyen, Milos Popovic, Goran Rakocevic, Walter Reade, Wonho Song, Luka Stojanovic, Erik H. Thiede, Nebojsa Tijanic, Andres Torrubia, Devin Willmott, Craig P. Butts, David R. Glowacki, & Kaggle participants (2020)

Highlighted by Jan Jensen

Figure 1a and 1b from the paper (c) The authors. Reproduced under the CC-BY licence

Disclaimer: I was Lars Bratholm's PhD advisor.

This paper describes the results of a Kaggle competition called Champs for developing ML models that predict NMR coupling constants with DFT accuracy.

In a Kaggle competition the host provides a public training and test set. Participants use these datasets to develop ML models, which the site then evaluates on a private test set. The accuracy of each model is posted, and the object of the competition is to submit the most accurate model before the end of the competition. Competitors can submit as often as they want during the competition, which in this case lasted 3 months. The winners receive cash prizes: in this case the top 5 models received \$12.5K, \$7.5K, \$5K, \$3K, and \$2K, respectively.
"[Champs] received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published ‘in-house’ efforts. A meta-ensemble model constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art."
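A meta-ensemble of this kind can be sketched as follows. The paper's quote only says it is a linear combination of the top predictions; fitting the weights by least squares on a validation set, and the error scales below, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Three hypothetical submissions: each is the truth plus its own error
# pattern, with different accuracies.
n_val = 500
y_true = rng.normal(size=n_val)
preds = np.column_stack([y_true + rng.normal(scale=s, size=n_val)
                         for s in (0.3, 0.4, 0.5)])

# Fit the linear-combination weights by least squares on the validation set.
weights, *_ = np.linalg.lstsq(preds, y_true, rcond=None)
ensemble = preds @ weights

mae = lambda p: float(np.mean(np.abs(p - y_true)))
print([round(mae(preds[:, i]), 3) for i in range(3)],
      round(mae(ensemble), 3))
```

Because the models' errors are partly independent, the weighted combination ends up more accurate than any single submission, which is the effect the quote describes.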
Two of the top 5 teams had no domain specific expertise.

Is this the way of the future? Should any chemistry ML proposal include Kaggle prize money in the budget? I don't see any scientific reasons why not.