Sunday, February 28, 2021

Uncertainty Quantification Using Neural Networks for Molecular Property Prediction

Lior Hirschfeld, Kyle Swanson, Kevin Yang, Regina Barzilay, and Connor W. Coley 2020
Highlighted by Jan Jensen

Figure 3 from the paper. (c) American Chemical Society 2020

Given the blackbox nature of ML models it is very important to have some measure of how much to trust their predictions. There are many ways to do this paper shows "none of the methods we tested is unequivocally superior to all others, and none produces a particularly reliable ranking of errors across multiple data sets."

This conclusion is neatly summarised in the figure shown above for 5 common datasets, 2 different ML methods, and 4 different methods for uncertainty quantification. For each combination of these the plot shows the RMSE for for the 100, 50, 25, 10, and 5% of the test set on which the uncertainty quantification method calculated the lowest uncertainty for the hold-out set.

Generally, the RMSE drops as expected but the drops are in many cases decidedly modest past 50% and it can even increase in some cases. In most cases there is very little difference between the different uncertainty quantification methods, but sometimes there is and it's hard to predict when.

One thing that struck me when reading this paper is that many studies who include uncertainty quantification, e.g. using the ensemble approach, often just take it for granted that it works and don't present tests like this.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Saturday, January 30, 2021

Accelerating High-Throughput Virtual Screening Through Molecular Pool-Based Active Learning

David E. Graff, Eugene I. Shakhnovich, and Connor W. Coley (2020)
Highlighted by Jan Jensen

Figure 1 and part of Figure 2 from the paper. (c) The authors 2021. 

This paper shows how to find the highest scoring molecules in a very large library of molecules by scoring only a very small percentage of the library. The focus of the paper is docking scores but it can in principle to be used for any molecular property.

The general approach is simple: 

1. Start by picking a random sample of the library (say 100 molecules out of a library of 10.000 molecules) and evaluate their scores. 

2. Use these 100 points to train a machine-learning (ML) model to predict the scores.

3. Screen all 10,000 molecules using the ML model. The assumption is that training/using the ML model is much cheaper than evaluating the score.

4. Select the 100 best molecules according to the ML model, compute the scores, and use them to retrain the ML model.

5. Repeat steps 3 and 4.

The best molecules could be the best-scoring molecules (this a known as "greedy" optimisation). However, if the uncertainty of the ML prediction for each molecule can be quantified, there are several other options for what best is (use of these approaches are referred to as Bayesian optimisation).  The study investigates four selection functions involving standard deviations but finds the greedy approach works best.

The approach is tested on three different datasets with known docking scores of varying sizes (10K, 50K, 2M, and 99M). The study tests three different machine learning models: RF and NN using fingerprints as well as a graph convolutional model (which works best) and various choices batch sizes. 

In the case of the 99M dataset more than half of the top-50,000 scoring molecules can be found by docking only 600K molecules using this approach. 

However, let's turn that last sentence around: if you're developing an ML model to find high-scoring molecules your training set size needs to be 600K. Furthermore, the study shows that if you just pick 500K random molecules for your training set, your ML model won't identify any of the top-50,0000 molecules. You have to build this very large training set in this iterative fashion to get an ML model that can reliably identify the top-scoring molecules.