Sunday, October 31, 2021

Explaining and avoiding failure modes in goal-directed generation

Maxime Langevin, Rodolphe Vuilleumier, and Marc Bianciotto (2021) 
Highlighted by Jan Jensen

Figure 1 from the paper. (c) The authors 2021. Reproduced under the CC-BY-NC license

When you use search algorithms to optimise molecular properties predicted by ML models, there is always the danger of straying into regions of chemical space where the ML model no longer makes accurate predictions. Last year Renz et al. tried to quantify this phenomenon and concluded that it is a big problem. The current paper does not agree.

Renz et al. develop three different RF models, as shown in the figure above, for classifying bioactivity. In principle, all three models should give the same predictions. A search algorithm is then used to find molecules for which one of the models (the optimisation model) predicts high scores, and these molecules are rescored using the other two control models. As the search proceeds, these scores begin to diverge, leading Renz et al. to conclude that the search algorithms exploit biases particular to the optimisation model and do not, in fact, find molecules that are truly active.
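To make the three-model setup concrete, here is a minimal sketch using scikit-learn. It is not the authors' code: the data are synthetic random bit vectors standing in for fingerprints, and the assignment of the control models (one trained on the same split with a different seed, one trained on the other split) is my reading of the setup and should be treated as an assumption. The names `fit_rf` and `score_all` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for fingerprints and activity labels (NOT real assay data).
X = rng.integers(0, 2, size=(600, 64))
y = (X[:, :8].sum(axis=1) >= 5).astype(int)

# Two random halves of the data set.
order = rng.permutation(len(X))
split1, split2 = order[:300], order[300:]


def fit_rf(rows, seed):
    """Train a random forest classifier on the given rows of the data set."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    return model.fit(X[rows], y[rows])


# Assumed roles: the search only ever sees the optimisation model.
models = {
    "optimisation": fit_rf(split1, seed=0),
    "model_control": fit_rf(split1, seed=1),  # same data, different seed
    "data_control": fit_rf(split2, seed=0),   # different data, same seed
}


def score_all(candidates):
    """Predicted probability of the active class from each of the three models."""
    return {name: m.predict_proba(candidates)[:, 1] for name, m in models.items()}


# Rescore a few hypothetical candidates, as one would during a search.
candidates = rng.integers(0, 2, size=(5, 64))
scores = score_all(candidates)
for name, values in scores.items():
    print(name, np.round(values, 2))
```

Tracking the gap between the `optimisation` scores and the two control scores over the course of a search is what reveals the divergence discussed here.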

I almost highlighted this paper when it first appeared but was concerned by the relatively small sizes of the data sets used: 842, 667, and 842 molecules with 40, 140, and 59 active molecules, respectively. The paper by Langevin et al. suggests that this concern was justified.  

First they created a holdout set of 10% of the molecules, and repeated the procedure by Renz et al. on the remaining 90%. They showed that the differences in performance on the holdout set are the same as those observed by Renz et al., i.e. these differences have to do with the models/training sets themselves and not necessarily with the search algorithms.

To show that the divergence, in fact, has nothing to do with the search algorithms, they then demonstrated that the difference in model performance can be significantly reduced using two different approaches. One is to split the data into two sets that are as similar as possible. Another is to use a better RF model: 200 trees and at least 3 samples per leaf, instead of the 100 trees and 1 sample per leaf originally used by Renz et al.
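A rough sketch of how one could check this kind of effect, again with scikit-learn and synthetic data rather than the data sets from the paper. The two hyperparameter settings are the ones quoted above; everything else is an assumption for illustration: the hypothetical helper `control_gap`, the mean absolute score difference as the measure of disagreement, and label-stratified splitting as a simple stand-in for making the two splits "as similar as possible".

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for fingerprints and activity labels (NOT real assay data).
X = rng.integers(0, 2, size=(600, 64))
y = (X[:, :8].sum(axis=1) >= 5).astype(int)

# Hold out 10% of the molecules, then split the rest into two halves.
# Stratifying on the label is an assumed, simplified way to keep splits similar.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)
X1, X2, y1, y2 = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

original_params = {"n_estimators": 100, "min_samples_leaf": 1}
adjusted_params = {"n_estimators": 200, "min_samples_leaf": 3}


def control_gap(params):
    """Mean absolute difference in holdout scores between models trained on each split."""
    m1 = RandomForestClassifier(random_state=0, **params).fit(X1, y1)
    m2 = RandomForestClassifier(random_state=0, **params).fit(X2, y2)
    p1 = m1.predict_proba(X_hold)[:, 1]
    p2 = m2.predict_proba(X_hold)[:, 1]
    return float(np.mean(np.abs(p1 - p2)))


gap_original = control_gap(original_params)
gap_adjusted = control_gap(adjusted_params)
print(f"original settings: {gap_original:.3f}")
print(f"adjusted settings: {gap_adjusted:.3f}")
```

Because no search algorithm is involved, any gap measured this way can only come from the models and the training sets. On toy data like this the two settings need not reproduce the trend reported in the paper.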

This work is licensed under a Creative Commons Attribution 4.0 International License.