Monday, July 31, 2023

Real-World Molecular Out-Of-Distribution: Specification and Investigation

Prudencio Tossou, Cas Wognum, Michael Craig, Hadrien Mary, Emmanuel Noutahi (2023)
Highlighted by Jan Jensen

Part of Figure 1 from this report

Why do ML models perform much worse different test sets? There can be many reasons for such a shift in performance, but the main culprit is often a covariate shift meaning that the training and test set are quite different. This study seeks to quantify this effect for different molecular representations, ML algorithms, and datasets (both regression and classification).

The authors find that the difference between the test and train error (from a random split) is mostly governed by the representation (as opposed the the ML algorithm). Furthermore, representations that results in shorter distances between molecules (specifically 5-NN distances) on average are the ones that give a smaller difference in error between training and test set.  However, those representations do not necessarily result in lower test set errors. 

So you while you can't use representation distances to pick the representation you can use them to pick the best splitting method for obtaining your training set. The best test set it the one that with the shortest overall representation distance to the deployment set (i.e. the set you want to use your ML model on). The authors find that the best splitting method depends on the representation but is often scaffold splitting. 

Thanks to Cas Wogum for a very helpful discussion.

This work is licensed under a Creative Commons Attribution 4.0 International License.