Claudio Zeni, Andrea Anelli, Aldo Glielmo, and Kevin Rossi (2021)

Highlighted by Jan Jensen

ML models are generally thought to only interpolate, but this paper suggests that this is not the case. At first sight this seems counterintuitive, but on reflection it may not be so strange after all.

First of all, the authors define an extrapolation as a prediction for a point (red point) that lies outside the Convex Hull (blue line) defined by the training set points (blue points). They perform this analysis for three train/test sets related to solid state chemistry and show that between 80% and 100% of the test set data points lie outside the Convex Hull defined by the training set data points. Nevertheless, ML models trained on the training set perform satisfactorily on the test set (hence the title).
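To make the Convex Hull definition concrete, here is a minimal two-dimensional sketch in plain Python. It is only an illustration of the definition, not the authors' code: the function names (`convex_hull`, `is_extrapolation`) and the toy points are made up for this example, and it assumes the training set contains at least three points that are not all on one line. The representations in the paper have 100 or more dimensions, where a different method would be needed.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def is_extrapolation(train: List[Point], query: Point) -> bool:
    """True if query lies strictly outside the Convex Hull of train.

    Assumes train has at least three non-collinear points (hypothetical helper).
    """
    hull = convex_hull(train)
    n = len(hull)
    for i in range(n):
        # For a counter-clockwise hull, an outside point is to the right of some edge.
        if cross(hull[i], hull[(i + 1) % n], query) < 0:
            return True
    return False


# Toy data: four corners of a rectangle plus one interior training point.
train_points = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0)]
inside_query = (2.0, 2.0)   # inside the hull, so an interpolation
outside_query = (5.0, 1.0)  # outside the hull, so an extrapolation by this definition
```

Here `is_extrapolation(train_points, outside_query)` flags the point even though it sits just next to the training data, which is the situation the rest of this highlight is about.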

While this might seem counterintuitive at first, is it really so strange that a model trained on the blue points performs better for the red point than for the green point? The red point is closer to the blue points, and there is really only extrapolation in the x direction.

The representation vectors used in this study all have at least 100 dimensions, and a point is said to correspond to an extrapolation even if it lies outside the Convex Hull in only one of these dimensions. By using PCA the authors show that in some cases extrapolation occurs for all test points when considering only the 10 most important dimensions, while 20 dimensions are needed for truly accurate results. However, for most cases reasonable accuracy can be obtained with 4 dimensions, where more than 90% of the test set is contained in the Convex Hull of the training set. So IMO the picture is not as clear cut as the title suggests.

The authors show that the best predictor of accuracy is the density of training set points in the region of the test set molecule.
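One simple way to get a feel for "density of training set points near a test point" is the mean distance to the k nearest training points: a small value means a dense neighbourhood. The sketch below uses that proxy. It is an assumption chosen for illustration and not necessarily the density measure used in the paper; the function name `mean_knn_distance` and the toy data are hypothetical.

```python
import math
from typing import Sequence


def mean_knn_distance(
    train: Sequence[Sequence[float]], query: Sequence[float], k: int = 3
) -> float:
    """Mean Euclidean distance from query to its k nearest training points.

    Smaller values indicate a denser training neighbourhood (illustrative proxy only).
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training points")
    distances = sorted(math.dist(p, query) for p in train)
    return sum(distances[:k]) / k


# Toy data: a small cluster of training points in the unit square.
train_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near_query = (0.5, 0.5)  # surrounded by training points
far_query = (5.0, 5.0)   # far from all training points

near_score = mean_knn_distance(train_points, near_query)
far_score = mean_knn_distance(train_points, far_query)
```

With this proxy, `near_score` is smaller than `far_score`. Unlike the Convex Hull test, the measure works the same way in any number of dimensions, since `math.dist` accepts points of arbitrary (matching) length.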

This work is licensed under a Creative Commons Attribution 4.0 International License.