Monday, June 29, 2020

What Does the Machine Learn? Knowledge Representations of Chemical Reactivity

Joshua A. Kammeraad, Jack Goetz, Eric A. Walker, Ambuj Tewari, and Paul M. Zimmerman (2020)
Highlighted by Jan Jensen

Figure 1 from the paper (c) American Chemical Society 2020

While I don't agree with everything said in the paper, I highlight it here because I found it very thought-provoking.

The paper tests several feature sets and ML models for the prediction of activation energies and compares their performance to Evans−Polanyi relationships (i.e. where the activation energy is a linear function of the reaction energy for a given reaction class). The overall goal is to find an ML model that is "accurate & easy to interpret" (last panel of the figure above).
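For readers unfamiliar with the Evans−Polanyi form, here is a minimal sketch (my own illustration, not the authors' code) of what such a model amounts to in practice: one least-squares line per reaction class, i.e. two fitted parameters per class.

```python
# Minimal sketch of an Evans-Polanyi model: a separate linear fit of
# activation energy vs. reaction energy for each reaction class
# (slope + intercept, i.e. 2 parameters per class). Illustrative only.
import numpy as np

def fit_evans_polanyi(reaction_energies, activation_energies, reaction_classes):
    """Return {reaction_class: (slope, intercept)} from per-class least-squares fits."""
    reaction_energies = np.asarray(reaction_energies, dtype=float)
    activation_energies = np.asarray(activation_energies, dtype=float)
    reaction_classes = np.asarray(reaction_classes)
    params = {}
    for cls in np.unique(reaction_classes):
        mask = reaction_classes == cls
        slope, intercept = np.polyfit(reaction_energies[mask], activation_energies[mask], 1)
        params[cls] = (slope, intercept)
    return params

def predict_evans_polanyi(reaction_energy, reaction_class, params):
    slope, intercept = params[reaction_class]
    return slope * reaction_energy + intercept
```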

More specifically, the authors test SVM, NN, and 2-nearest neighbour models using several feature sets that all include the reaction energy. They find that all models and feature sets perform (roughly) equally well and conclude that "the machine-learning models do little more than memorize values from clusters of data points, where those clusters happened to be similar reaction types." 
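To make the setup concrete, here is a small sketch (again my own, with made-up numbers and labels, not the paper's data or code) of the kind of feature vector and model being compared: the reaction energy concatenated with a one-hot encoding of the reaction type, fed to a 2-nearest-neighbour regressor.

```python
# Illustrative sketch: reaction energy + one-hot reaction type as features
# for a 2-nearest-neighbour regressor. All numbers and labels are made up.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OneHotEncoder

reaction_energies = np.array([[-10.0], [5.0], [-3.0], [12.0]])    # kcal/mol (made up)
reaction_types = np.array([["A"], ["A"], ["B"], ["B"]])            # made-up class labels
activation_energies = np.array([8.0, 15.0, 20.0, 27.0])            # kcal/mol (made up)

one_hot = OneHotEncoder().fit_transform(reaction_types).toarray()  # one column per reaction type
X = np.hstack([reaction_energies, one_hot])                        # reaction energy + one-hot type

knn = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, activation_energies)
x_new = np.hstack([[[0.0]], one_hot[:1]])                          # a new reaction of type "A"
print(knn.predict(x_new))
```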

Furthermore, the authors show that an Evans−Polanyi model fitted for each reaction type is about 5% more accurate than the machine-learning models using a one-hot encoding of atom and bond types in addition to the reaction energy. They go on to write "This low-dimensionality model (2 parameters per reaction type) is algorithmically and conceptually easier to apply and can be evaluated using chemical principles, making it transferable to new reactions within the same class."

I would argue that the ML models have rediscovered the Evans−Polanyi model. From an ML perspective, the feature set of the Evans−Polanyi model is the reaction energy and (a one-hot encoding of) the reaction type. This representation is shown to work quite well with the ML models, and the lack of improvement upon including more features (such as atomic charges) shows that (almost) all the information needed for accurate predictions is contained in the reaction energy.

Furthermore, the fact that you get good results from the 2-nearest neighbour model (where the prediction is an average of the two nearest points) suggests that the relationship between reaction energy and activation energy is linear. If the average is distance-weighted and the linear relationship is exact, then the 2-nearest neighbour model gives exact results for any point that lies between its two nearest neighbours.
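A quick numerical check of this point (my own illustration, using an evenly spaced grid of made-up reaction energies): if the activation energy is an exact linear function of the reaction energy, a distance-weighted 2-nearest-neighbour model reproduces the line exactly for query points bracketed by their two neighbours.

```python
# Check: with an exactly linear relation, distance-weighted 2-NN predictions
# reproduce the line exactly for query points bracketed by their neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

x_train = np.linspace(-20.0, 20.0, 41).reshape(-1, 1)   # evenly spaced "reaction energies" (made up)
y_train = 0.5 * x_train.ravel() + 10.0                   # exactly linear "activation energies"

knn = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(x_train, y_train)

x_test = np.array([[-7.3], [0.4], [12.9]])               # interior, off-grid query points
print(np.allclose(knn.predict(x_test), 0.5 * x_test.ravel() + 10.0))  # expected: True
```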

The only "memorization" comes from the selection of reaction types. The selection of reaction types by the authors is done based on atom and bond types, so it's not surprising that a one-hot encoding of these properties also encodes these reaction types. 

Given the simplicity of the representation, a 2-nearest neighbour or SVM model does not necessarily require more data to parameterise than the Evans−Polanyi model.

In my opinion, the last panel in the figure above should be redrawn so that the concepts (the coloured shapes) are the inputs to the model, which in this case are the reaction energies and (one-hot encoded) reaction types.

These concepts are implicitly encoded in the molecular graph and can be learned by graph convolution using much more complex ML models and lots of data. But, in analogy with complex but accurate wavefunctions, which also encode these concepts implicitly, extracting them from a complex ML model is not necessarily possible. If one wants simple, qualitative explanations, one has to construct simple, qualitative models.

As Robert Mulliken said more than 50 years ago, the more accurate (and complex) the calculations become, the more the concepts tend to vanish into thin air. Nothing has changed in this regard.