Monday, June 29, 2020

What Does the Machine Learn? Knowledge Representations of Chemical Reactivity

Joshua A. Kammeraad, Jack Goetz, Eric A. Walker, Ambuj Tewari, and Paul M. Zimmerman (2020)
Highlighted by Jan Jensen

Figure 1 from the paper (c) American Chemical Society 2020

While I don't agree with everything said in the paper, I highlight it here because I found it very thought provoking. 

The paper tests several feature sets and ML models for the prediction of activation energies and compares their performance to Evans-Polanyi relationships (i.e. where the activation energy is a linear function of the reaction energy for certain reaction classes). The overall goal is to find an ML model that is "accurate & easy to interpret" (last panel of the figure above).

More specifically, the authors test SVM, NN, and 2-nearest neighbour models using several feature sets that all include the reaction energy. They find that all models and feature sets perform (roughly) equally well and conclude that "the machine-learning models do little more than memorize values from clusters of data points, where those clusters happened to be similar reaction types." 

Furthermore, the authors show that using an Evans−Polanyi model for the different reaction types is about 5% more accurate than the machine learning models using one-hot encoding of atom and bond types in addition to the reaction energy. They go on to write "This low-dimensionality model (2 parameters per reaction type) is algorithmically and conceptually easier to apply and can be evaluated using chemical principles, making it transferable to new reactions within the same class."

I would argue that the ML has rediscovered the Evans−Polanyi model. From an ML perspective, the feature set of the Evans−Polanyi model is the reaction energies and (a one-hot encoding of) the reaction types. This representation is shown to work quite well with the ML models, and the lack of improvement upon including more features (such as atomic charges) shows that (almost) all the information needed for accurate predictions is contained in the reaction energy. 
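This equivalence is easy to verify numerically. The sketch below is my own illustration (not from the paper), with made-up reaction classes and Evans−Polanyi parameters: a single linear regression on a feature set consisting of a one-hot encoding of the reaction type plus its product with the reaction energy recovers exactly the per-class intercepts and slopes.

```python
import numpy as np

# Two hypothetical reaction classes, each with its own Evans-Polanyi line
# Ea = alpha + beta * dE (the (alpha, beta) values are made up).
rng = np.random.default_rng(0)
dE = rng.uniform(-20, 20, size=100)          # reaction energies
cls = rng.integers(0, 2, size=100)           # reaction type, 0 or 1
lines = {0: (25.0, 0.4), 1: (40.0, 0.7)}
Ea = np.array([lines[c][0] + lines[c][1] * e for c, e in zip(cls, dE)])

# Feature set: one-hot encoding of the reaction type and its product with dE.
# A single linear regression on these features IS the per-class
# Evans-Polanyi model: one intercept and one slope per reaction type.
onehot = np.eye(2)[cls]
X = np.hstack([onehot, onehot * dE[:, None]])
coef, *_ = np.linalg.lstsq(X, Ea, rcond=None)
print(coef)  # approximately [25, 40, 0.4, 0.7]
```

The fitted coefficients are the two intercepts followed by the two slopes, i.e. the regression has rediscovered the class-specific Evans−Polanyi lines.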

Furthermore, the fact that you get good results from the 2-nearest neighbour model (where the prediction is an average of the two nearest points) suggests that the relationship between reaction energy and activation energy is linear. If the average is distance-weighted and the linear relationship is exact, then one would get exact results from the 2-nearest neighbour model for any prediction that interpolates between two training points.
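To see why, note that for a query lying between its two nearest training points, an inverse-distance-weighted 2-nearest-neighbour average is exactly linear interpolation. A minimal sketch (my own, with made-up Evans−Polanyi parameters):

```python
import numpy as np

# Hypothetical Evans-Polanyi line: Ea = alpha + beta * dE
alpha, beta = 30.0, 0.5
dE_train = np.array([-20.0, -10.0, 0.0, 10.0, 20.0])
Ea_train = alpha + beta * dE_train

def knn2_weighted(dE_query):
    """Inverse-distance-weighted 2-nearest-neighbour prediction.

    Assumes the query does not coincide exactly with a training point.
    """
    d = np.abs(dE_train - dE_query)
    i1, i2 = np.argsort(d)[:2]
    w1, w2 = 1.0 / d[i1], 1.0 / d[i2]
    return (w1 * Ea_train[i1] + w2 * Ea_train[i2]) / (w1 + w2)

# For a query between two training points the weighted average reduces to
# linear interpolation, so it recovers the underlying line exactly:
print(knn2_weighted(5.0))  # ~32.5, i.e. alpha + beta * 5.0
```

Outside the range of the training data the same weighting no longer interpolates, so the exactness only holds within the convex hull of the training points.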

The only "memorization" comes from the selection of reaction types. The selection of reaction types by the authors is done based on atom and bond types, so it's not surprising that a one-hot encoding of these properties also encodes these reaction types. 

Given the simplicity of the representation, 2-nearest neighbours or SVM do not necessarily require more data to parameterise than the Evans−Polanyi model.

In my opinion, the last panel in the figure above should be redrawn, so that the concepts (the coloured shapes) are the inputs to the model, which in this case are the reaction energies and (one-hot encoded) reaction types.

These concepts are implicitly encoded in the molecular graph and can be learned by graph convolution using much more complex ML models and lots of data. But, in analogy with complex but accurate wavefunctions, which also encode these concepts implicitly, extracting them from a complex ML model is not necessarily possible. If one wants simple, qualitative explanations, one has to construct simple, qualitative models.

As Robert Mulliken said more than 50 years ago, the more accurate (and complex) the calculations become the more the concepts tend to vanish into thin air. Nothing has changed in this regard.

Sunday, May 31, 2020

Learning Molecular Representations for Medicinal Chemistry

Kangway V. Chuang, Laura M. Gunsalus, and Michael J. Keiser (2020)
Highlighted by Jan Jensen

Figure 3 from the paper. (c) ACS 2020.

I found this miniperspective a very enjoyable read. It covers much more than the title suggests (at least to me), such as a mini history of deep learning in MedChem, when to use deep learning and when to use other ML techniques such as regression or random forests (see the figure above), and some of the fundamental challenges of using ML and generative models in MedChem (just to name a few).

I found the last topic particularly interesting and include two of my favourite quotes from the paper below, but I really recommend that you read the entire paper.

Critically, small-molecule drug discovery breaks standard assumptions in many technological applications of machine learning. Most machine learning algorithms operate on the assumption that training and testing data are independently and identically distributed (the i.i.d. assumption). For example, we would expect a standard image classifier trained to exclusively distinguish cats from dogs to generalize to new images of cats and dogs. This model will likely produce nonsensical classifications if asked to evaluate pictures of humans. In stark contrast, real-world drug-discovery breaks this standard i.i.d. assumption. The optimization and design of small molecules necessarily explore structural variations drawn from intentionally novel regions of chemical space. Large structural changes to small-molecule hits are typically required to become a lead. For a model to be useful to the practicing medicinal chemist, it must generalize to out-of-distribution examples.

Critically, if generative models are to guide drug design, they cannot merely produce trivial extensions of the training data set. It remains unclear whether the latent spaces of generative models, which effectively interpolate across the chemical space of the training data, are capable of usefully extrapolating into new regions of chemical structure space. Furthermore, current generative models are torn between novelty and accessibility.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Monday, May 18, 2020

Open Graph Benchmark: Datasets for Machine Learning on Graphs

A diverse collection of datasets for use in ML applications to graphs has been collected by Hu et al. The Benchmark is intuitively structured and includes evaluation protocols and metrics. Furthermore, the authors have reported the measured performance of a few popular approaches within each application (e.g., ROC-AUC, PRC-AUC, hits, or accuracy). There are several datasets in all three classes of tasks: node property prediction (ogbn-), link property prediction (ogbl-), and graph property prediction (ogbg-). 

Of particular interest to those of us who work in biochemistry broadly defined are the SMILES molecular graphs adapted from MoleculeNet [2], such as ogbg-molhiv (HIV) and ogbg-pcba (PubChem Bio Assay); however, ogbl-ppa (Protein-Protein Association) and ogbn-proteins (Protein-Protein Association) are also of interest. Note that MoleculeNet is not included in its entirety - far from it. So, that resource is definitely also interesting to have a close look at if you have not already explored it.

If you are the competitive type, your efforts can be submitted to scoreboards at the hosting website.

Thursday, April 30, 2020

Reactants, products, and transition states of elementary chemical reactions based on quantum chemistry

Colin A. Grambow, Lagnajit Pattanaik, William H. Green (2020)
Highlighted by Jan Jensen

Figure 1 from the paper. Reproduced under the CC BY-NC-ND 4.0 licence

This paper describes a new data set of DFT barrier heights for 12,000 diverse chemical reactions and should stimulate a lot of new ML studies on chemical reactivity.

The molecules are sampled from GDB-7 so they are relatively small and contain only H, C, N, and O. Each reaction is generated from a single molecule using single-ended GSM, so reactions with two reactants and two products are not represented in the data set. Other than these limitations the data set is very diverse:

The reactions span a wide range of both barriers and reaction energies (as seen in the figure above). Reactions with anywhere from 1 to 6 bond changes are represented (though there are only a handful with 6) as are changes to pretty much all bond types (C-H, C-C, C-N, etc). There are only 8 reaction templates with more than 100 examples and many have only a single reaction example. So, very diverse.

Best of all the authors provide atom-mapped reaction SMILES along with the barriers and reaction energies, which makes further benchmarking, analysis, and ML-studies very easy. It will be very exciting to see this data being put to good use!

This work is licensed under a Creative Commons Attribution 4.0 International License.

Tuesday, March 31, 2020

Semiautomated Transition State Localization for Organometallic Complexes with Semiempirical Quantum Chemical Methods

Highlighted by Jan Jensen

Automated and efficient TS searches are difficult and there are only a few benchmark studies out there. But this is the first paper I have come across where they attempt this for organometallics. Given the typical size of organometallic compounds, one needs something faster than DFT for efficiency, so semiempirical QM (SQM) methods are the obvious choice as long as these simpler methods can describe the chemistry accurately.

The authors have tested MOPAC and xTB interfaces to Zimmerman's growing string method (mGSM) on the 34 unimolecular reactions in the MOBH35 benchmark set. I couldn't find an explanation for the focus on unimolecular reactions, but the reason might be that it is easier to geometrically align reactants and products for these reactions.

GFN1-xTB and GFN2-xTB find reaction paths for 31 and 30 reactions, respectively, while the corresponding numbers for PM6-D3H4 and PM7 are 26 and 25, respectively. GFN2-xTB fails to find barriers for 2 reactions with < 1.5 kcal/mol barriers, so if these are discounted then GFN2-xTB performs best. 

The TS-guess structures (the highest energy point on the reaction paths) are generally in good agreement with DFT, with heavy atom RMSDs of <0.3 Å. It would have been interesting to know how many DFT TS searches converge starting from the SQM structures. The xTB barrier heights compare reasonably well with DFT, with a MAD of 8-9 kcal/mol. 

Saturday, February 29, 2020

The Synthesizability of Molecules Proposed by Generative Models

Wenhao Gao and Connor W. Coley (2020)
Highlighted by Jan Jensen

Figure 1 from the paper. (c) The authors 2020. The paper tests method c, d, and e

Disclaimer: I implemented one of the methods (graph based GA) being tested. 

It is well known that generative models (including genetic algorithms) can suggest very weird-looking molecules when used to optimise molecular properties. This is the first paper that I have come across that tries to quantify this problem by computing their synthesizability.

A molecule is defined as synthesizable if a computer-assisted synthesis planning (CASP) program can find a synthetic route to the molecule. The CASP program they used (ASKCOS) can find synthetic routes for 57-89% of molecules sampled from commonly used databases (or subsets) such as ChEMBL and ZINC. These databases generally contain molecules that have been made, so just because ASKCOS can't figure out how to make a molecule doesn't mean it can't be made.

The authors used ASKCOS to determine the fraction of synthesizable molecules suggested by three generative models (one ML-based and two GA-based methods) for several "hard" optimisation problems. The ML-based method tends to predict higher fractions of synthesizable molecules compared to GAs and for some properties none of the 100 top-scoring molecules suggested by the GAs were deemed synthesizable. 

The authors go on to show that, in many cases,  the fraction of synthesizable molecules can be increased significantly by including an empirical synthesizability measure in the scoring function, which is very welcome news to me. Furthermore, the top synthesizable molecules shown in the paper look very reasonable, which suggests that CASP programs can weed out the crazy structures.
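The general idea can be sketched in a few lines (my own illustration; the paper's actual scoring functions and synthesizability measures differ): penalise the property objective with an empirical heuristic such as the synthetic-accessibility (SA) score, where a higher value means harder to make.

```python
# Hypothetical composite scoring function for a generative model: reward
# the target property but penalise poor synthesizability. `property_score`
# and `sa_score` are placeholders for e.g. a logP objective and an
# SA-score-like heuristic (roughly 1 = easy ... 10 = hard to make).

def composite_score(property_score: float, sa_score: float,
                    weight: float = 0.5) -> float:
    return property_score - weight * sa_score

# A weird high-scoring molecule with poor synthesizability is now ranked
# below a slightly lower-scoring but makeable one:
print(composite_score(10.0, 9.0))  # 5.5
print(composite_score(9.0, 2.0))   # 8.0
```

The weight trades off property optimisation against synthesizability, so it would presumably need tuning per problem.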

One worry is that CASP programs are overly conservative and weed out viable structures that could teach us some genuinely new chemistry, but if generative models are to be taken seriously we obviously need a method to exclude the crazy molecules before we show them to synthetic chemists.

Monday, February 10, 2020

On the Completeness of Atomic Structure Representations

Here, I highlight an interesting recent preprint that tries to formalize and quantify something that I previously have posted here at Computational Chemistry Highlights (see the post on Atomistic Fingerprints here), namely how to best describe atomic environments in all their many-body glory. A widely held perception among practitioners of the "art" of molecular simulation is that while we usually restrict ourselves to 2-body effects for efficiency purposes, 3-body descriptions uniquely specify the atomic environment (up to a rotation and permutation of like atoms). Not the case (!) and the authors effectively debunk this belief with several concrete counter-examples. 

FIG. 1: "(a) Two structures with the same histogram of triangles (angles 45, 45, 90, 135, 135, 180 degrees). (b) A manifold of degenerate pairs of environments: in addition to three points A, B, B′, a fourth point C+ or C− is added, leading to two degenerate environments A+ and A−. (c) Degeneracies induce a transformation of feature space so that structures that should be far apart are brought close together."

Perhaps the most important implication of the work is that it in part helps us understand why modern machine-learning (ML) force fields appear to be so successful. At first sight the conclusion we face is daunting: for arbitrarily high accuracy, no n-point correlation cutoff may suffice to reconstruct the environment faithfully. Why, then, can recent ML force fields be used to calculate extensive properties, such as the molecular energy, so accurately? According to the results of Pozdnyakov, Willatt et al.'s work, low-correlation order representations often suffice in practice because, as they state, "the presence of many neighbors or of different species (that provide distinct “labels” to associate groups of distances and angles to specific atoms), and the possibility of using representations centred on nearby atoms to lift the degeneracy of environments reduces the detrimental effects of the lack of uniqueness of the power spectrum [the power spectrum is equivalent to the 3-body correlation, Madsen], when learning extensive properties such as the energy." However, the authors do suggest that introducing higher-order invariants that lift the detrimental degeneracies might be a better approach in general. In any case, the preprint raises many technical and highly relevant issues, and it would be well worth going over if you don't mind getting in the weeds with Maths.
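As an aside, the simpler 2-body version of this incompleteness is easy to demonstrate numerically. The classic homometric pair below (my example, not from the preprint) consists of two 1D point sets that share the same multiset of pairwise distances yet are not related by any translation or reflection, so distances alone cannot uniquely specify a structure.

```python
from collections import Counter
from itertools import combinations

# Classic homometric pair: different structures, identical 2-body
# (pairwise-distance) "fingerprint".
A = [0, 1, 4, 10, 12, 17]
B = [0, 1, 8, 11, 13, 17]

def distance_multiset(points):
    """Multiset of all pairwise distances (the 2-body correlation)."""
    return Counter(abs(p - q) for p, q in combinations(points, 2))

print(distance_multiset(A) == distance_multiset(B))  # True
# ...yet B is neither a shifted nor a mirrored copy of A:
print(sorted(17 - x for x in A) == B)                # False
```

The counter-examples in the preprint make the analogous (and more surprising) point one level up, for 3-body descriptors in 3D.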