Saturday, February 29, 2020

The Synthesizability of Molecules Proposed by Generative Models

Wenhao Gao and Connor W. Coley (2020)
Highlighted by Jan Jensen


Figure 1 from the paper. (c) The authors 2020. The paper tests methods c, d, and e.

Disclaimer: I implemented one of the methods (graph-based GA) being tested.

It is well known that generative models (including genetic algorithms) can suggest very weird-looking molecules when used to optimise molecular properties. This is the first paper that I have come across that tries to quantify this problem by computing their synthesizability.

A molecule is defined as synthesizable if a computer-assisted synthesis planning (CASP) program can find a synthetic route to it. The CASP program the authors used (ASKCOS) finds synthetic routes for 57-89% of molecules sampled from commonly used databases (or subsets of them) such as ChEMBL and ZINC. These databases generally contain molecules that have been made, so the fact that ASKCOS cannot find a route to a molecule does not mean the molecule cannot be made.

The authors used ASKCOS to determine the fraction of synthesizable molecules suggested by three generative models (one ML-based and two GA-based methods) for several "hard" optimisation problems. The ML-based method tends to yield higher fractions of synthesizable molecules than the GAs, and for some properties none of the 100 top-scoring molecules suggested by the GAs were deemed synthesizable.
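The metric described above can be sketched in a few lines. This is only an illustration: `has_route` is a hypothetical stand-in for a call to a CASP program such as ASKCOS (I am not reproducing its actual interface), and the SMILES strings and scores are made-up toy values, not results from the paper.

```python
from typing import Callable, Dict


def synthesizable_fraction(
    scores: Dict[str, float],
    has_route: Callable[[str], bool],
    top_k: int = 100,
) -> float:
    """Fraction of the top_k highest-scoring molecules for which a route is found.

    scores    -- maps a molecule (here a SMILES string) to its objective score
    has_route -- hypothetical callable standing in for a CASP program
    """
    # Rank molecules from best to worst score and keep the top_k.
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    if not ranked:
        return 0.0
    found = sum(1 for smiles in ranked if has_route(smiles))
    return found / len(ranked)


# Toy data (invented for illustration only).
toy_scores = {"CCO": 0.2, "c1ccccc1": 0.9, "C1CC1C#N": 0.7, "CC(=O)O": 0.5}
toy_routes = {"CCO", "CC(=O)O", "c1ccccc1"}  # molecules the fake planner "solves"

fraction = synthesizable_fraction(toy_scores, lambda s: s in toy_routes, top_k=2)
print(fraction)
```

With these toy values the two top-scoring molecules are considered, and only one of them has a route in the fake planner.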

The authors go on to show that, in many cases, the fraction of synthesizable molecules can be increased significantly by including an empirical synthesizability measure in the scoring function, which is very welcome news to me. Furthermore, the top synthesizable molecules shown in the paper look very reasonable, which suggests that CASP programs can weed out the crazy structures.
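One way to fold an empirical synthesizability measure into a scoring function is sketched below. The functional form is my assumption and not necessarily what the authors used: it takes a synthetic accessibility score on the usual 1 (easy) to 10 (hard) scale, rescales it to the unit interval, and multiplies it onto the property score. The names `sa_to_unit_interval`, `combined_score`, and `weight` are hypothetical.

```python
def sa_to_unit_interval(sa_score: float) -> float:
    """Map an SA-type score (1 = easy, 10 = hard) to [0, 1], where 1 = easy."""
    clipped = min(max(sa_score, 1.0), 10.0)  # guard against out-of-range input
    return (10.0 - clipped) / 9.0


def combined_score(property_score: float, sa_score: float, weight: float = 1.0) -> float:
    """Property score damped by a synthesizability factor.

    weight -- hypothetical exponent controlling how strongly hard-to-make
              molecules are penalised (0 switches the penalty off).
    """
    return property_score * sa_to_unit_interval(sa_score) ** weight


# A molecule with a good property score but a poor SA score loses out
# to a slightly worse but much easier-to-make one.
print(combined_score(0.9, 8.2))
print(combined_score(0.8, 2.8))
```

A multiplicative penalty means a molecule judged very hard to make cannot score well however good its predicted property is; an additive penalty would be an equally plausible choice.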

One worry is that CASP programs are overly conservative and weed out viable structures that could teach us some genuinely new chemistry, but if generative models are to be taken seriously we obviously need a method to exclude the crazy molecules before we show them to synthetic chemists.


Monday, February 10, 2020

On the Completeness of Atomic Structure Representations


Here, I highlight an interesting recent preprint that tries to formalize and quantify something that I have previously posted about here at Computational Chemistry Highlights (see the post on Atomistic Fingerprints here), namely how best to describe atomic environments in all their many-body glory. A widely held perception among practitioners of the "art" of molecular simulation is that, while we usually restrict ourselves to 2-body effects for efficiency, 3-body descriptions uniquely specify the atomic environment (up to a rotation and permutation of like atoms). This is not the case (!), and the authors effectively debunk this belief with several concrete counter-examples.


FIG. 1: "(a) Two structures with the same histogram of triangles; (angles 45, 45, 90, 135, 135, 180 degrees) (b) A manifold of degenerate pairs of environments: In addition to three points A, B, B′ a fourth point C+ or C− is added leading to two degenerate environments, X+ and X−. (c) Degeneracies induce a transformation of feature space so that structures that should be far apart are brought close together."
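The kind of degeneracy in panel (a) is easy to check numerically. The sketch below assumes four neighbours placed on a circle around the central atom, so that every "triangle" (central atom plus two neighbours) is characterised by its angle alone. The polar angles are my own construction, chosen to reproduce the list of angles in the caption; they are not necessarily the exact structures the authors drew.

```python
from itertools import combinations


def angle_histogram(polar_angles):
    """Sorted list of angles (degrees) subtended at the centre by each neighbour pair."""
    hist = []
    for a, b in combinations(polar_angles, 2):
        d = abs(a - b) % 360
        hist.append(min(d, 360 - d))  # angle between two directions lies in [0, 180]
    return sorted(hist)


def canonical_gaps(polar_angles):
    """Cyclic sequence of gaps between neighbours, made invariant to
    rotation and reflection. Two sets of points on a circle are congruent
    exactly when these canonical sequences agree."""
    pts = sorted(a % 360 for a in polar_angles)
    gaps = [(pts[(i + 1) % len(pts)] - pts[i]) % 360 for i in range(len(pts))]
    candidates = []
    for seq in (gaps, gaps[::-1]):      # original and mirrored order
        for i in range(len(seq)):       # every rotation
            candidates.append(tuple(seq[i:] + seq[:i]))
    return min(candidates)


# Neighbour positions as polar angles in degrees (assumed, see lead-in).
structure_a = [0, 45, 135, 180]
structure_b = [0, 45, 90, 225]

print(angle_histogram(structure_a))
print(angle_histogram(structure_b))
print(canonical_gaps(structure_a), canonical_gaps(structure_b))
```

Both structures give the same six angles, so a descriptor built only from distances and angles cannot tell them apart, yet their gap sequences differ, meaning no rotation or reflection maps one onto the other.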

Perhaps the most important implication of the work is that it helps us understand, at least in part, why modern machine-learning (ML) force fields appear to be so successful. At first sight the conclusion we face is daunting: for arbitrarily high accuracy, no n-point correlation cutoff may suffice to reconstruct the environment faithfully. Why, then, can recent ML force fields be used to calculate extensive properties such as the molecular energy so accurately? According to Pozdnyakov, Willatt et al., low-correlation-order representations often suffice in practice because, as they state, "the presence of many neighbors or of different species (that provide distinct “labels” to associate groups of distances and angles to specific atoms), and the possibility of using representations centred on nearby atoms to lift the degeneracy of environments reduces the detrimental effects of the lack of uniqueness of the power spectrum [the power spectrum is equivalent to the 3-body correlation, Madsen], when learning extensive properties such as the energy." However, the authors do suggest that introducing higher-order invariants that lift the detrimental degeneracies might be a better approach in general. In any case, the preprint raises many technical and highly relevant issues, and it is well worth going over if you don't mind getting into the weeds with the maths.