Tuesday, March 31, 2020

Semiautomated Transition State Localization for Organometallic Complexes with Semiempirical Quantum Chemical Methods

Highlighted by Jan Jensen


Automated and efficient TS searches are difficult, and there are only a few benchmark studies out there. This is the first paper I have come across that attempts this for organometallics. Given the typical size of organometallic compounds, one needs something faster than DFT for efficiency, so semiempirical QM (SQM) methods are the obvious choice, as long as these simpler methods can describe the chemistry accurately.

The authors have tested MOPAC and xTB interfaces to Zimmerman's growing string method (mGSM) on the 34 unimolecular reactions in the MOBH35 benchmark set. I couldn't find an explanation for the focus on unimolecular reactions, but the reason might be that it is easier to geometrically align reactants and products for these reactions.

GFN1-xTB and GFN2-xTB find reaction paths for 31 and 30 reactions, respectively, while the corresponding numbers for PM6-D3H4 and PM7 are 26 and 25, respectively. GFN2-xTB fails to find barriers for 2 reactions with < 1.5 kcal/mol barriers, so if these are discounted then GFN2-xTB performs best. 

The TS-guess structures (the highest-energy point on the reaction paths) are generally in good agreement with DFT, with heavy-atom RMSDs of <0.3 Å. It would have been interesting to know how many DFT TS searches converge starting from the SQM structures. The xTB barrier heights compare reasonably well with DFT, with a MAD of 8-9 kcal/mol. 
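Heavy-atom RMSD comparisons like the one above are typically done after an optimal superposition of the two structures. A minimal sketch using the Kabsch algorithm (my own illustration, not the paper's code; atom ordering is assumed consistent and hydrogens already excluded):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two conformations (N x 3 arrays) after optimal
    rigid-body superposition (Kabsch algorithm)."""
    P = P - P.mean(axis=0)              # remove translation
    Q = Q - Q.mean(axis=0)
    C = P.T @ Q                         # covariance matrix
    V, S, Wt = np.linalg.svd(C)
    d = np.sign(np.linalg.det(V @ Wt))  # guard against improper rotation
    R = V @ np.diag([1.0, 1.0, d]) @ Wt
    diff = P @ R - Q
    return np.sqrt((diff ** 2).sum() / len(P))
```

Applied to an SQM TS guess and the corresponding DFT TS, this gives the kind of heavy-atom RMSD quoted in the paper.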

Saturday, February 29, 2020

The Synthesizability of Molecules Proposed by Generative Models

Wenhao Gao and Connor W. Coley (2020)
Highlighted by Jan Jensen


Figure 1 from the paper. (c) The authors 2020. The paper tests methods c, d, and e.

Disclaimer: I implemented one of the methods (graph based GA) being tested. 

It is well known that generative models (including genetic algorithms) can suggest very weird-looking molecules when used to optimise molecular properties. This is the first paper that I have come across that tries to quantify this problem by computing their synthesizability.

A molecule is defined as synthesizable if a computer-assisted synthesis planning (CASP) program can find a synthetic route to it. The CASP program they used (ASKCOS) can find synthetic routes for 57-89% of molecules sampled from commonly used databases (or subsets thereof) such as ChEMBL and ZINC. These databases generally contain molecules that have been made, so just because ASKCOS can't figure out how to make a molecule doesn't mean it can't be made.

The authors used ASKCOS to determine the fraction of synthesizable molecules suggested by three generative models (one ML-based and two GA-based methods) for several "hard" optimisation problems. The ML-based method tends to suggest higher fractions of synthesizable molecules than the GAs, and for some properties none of the 100 top-scoring molecules suggested by the GAs were deemed synthesizable. 

The authors go on to show that, in many cases, the fraction of synthesizable molecules can be increased significantly by including an empirical synthesizability measure in the scoring function, which is very welcome news to me. Furthermore, the top synthesizable molecules shown in the paper look very reasonable, which suggests that CASP programs can weed out the crazy structures.
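One simple way to fold an empirical synthesizability measure into a generative model's objective is a weighted combination of the target property with a synthetic-accessibility (SA) score. The rescaling and weighting below are my own assumptions for illustration, not the paper's recipe; the SA scale (1 = easy, 10 = hard) follows the commonly used RDKit convention:

```python
def combined_score(property_score, sa_score, weight=0.5):
    """Toy composite objective: reward the target property but
    penalise molecules that look hard to make.

    property_score: target property, assumed normalised to [0, 1].
    sa_score: synthetic-accessibility score, 1 (easy) ... 10 (hard).
    """
    ease = (10.0 - sa_score) / 9.0  # map SA score to a [0, 1] "ease" factor
    return (1.0 - weight) * property_score + weight * ease
```

A high-property but essentially unmakeable molecule (sa_score near 10) then scores no better than a mediocre, easily made one, which steers the optimiser away from the crazy structures.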

One worry is that CASP programs are overly conservative and weed out viable structures that could teach us some genuinely new chemistry, but if generative models are to be taken seriously we obviously need a method to exclude the crazy molecules before we show them to synthetic chemists.


Monday, February 10, 2020

On the Completeness of Atomic Structure Representations


Here, I highlight an interesting recent preprint that tries to formalize and quantify something I have previously posted about here at Computational Chemistry Highlights (see the post on Atomistic Fingerprints), namely how to best describe atomic environments in all their many-body glory. A widely held perception among practitioners of the "art" of molecular simulation is that while we usually restrict ourselves to 2-body effects for efficiency purposes, 3-body descriptions uniquely specify the atomic environment (up to a rotation and permutation of like atoms). This is not the case (!), and the authors effectively debunk the belief with several concrete counter-examples. 


FIG. 1: "(a) Two structures with the same histogram of triangles (angles 45, 45, 90, 135, 135, 180 degrees). (b) A manifold of degenerate pairs of environments: in addition to three points A, B, B′, a fourth point C+ or C− is added, leading to two degenerate environments, A+ and A−. (c) Degeneracies induce a transformation of feature space so that structures that should be far apart are brought close together."

Perhaps the most important implication of the work is that it helps us understand, in part, why modern machine-learning (ML) force fields appear to be so successful. At first sight the conclusion we face is daunting: for arbitrarily high accuracy, no n-point correlation cutoff may suffice to reconstruct the environment faithfully. Why, then, can recent ML force fields calculate extensive properties such as the molecular energy so accurately? According to the results of Pozdnyakov, Willatt et al.'s work, low-correlation-order representations often suffice in practice because, as they state, "the presence of many neighbors or of different species (that provide distinct “labels” to associate groups of distances and angles to specific atoms), and the possibility of using representations centred on nearby atoms to lift the degeneracy of environments reduces the detrimental effects of the lack of uniqueness of the power spectrum [the power spectrum is equivalent to the 3-body correlation, Madsen], when learning extensive properties such as the energy." However, the authors do suggest that introducing higher-order invariants that lift the detrimental degeneracies might be a better approach in general. In any case, the preprint raises many technical and highly relevant issues, and it would be well worth going over if you don't mind getting into the weeds with the maths.
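The flavour of these degeneracies is easy to demonstrate one correlation order down: distinct structures can share an identical pair-distance histogram, so 2-body information alone cannot reconstruct them. The classic 1D "homometric" pair below is a textbook example of my own choosing, not one of the paper's counter-examples (which concern the 3-body case):

```python
from itertools import combinations

def pair_distances(points):
    """Multiset of all pairwise distances for a 1D point set."""
    return sorted(abs(a - b) for a, b in combinations(points, 2))

# Two non-congruent point sets (no translation/reflection maps one
# onto the other) with identical pair-distance histograms:
A = [0, 1, 4, 10, 12, 17]
B = [0, 1, 8, 11, 13, 17]

print(pair_distances(A) == pair_distances(B))  # True
```

Any descriptor built only from pair distances assigns A and B the same fingerprint; the preprint shows that analogous (if rarer) degeneracies survive at the 3-body level.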

Wednesday, January 29, 2020

Discovery of a Difluoroglycine Synthesis Method through Quantum Chemical Calculations

Tsuyoshi Mita, Yu Harabuchi and Satoshi Maeda (2020)
Highlighted by Jan Jensen

TOC graphic. © The Authors 2020. Reproduced under the CC-BY-NC-ND 4.0 license.

In this paper the authors use DFT calculations to identify a synthetic route to difluoroglycine. 

They started by applying the single component artificial force induced reaction (SC-AFIR) method to difluoroglycine. In SC-AFIR, artificial forces are introduced between functional groups, which force them either to react or to dissociate from one another. 
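The bias underlying AFIR-type methods is a weighted mean of inter-fragment distances added to the true energy, with weights that emphasise close contacts. The sketch below shows the general functional form only, with parameter choices (alpha, the exponent p) that are my own placeholders; the actual SC-AFIR implementation in GRRM differs in detail:

```python
import numpy as np

def afir_energy(E, coords, frag_a, frag_b, radii, alpha=1.0, p=6):
    """Model AFIR-style biased energy: E plus alpha times a weighted
    mean distance between two fragments (lists of atom indices).
    Positive alpha pushes the fragments together, forcing a reaction."""
    num, den = 0.0, 0.0
    for i in frag_a:
        for j in frag_b:
            r = np.linalg.norm(coords[i] - coords[j])
            w = ((radii[i] + radii[j]) / r) ** p  # favour close contacts
            num += w * r
            den += w
    return E + alpha * num / den
```

Scanning this biased surface with different fragment pairings is what generates the large network of equilibrium structures and transition states described next.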

This yielded 288 equilibrium structures and 309 transition states. They selected NH3 + :CF2 + CO2 for further study because the reaction 1) is predicted to be very exothermic (i.e. high yield), 2) has a low barrier, and 3) involves NH3 and CO2, which are readily available.

:CF2 can be generated by a variety of methods and the authors initially chose Me3SiCF3, which generates CF3-, which in turn dissociates to :CF2 and F-. They then generated the reaction network for NH3 + CF3- + CO2 and performed a kinetic analysis, which predicted that "the calculated yield of difluoroglycine is almost zero because the equilibrium between CF3- and CF3 + F- favours the former. As a result, CF3CO2-, in which CF3- is directly bound to CO2, was obtained as the main product (99.8%)."
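The paper's kinetic analysis integrates rate equations over the full reaction network, but the qualitative outcome, that a modest barrier difference between competing channels translates into near-exclusive formation of one product, already follows from simple transition-state-theory branching. A toy illustration (my own, not the authors' analysis):

```python
import numpy as np

R = 1.987e-3  # gas constant in kcal/(mol K)

def branching(barriers_kcal, T=298.15):
    """Branching ratios for competing channels out of a common
    intermediate, assuming rates k ~ exp(-barrier / RT) with equal
    prefactors (a crude but illustrative approximation)."""
    k = np.exp(-np.asarray(barriers_kcal) / (R * T))
    return k / k.sum()

# A channel 5 kcal/mol below its competitor dominates almost completely:
print(branching([10.0, 15.0]))
```

This is why the CF3-/CF3CO2- equilibrium, which strongly favours direct CO2 capture, shuts down the difluoroglycine channel almost entirely in the first scenario.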

A similar analysis was performed for NH3 + CF2Br- + CO2, which predicted a higher yield for difluoroglycine, but also a minor by-product NH2CO2CHF2 due to proton transfer from NH3 to :CF2. Thus, to increase the yield, the authors repeated the analysis for NMe3 + CF2Br- + CO2, which predicted a >99% yield.

Finally, the predicted synthetic route was tested experimentally, and the reaction conditions (such as solvent, temperature, and silane activator) were optimised, resulting in a 96% yield. However, it was only possible to purify the ester.

This is the first study I have seen where a synthetic route (of an admittedly very simple molecule) is predicted from DFT calculations. Hopefully the first of many. However, as the authors note, "It is undeniable that the experience and intuition of chemists, or even luck, contributed to appropriate choices being made."

The calculations were performed with the GRRM17 program. It appears to be free, but I don't believe it is open source.



This work is licensed under a Creative Commons Attribution 4.0 International License.

Tuesday, December 31, 2019

Schrödinger-ANI: An Eight-Element Neural Network Interaction Potential with Greatly Expanded Coverage of Druglike Chemical Space

James M. Stevenson, Leif D. Jacobson, Yutong Zhao, Chuanjie Wu, Jon Maple, Karl Leswing, Edward Harder, and Robert Abel (2019)
Highlighted by Jan Jensen



There are a lot of ML models trained on HCNO data sets. This is fine for proof-of-concept studies, but it severely limits applications to real-world problems. For example, HCNO-only molecules comprise only 46% of the molecules in the ChEMBL database of drug-like molecules.

The main problem with extending these methods to other elements is that the size of the chemical space grows non-linearly with respect to the number of different elements. Furthermore, there aren't any generally available and comprehensive sets of molecules similar to QMx or GDB-x.

The current study extends the ANI-1 method to H, C, N, O, S, F, Cl, and P, which covers 94% of ChEMBL. The authors use a combination of stochastic sampling and extensive pre-screening to distill the training/validation set down to only 10 million DFT single-point calculations on relatively small molecules, which took "just a few days and at a very reasonable compute cost".

The main focus was on relative conformer energies, since the bulk of CPU time for many studies is typically spent on conformational searches. The RMSE for this data is 0.70 kcal/mol relative to DFT, which is quite impressive.
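For relative conformer energies, each method's energies have to be put on a common footing before computing the RMSE. A minimal sketch of one common convention, referencing each method to its own lowest conformer (the paper's exact referencing scheme may differ):

```python
import numpy as np

def rel_energy_rmse(e_pred, e_ref):
    """RMSE of relative conformer energies: shift each method's
    energies to its own lowest-energy conformer before comparing."""
    e_pred = np.asarray(e_pred, dtype=float)
    e_ref = np.asarray(e_ref, dtype=float)
    e_pred = e_pred - e_pred.min()
    e_ref = e_ref - e_ref.min()
    return float(np.sqrt(np.mean((e_pred - e_ref) ** 2)))
```

With energies in kcal/mol, this is the quantity for which the paper reports 0.70 kcal/mol against DFT.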

As the name suggests, the work was done at Schrödinger, so the method is not open sourced. However, an earlier version for 4 elements is available here. More importantly, the methodology behind the dataset generation is well described and appears to be practically feasible for academic labs.



Wednesday, November 27, 2019

Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions

K. T. Schütt, M. Gastegger, A. Tkatchenko, K.-R. Müller & R. J. Maurer  (2019)
Highlighted by Jan Jensen

Figure 2 from the paper. © The Authors 2019. Reproduced under the CC-BY license.

This paper has received a lot of attention, so I had to see what the fuss was about. The method (SchNOrb) uses a deep neural network to create a Hamiltonian matrix from 3D coordinates, which can then be diagonalised to yield orbital energies and orbitals. SchNOrb is trained to reproduce Hamiltonian and overlap matrix elements, total energies (which are computed as the sum of orbital energies in the ML model), and gradients, all taken from AIMD trajectories. 
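The "diagonalise the predicted matrices" step is the standard generalised eigenproblem HC = SCε from quantum chemistry. A self-contained sketch via Löwdin orthogonalisation (my own illustration of the linear algebra, not SchNOrb code; the occupation convention is an assumption):

```python
import numpy as np

def orbitals_from_matrices(H, S, n_occ):
    """Solve H C = S C eps for a predicted Hamiltonian H and overlap S
    (both symmetric), then sum the lowest n_occ orbital energies,
    mimicking how an ML-predicted (H, S) pair yields orbitals."""
    s_val, s_vec = np.linalg.eigh(S)
    S_inv_sqrt = s_vec @ np.diag(s_val ** -0.5) @ s_vec.T  # Lowdin S^(-1/2)
    Hp = S_inv_sqrt @ H @ S_inv_sqrt    # orthogonalised Hamiltonian
    eps, Cp = np.linalg.eigh(Hp)        # orbital energies, ascending
    C = S_inv_sqrt @ Cp                 # back-transform MO coefficients
    energy = eps[:n_occ].sum()          # sum of occupied orbital energies
    return eps, C, energy
```

Because the network only has to predict matrix elements, every downstream orbital property comes "for free" from this diagonalisation.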

So it's a bit like the ANI-1 method except that you also get orbitals, which can be used to compute other properties without additional parameterisation. One crucial difference though is that, as far as I can tell, SchNOrb is parameterised for each molecule separately. 

The model uses about 93 million parameters for a >100 AO Hamiltonian and requires 25,000 structures and 80 GPU hours to train. Once trained it can predict an energy with meV accuracy in about 50 ms on a GPU.

The software is "available upon request".

The reviews are made available and are well worth reading.

Thursday, October 31, 2019

The minimum parameterization of the wave function for the many-body electronic Schrödinger equation. I. Theory and ansatz

Lasse Kragh Sørensen (2019)
Highlighted by Jan Jensen

It is well known that Full CI (FCI) scales exponentially with the basis set size. However, Sørensen claims that the "whole notion of the exponentially scaling nature of quantum mechanics simply stems from expanding the wavefunction in a sub-optimal basis", i.e. one-electron functions. Sørensen goes on to argue that if two-electron functions (geminals) are used instead, the scaling is reduced to m(m−1)/2, where m is the number of basis functions. Furthermore, because "the number of parameters is independent of the electronic structure the multi-configurational problem is a mere phantom summoned by a poor choice of basis for the wave function". 
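To get a feel for the claimed reduction, one can compare the quadratic geminal parameter count m(m−1)/2 with the combinatorial number of determinants in a naive FCI expansion. The comparison below is my own back-of-the-envelope illustration (using C(m, n) determinants for n electrons in m spin orbitals, ignoring spin and symmetry restrictions):

```python
from math import comb

def geminal_params(m):
    """Claimed geminal parameter count: m(m-1)/2 for m basis functions."""
    return m * (m - 1) // 2

def fci_determinants(m, n):
    """Determinants in a naive full-CI expansion of n electrons
    in m spin orbitals (spin/symmetry restrictions ignored)."""
    return comb(m, n)

for m in (10, 20, 40):
    print(m, geminal_params(m), fci_determinants(m, m // 2))
```

The determinant count explodes combinatorially while the geminal count grows only quadratically, which is exactly the gap Sørensen's argument hinges on.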

I don't have the background to tell whether the arguments in this paper are correct, and the main point of this post is to see if I can get some feedback from people who do have the background. 

In principle one could simply compare the new approach to FCI calculations but the new method isn't quite there yet:
A straight forward minimization of Eq. 84 unfortunately gives the solution for the two-electron problem of a +(N−2) charged system N/2 times so additional constraints must be introduced. These constraints can be found by the property of the wave function in limiting cases. The problem of finding the constraints for the geminals in the AGP ansatz is therefore closely related to finding the N-representability constraints for the two-body reduced density matrix (2-RDM). For the N-representability Mazziotti have showed a constructive solution[74] though the exact conditions are still illusive.[75, 76]