Sunday, April 26, 2015

Big Data Meets Quantum Chemistry Approximations: The ∆-Machine Learning Approach

Contributed by Jan Jensen
Figure 1. Two hypothetical property profiles connecting two constitutional isomers of C$_7$H$_{10}$O$_2$. The Δ-model estimates the difference between baseline and target line properties (arrow), which differ in level of theory (b → t), geometry ($R_b$ → $R_t$), and property ($E_b$ → $H_t$). Reprinted with permission from J. Chem. Theory Comput. 2015, ASAP. Copyright (2015) American Chemical Society.

The idea behind this method is best explained by a specific example. The G4MP2 enthalpies [$H_t(R_t)$] of C$_7$H$_{10}$O$_2$ isomers are estimated using PM7 electronic energies [$E_b(R_b)$] by
$$H_t(R_t) \approx \Delta_b^t(R_b) = E_b(R_b)+ \sum_{i=1}^N\alpha_i e^{-|R_i-R_b|/\sigma}$$
Here {$\alpha_i$} and $\sigma$ are parameters found by regression using a training set of $N$ molecules, and $|R_i-R_b|$ measures how similar the molecule of interest is to training molecule $i$. The latter is described in more detail here, but I found it interesting enough to summarize in this post.
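To make the regression step concrete, here is a minimal sketch in plain Python. It assumes the kernel decays with distance, $e^{-d/\sigma}$, and that the $\alpha_i$ come from kernel ridge regression with a small regularizer `lam`; the post only says "regression", so both choices, along with every function name and the toy numbers in the usage note, are assumptions for illustration. In practice a linear-algebra library would replace the hand-written solver.

```python
import math

def laplacian_kernel(d, sigma):
    """Similarity exp(-d / sigma) for a distance d >= 0 (assumed kernel form)."""
    return math.exp(-d / sigma)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_alphas(dist, baseline, target, sigma, lam=1e-8):
    """Fit alpha_i on the training set.

    dist[i][j] is the distance between training molecules i and j,
    baseline[i] is the cheap property (e.g. a PM7 energy) and
    target[i] the expensive one (e.g. a G4MP2 enthalpy).
    """
    n = len(baseline)
    K = [[laplacian_kernel(dist[i][j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    residual = [t - b for t, b in zip(target, baseline)]
    return solve(K, residual)

def predict(baseline_value, dists_to_training, alphas, sigma):
    """Baseline value plus the learned kernel correction."""
    correction = sum(a * laplacian_kernel(d, sigma)
                     for a, d in zip(alphas, dists_to_training))
    return baseline_value + correction
```

With three made-up training molecules, `fit_alphas([[0, 1, 3], [1, 0, 2], [3, 2, 0]], [10.0, 12.0, 15.0], [11.0, 11.5, 17.0], sigma=2.0)` returns coefficients that reproduce the training targets almost exactly, while a molecule far from every training point gets essentially no correction and falls back to its baseline value.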

A Coulomb matrix ($\mathbf{C}$) is constructed for each molecule
$$C_{kl}= \begin{cases}
 0.5 Z_k^{2.4} & \text{if } k=l\\
 Z_kZ_l/r_{kl} & \text{if } k \ne l
\end{cases}$$
where $r_{kl}$ is the distance between atoms $k$ and $l$ and $Z_k$ is the nuclear charge of atom $k$. The rows and columns are then sorted so that the diagonal elements are in descending order, and the similarity is computed by
$$|R_i-R_b| = \sum_{k,l} |C_{kl}^i - C_{kl}^b | $$
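The descriptor and distance above can be sketched in a few lines of plain Python. The function names are hypothetical, the post does not state the unit of $r_{kl}$ (so the coordinates are simply assumed to be in one consistent unit), and ties between atoms of the same element are left in input order, which a real implementation may need to handle differently. Both matrices must have the same size, which holds for constitutional isomers.

```python
import math

def coulomb_matrix(charges, coords):
    """Build C with 0.5 * Z_k**2.4 on the diagonal and Z_k * Z_l / r_kl elsewhere."""
    n = len(charges)
    C = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k == l:
                C[k][l] = 0.5 * charges[k] ** 2.4
            else:
                r_kl = math.dist(coords[k], coords[l])
                C[k][l] = charges[k] * charges[l] / r_kl
    return C

def sort_by_diagonal(C):
    """Permute rows and columns together so the diagonal is in descending order."""
    order = sorted(range(len(C)), key=lambda k: C[k][k], reverse=True)
    return [[C[k][l] for l in order] for k in order]

def manhattan_distance(Ci, Cb):
    """Sum of absolute element-wise differences between two equally sized matrices."""
    return sum(abs(a - b)
               for row_i, row_b in zip(Ci, Cb)
               for a, b in zip(row_i, row_b))

# Illustrative water-like geometry (made-up coordinates), listed as H, O, H.
charges = [1, 8, 1]
coords = [(0.757, 0.586, 0.0), (0.0, 0.0, 0.0), (-0.757, 0.586, 0.0)]
C_water = sort_by_diagonal(coulomb_matrix(charges, coords))
```

Sorting makes the descriptor independent of the order in which the atoms are listed, so the same geometry entered as O, H, H gives a distance of zero to `C_water`.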
Using this approach and a training set of $N$ = 1000 molecules, the G4MP2 atomization enthalpies of 6095 constitutional isomers of C$_7$H$_{10}$O$_2$ can be reproduced with an MAE of 3.9 kcal/mol using PM7, compared to an MAE of 6.4 kcal/mol for uncorrected PM7. Using PBE or B3LYP/6-31G(2df,p), the MAE can be brought below 1 kcal/mol using a 1K training set.

In another interesting application, the MAE of RHF/6-31G(d) relative to CCSD(T)/6-31G(d) atomization energies for the same set of molecules can be reduced from 3 to less than 1 kcal/mol using a 1K training set.

This is thus a very interesting approach for obtaining chemical accuracy using methods that are sufficiently fast to study thousands of molecules. The caveat is that about 1000 high-level calculations appear to be needed to train the method, but perhaps more generally applicable parameter sets can be found using, for example, functional group identification.

This work is licensed under a Creative Commons Attribution 4.0