Sunday, April 26, 2015

Big Data Meets Quantum Chemistry Approximations: The ∆-Machine Learning Approach

Contributed by Jan Jensen
Figure 1. Two hypothetical property profiles connecting two constitutional isomers of C$_7$H$_{10}$O$_2$. The Δ-model estimates the difference between baseline and target line properties (arrow), which differ in level of theory (b → t), geometry ($R_b$ → $R_t$), and property ($E_b$ → $H_t$). Reprinted with permission from J. Chem. Theory Comput. 2015, ASAP. Copyright (2015) American Chemical Society.

The idea behind this method is best explained by a specific example. The G4MP2 enthalpies [$H_t(R_t)$] of C$_7$H$_{10}$O$_2$ isomers are estimated using PM7 electronic energies [$E_b(R_b)$] by
$$H_t(R_t) \approx \Delta_b^t(R_b) = E_b(R_b)+ \sum_{i=1}^N\alpha_i e^{-|R_i-R_b|/\sigma}$$
Here {$\alpha_i$} and $\sigma$ are parameters found by regression using a training set of $N$ molecules, and $|R_i-R_b|$ measures how dissimilar the target molecule is from training molecule $i$ (smaller values mean more similar molecules). The latter is described in more detail here, but I found it pretty interesting so I am summarizing it below.

A Coulomb matrix ($\mathbf{C}$) is constructed for each molecule
$$C_{kl}= \begin{cases}
 0.5 Z_k^{2.4} & \text{if } k=l\\
 Z_kZ_l/r_{kl} & \text{if } k \ne l
\end{cases}$$
where $r_{kl}$ is the distance between atoms $k$ and $l$ and $Z_k$ is the nuclear charge of atom $k$. The rows and columns are then sorted so that the diagonal elements are in descending order, and the similarity is computed by
$$|R_i-R_b| = \sum_{k,l} |C_{kl}^i - C_{kl}^b | $$
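
To make the recipe concrete, here is a minimal pure-Python sketch of the three steps as I read them: build the Coulomb matrix, sort it by its diagonal, and sum the absolute element-wise differences. The function names are my own, the coordinates are assumed to be in bohr, and both molecules are assumed to have the same number of atoms (true for isomers). Atoms of the same element have identical diagonal entries, so their relative order here simply follows the input order; the original work may break such ties differently.

```python
import math

def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5*Z_k^2.4 on the diagonal, Z_k*Z_l/r_kl elsewhere."""
    n = len(charges)
    C = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k == l:
                C[k][l] = 0.5 * charges[k] ** 2.4
            else:
                r_kl = math.dist(coords[k], coords[l])
                C[k][l] = charges[k] * charges[l] / r_kl
    return C

def sort_by_diagonal(C):
    """Permute rows and columns together so the diagonal is descending."""
    order = sorted(range(len(C)), key=lambda k: C[k][k], reverse=True)
    return [[C[k][l] for l in order] for k in order]

def manhattan_distance(C_i, C_b):
    """Sum over k,l of |C_kl^i - C_kl^b| for two equally sized matrices."""
    return sum(
        abs(a - b)
        for row_i, row_b in zip(C_i, C_b)
        for a, b in zip(row_i, row_b)
    )

# Toy example (not from the paper): two H2 geometries with different bond lengths.
C_i = sort_by_diagonal(coulomb_matrix([1, 1], [(0, 0, 0), (0, 0, 1.4)]))
C_b = sort_by_diagonal(coulomb_matrix([1, 1], [(0, 0, 0), (0, 0, 1.5)]))
print(manhattan_distance(C_i, C_b))
```

Sorting rows and columns with the same permutation is what makes the comparison independent of the order in which the atoms were listed in the input file.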
Using this approach and a training set of ($N$ =) 1000 molecules, the G4MP2 atomization enthalpies of 6095 constitutional isomers of C$_7$H$_{10}$O$_2$ can be reproduced with an MAE of 3.9 kcal/mol using PM7, compared to an MAE of 6.4 kcal/mol for uncorrected PM7. Using PBE or B3LYP/6-31G(2df,p), the MAE can be brought below 1 kcal/mol using a 1K training set.
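
For readers who want to see how the {$\alpha_i$} might be obtained, below is a rough sketch of the fitting step as kernel ridge regression on the difference between target and baseline values. Several things here are my assumptions rather than details from the paper: the kernel is taken to decay with distance as $e^{-d/\sigma}$, $\sigma$ is treated as a fixed input rather than being optimized, a tiny regularizer `lam` is added for numerical stability, and all function names and numbers are made up for illustration. The distances are assumed to be precomputed.

```python
import math

def laplacian_kernel(d, sigma):
    """Kernel value that decays with the distance d between two molecules."""
    return math.exp(-d / sigma)

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_delta_model(train_dists, baseline, target, sigma, lam=1e-10):
    """Return alpha solving (K + lam*I) alpha = target - baseline."""
    n = len(baseline)
    K = [
        [laplacian_kernel(train_dists[i][j], sigma) + (lam if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    y = [t - b for t, b in zip(target, baseline)]
    return solve_linear(K, y)

def predict(alpha, dists_to_train, baseline_value, sigma):
    """Baseline value plus the kernel-weighted correction."""
    correction = sum(
        a * laplacian_kernel(d, sigma) for a, d in zip(alpha, dists_to_train)
    )
    return baseline_value + correction

# Made-up toy data: 3 training molecules, pairwise distances, and energies.
dists = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
baseline = [10.0, 12.0, 11.0]   # cheap-method values (hypothetical)
target = [10.5, 11.0, 11.8]     # expensive-method values (hypothetical)
alpha = fit_delta_model(dists, baseline, target, sigma=1.0)
print(predict(alpha, dists[0], baseline[0], sigma=1.0))
```

With a negligible regularizer the model reproduces the training targets almost exactly; the interesting question is how well it does on the molecules outside the training set, which is what the MAEs quoted above measure.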

In another interesting application, the MAE of RHF/6-31G(d) relative to CCSD(T)/6-31G(d) atomization energies for the same set of molecules can be reduced from 3 to less than 1 kcal/mol using a 1K training set.

This is thus a very interesting approach for obtaining chemical accuracy using methods that are sufficiently fast to study thousands of molecules. The caveat is that about 1000 high-level calculations appear to be needed to train the method, but perhaps more generally applicable parameter sets can be found using, for example, functional group identification.

This work is licensed under a Creative Commons Attribution 4.0  

