Highlighted by Jan Jensen
Figure 1 from the paper
The paper presents a method to estimate DFT or CCSD(T) energies (computed using large basis sets) based only on HF/cc-pVDZ densities and energies. In order to avoid overfitting, the method must also estimate the corresponding DFT or CCSD densities. The method is trained and validated on subsets of the QM9 data set (i.e. on relatively small molecules): it is first trained on DFT data (using 89K molecules) and then retrained on CC data for a smaller subset (3.6K molecules). The input density is evaluated on a 3D grid that is big enough to accommodate the largest molecule in the data set, so a new model would have to be trained for significantly larger molecules.
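To make the grid-size limitation concrete, here is a minimal sketch of checking whether a molecule fits inside a fixed cubic grid. The box length, the padding, the coordinates, and the function name `fits_in_grid` are all hypothetical values chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def fits_in_grid(coords, box_length, padding):
    """Return True if the atoms, plus padding on every side, fit in a cubic box.

    coords: (n_atoms, 3) Cartesian coordinates in angstrom.
    box_length, padding: hypothetical grid parameters in angstrom.
    """
    coords = np.asarray(coords, dtype=float)
    # Span of the molecule along each Cartesian axis
    extent = coords.max(axis=0) - coords.min(axis=0)
    return bool(np.all(extent + 2.0 * padding <= box_length))

# Hypothetical grid parameters (assumptions, not values from the paper)
BOX_LENGTH = 16.0
PADDING = 3.0

small = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [2.3, 0.5, 0.0]]
large = [[0.0, 0.0, 0.0], [12.0, 0.0, 0.0]]

small_fits = fits_in_grid(small, BOX_LENGTH, PADDING)
large_fits = fits_in_grid(large, BOX_LENGTH, PADDING)
```

In this sketch a molecule that fails the check could not be represented on the fixed grid at all, which is why a model tied to that grid would need to be retrained for larger systems.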
This “physical machine learning in Quantum Chemistry” (PML-QC) model reaches a mean absolute error in energies of molecules with up to eight non-hydrogen atoms as low as 0.9 kcal/mol relative to CCSD(T) values, which is quite impressive. In fact, the authors speculate that
“With ML, it may become not required that an accurate quantum chemical method works fast enough for every new molecule that an end user may be interested in. Instead, the focus shifts to generating highly accurate results only for a finite dataset to be used for training, while the efficiency in practical applications is to be achieved via improvements in DNNs to make them faster and more accurate.”
Much will depend on how much the number of outliers can be reduced. For example, for PML-QC_{DFT}, ca. 5% of molecules have errors greater than 2.6 kcal/mol, and this effect can be magnified for relative energies if the individual errors have opposite signs.
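The point about relative energies can be illustrated with a little arithmetic. The sketch below assumes two molecules whose predicted energies each carry a signed error of magnitude 2.6 kcal/mol (the threshold quoted above, used here only as an illustrative value); the function name is hypothetical.

```python
def relative_energy_error(err_a, err_b):
    """Error in the relative energy E(A) - E(B), given the signed
    errors (kcal/mol) in the two predicted absolute energies."""
    return err_a - err_b

# Same-sign errors cancel in the energy difference
same_sign = relative_energy_error(2.6, 2.6)

# Opposite-sign errors add up in the energy difference
opposite_sign = relative_energy_error(2.6, -2.6)
```

With same-sign errors the relative energy is unaffected, whereas with opposite-sign errors the error in the relative energy is 5.2 kcal/mol, twice the individual error.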