Saturday, August 31, 2019

Physical machine learning outperforms "human learning" in Quantum Chemistry

Highlighted by Jan Jensen


Figure 1 from the paper

The paper presents a method to estimate DFT or CCSD(T) energies (computed using large basis sets) based only on HF/cc-pVDZ densities and energies. In order to avoid overfitting, the method must also estimate the corresponding DFT or CCSD densities. The method is trained and validated on a subset of the QM9 data set (i.e. on relative small molecules). It is first trained on DFT data (using 89K molecules) and then retrained on CC data for a smaller subset (3.6K molecules), both being subsets of the QM9 data set (i.e. relatively small molecules). The input density is evaluated in a 3D grid that is big enough to accommodate the largest molecule in the data set, so a new model would have to be trained for significantly larger molecules.

This “physical machine learning in Quantum Chemistry” (PML-QC) model reaches a mean absolute error in energies of molecules with up to eight non-hydrogen atoms as low as 0.9 kcal/mol relative to CCSD(T) values, which is quite impressive. In fact the authors speculate that 
With ML, it may become not required that an accurate quantum chemical method works fast enough for every new molecule that an end user may be interested in. Instead, the focus shifts to generating highly accurate results only for a finite dataset to be used for training, while the efficiency in practical applications is to be achieved via improvements in DNNs to make them faster and more accurate.
Much will depend how much the number of outliers can be reduced. For example, for PML-QCDFT ca 5% of molecules have errors greater than 2.6 kcal/mol and this effect can be magnified for relative energies if the individual errors have opposite sign.

Wednesday, July 31, 2019

Popular Integration Grids Can Result in Large Errors in DFT-Computed Free Energies

Highlighted by Jan Jensen

 Figure 1B from the paper (CC BY-NC-ND 4.0)

 This paper has already been highlighted here and here, so I'll just briefly summarise.

The grid used for the numerical integration in DFT calculations is defined relative to the Cartesian axes, so rotating the molecule will change the integration grid and, hence, the energy. This has been known for some time and, f.eks. Gaussian09 uses a default grid size (Fine, 75,302) where the effect on the electronic energy variation is usually negligible.

Bootsma and Wheeler show that the vibrational entropy and, hence, the free energy is significantly more sensitive to grid size than the electronic energy. Using the Fine grid, the differences in relative free energy changes can be as large as 4 kcal/mol, which could significantly change conclusion regarding mechanisms, etc. The effect comes from the variation in low frequency vibrational modes and the effect can be reduced a little by scaling these frequencies. 

However, the errors really only become acceptable when using the UltraFine grid size, which is the default in Gaussian16, especially combined with frequency scaling (which one should do anyway to get consistent results). If you are using Gaussian09 or some other quantum program to compute relative free energies it is definitely a good idea to look at the default grid size and perform some tests.

Note that if you want to perform such tests yourself, you need to re-optimise the molecule after you rotate it because the gradient is also affected by the rotation.

Thursday, July 4, 2019

Combining the Power of J Coupling and DP4 Analysis on Stereochemical Assignments: The J-DP4 Methods

Grimblat, N.; Gavín, J. A.; Hernández Daranas, A.; Sarotti, A. M., Org. Letters 2019, 21, 4003-4007
Contributed by Steven Bachrach
Reposted from Computational Organic Chemistry with permission

I have written quite a number of posts on using quantum mechanics computations to predict NMR spectra that can aid in identifying chemical structure. Perhaps the most robust technique is Goodman’s DP4 method (post), which has seen some recent revisions (updated DP4DP4+). I have also posted on the use of computed coupling constants (posts).

Grimblat, Gavín, Daranas and Sarotti have now combined these two approaches, using computed 1H and 13C chemical shifts and 3JHH coupling constants with the DP4 framework to predict chemical structure.1

They describe two different approaches to incorporate coupling constants:
  • dJ-DP4 (direct method) incorporates the coupling constants into a new probability function, using the coupling constants in an analogous way as chemical shifts. This requires explicit computation of all chemical shifts and 3JHH coupling constants for all low-energy conformations.
  • iJ-DP4 (indirect method) uses the experimental coupling constants to set conformational constraints thereby reducing the number of total conformations that need be sampled. Thus, large values of the coupling constant (3JHH > 8 Hz) selects conformations with coplanar hydrogens, while small values (3JHH < 4 Hz) selects conformations with perpendicular hydrogens. Other values are ignored. Typically, only one or two coupling constants are used to select the viable conformations.

The authors test these two variants on 69 molecules. The original DP4 method predicted the correct stereoisomer for 75% of the examples, while dJ-DP4 correct identifies 96% of the cases. As a test of the indirect method, they examined marilzabicycloallenes A and B (1 and 2). DP4 predicts the correct stereoisomer with only 3.1% (1) or <0.1% (2) probability. dJ-DP4 predicts the correct isomer for 1 with 99.9% probability and 97.6% probability for 2. The advantage of iJ-DP4 is that using one coupling constant reduces the number of conformations that must be computed by 84%, yet maintains a probability of getting the correct assignment at 99.2% or better. Using two coupling constants to constrain conformations means that only 7% of all of the conformations need to be samples, and the predictive power is maintained.

1

2
Both of these new methods clearly deserve further application.


References

1. Grimblat, N.; Gavín, J. A.; Hernández Daranas, A.; Sarotti, A. M., “Combining the Power of J Coupling and DP4 Analysis on Stereochemical Assignments: The J-DP4 Methods.” Org. Letters 201921, 4003-4007, DOI: 10.1021/acs.orglett.9b01193.


InChIs

1: InChI=1S/C15H21Br2ClO4/c1-8-15(20)14-6-10(17)12(19)7-11(18)13(22-14)5-9(21-8)3-2-4-16/h3-4,8-15,19-20H,5-7H2,1H3/t2-,8-,9+,10-,11+,12+,13+,14+,15-/m0/s1
InChIKey=APNVVMOUATXTFG-NTSAAJDMSA-N
2: InChI=1S/C15H21Br2ClO4/c1-8-15(20)14-6-10(17)12(19)7-11(18)13(22-14)5-9(21-8)3-2-4-16/h3-4,8-15,19-20H,5-7H2,1H3/t2-,8-,9-,10-,11+,12+,13+,14+,15-/m0/s1
InChIKey=APNVVMOUATXTFG-SSBNIETDSA-N



'
This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.

Wednesday, June 26, 2019

The logic of translating chemical knowledge into machine-processable forms: A modern playground for physical-organic chemistry

Karol Molga, Ewa P. Gajewska, Sara Szymkuć, and Bartosz A. Grzybowski (2019)
Highlighted by Jan Jensen
Figure 11 from the paper (c) RSC

This paper offers a, to me, fascinating "look behind the scenes" of Chematica. At the core this program has 75,000 handcrafted reaction rules (SMARTS and Reaction SMARTS strings as shown in the above figure) extracted from the literature (which took over a decade). The authors estimate that there ca 3000-5000 new reaction classes/types appearing in the literature each years and "that there are on the order of 100,000 distinct reaction classes constituting the body of modern organic chemistry. So their work is almost done :).

The paper does a really excellent job of outlining the challenges involved in constructing these rules and present several cases where the rules must be augmented by ML, MM, and Hückel calculations in order to take non-local structural (e.g. strain and steric hindrance) and electronic effects (e.g. on regioselectivity) into account. Such calculations must be done on the millisecond time scale as many thousand intermediates must be inspected during a retrosynthetic search. At the same time they must be very accurate as inaccuracies accumulate with each step on the retrosynthetic path.

It will be very interesting to see if purely ML-based alternatives can beat this approach!


This work is licensed under a Creative Commons Attribution 4.0 International License.

Wednesday, June 12, 2019

Vibrational Signatures of Chirality Recognition Between α-Pinene and Alcohols for Theory Benchmarking

Medel, R.; Stelbrink, C.; Suhm, M. A., Angew. Chem. Int. Ed. 2019, 58, 8177
Contributed by Steven Bachrach
Reposted from Computational Organic Chemistry with permission

Can vibrational spectroscopy be used to identify stereoisomers? Medel, Stelbrink, and Suhm have examined the vibrational spectra of (+)- and (-)-α-pinene, (±)-1, in the presence of four different chiral terpenes 2-5.1 They recorded gas phase spectra by thermal expansion of a chiral α-pinene with each chiral terpene.


For the complex of 4 with (+)-1 or (-)-1 and 5 with (+)-1 or (-)-1, the OH vibrational frequency is identical for the two different stereoisomers. However, the OH vibrational frequencies differ by 2 cm-1 with 3, and the complex of 3/(+)-1 displays two different OH stretches that differ by 11 cm-1. And in the case of the complex of α-pinene with 2, the OH vibrational frequencies of the two different stereoisomers differ by 11 cm-1!

The B3LYP-D3(BJ)/def2-TZVP optimized geometry of the 2/(+)-1 and 2/(-)-1 complexes are shown in Figure 2, and some subtle differences in sterics and dispersion give rise to the different vibrational frequencies.

2/(+)-1

2/(-)-1
Figure 2. B3LYP-D3(BJ)/def2-TZVP optimized geometry of the 2/(+)-1 and 2/(-)-1

Of interest to readers of this blog will be the DFT study of these complexes. The authors used three different well-known methods – B3LYP-D3(BJ)/def2-TZVP, M06-2x/def2-TZVP, and ωB97X-D/def2-TZVP – to compute structures and (most importantly) predict the vibrational frequencies. Interestingly, M06-2x/def2-TZVP and ωB97X-D/ def2-TZVP both failed to predict the vibrational frequency difference between the complexes with the two stereoisomers of α-pinene. However, B3LYP-D3(BJ)/def2-TZVP performed extremely well, with a mean average error (MAE) of only 1.9 cm-1 for the four different terpenes. Using this functional and the larger may-cc-pvtz basis set reduced the MAE to 1.5 cm-1 with the largest error of only 2.5 cm-1.

As the authors note, these complexes provide some fertile ground for further experimental and computational study and benchmarking.


Reference

1. Medel, R.; Stelbrink, C.; Suhm, M. A., “Vibrational Signatures of Chirality Recognition Between α-Pinene and Alcohols for Theory Benchmarking.” Angew. Chem. Int. Ed. 201958, 8177-8181, DOI: 10.1002/anie.201901687.


InChIs

(-)-1, (-)-α-pinene: InChI=1S/C10H16/c1-7-4-5-8-6-9(7)10(8,2)3/h4,8-9H,5-6H2,1-3H3/t8-,9-/m0/s1
InChIKey=GRWFGVWFFZKLTI-IUCAKERBSA-N
(+)-1, (-)-α-pinene: InChI=1S/C10H16/c1-7-4-5-8-6-9(7)10(8,2)3/h4,8-9H,5-6H2,1-3H3/t8-,9-/m1/s1
InChIKey=GRWFGVWFFZKLTI-RKDXNWHRSA-N
2, (-)borneol: InChI=1S/C10H18O/c1-9(2)7-4-5-10(9,3)8(11)6-7/h7-8,11H,4-6H2,1-3H3/t7-,8+,10+/m0/s1
InChiKey=DTGKSKDOIYIVQL-QXFUBDJGSA-N
3, (+)-fenchol: InChI=1S/C10H18O/c1-9(2)7-4-5-10(3,6-7)8(9)11/h7-8,11H,4-6H2,1-3H3/t7-,8-,10+/m0/s1
InChIKey=IAIHUHQCLTYTSF-OYNCUSHFSA-N
4, (-1)-isopinocampheol: InChI=1S/C10H18O/c1-6-8-4-7(5-9(6)11)10(8,2)3/h6-9,11H,4-5H2,1-3H3/t6-,7+,8-,9-/m1/s1
InChIKey=REPVLJRCJUVQFA-BZNPZCIMSA-N
5, (1S)-1-phenylethanol: InChI=1S/C8H10O/c1-7(9)8-5-3-2-4-6-8/h2-7,9H,1H3/t7-/m0/s1
InChIKey=WAPNOHKVXSQRPX-ZETCQYMHSA-N



'
This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.

Wednesday, May 29, 2019

Activity-Based Screening of Homogeneous Catalysts through the Rapid Assessment of Theoretically Derived Turnover Frequencies

Matthew D. Wodrich, Boodsarin Sawatlon, Ephrath Solel, Sebastian Kozuch, and Clémence Corminboeuf (2019)
Highlighted by Jan Jensen

Figure 1. Adapted from images in the preprint posted under the CC-BY-NC-ND 4.0 license

LFESRs linearly relate the reaction energies of barrier heights to a single reaction energy. In this work the all the barriers and reaction energies in Figure 1a is computed via the free energy difference between 1 and 4 [ΔG(4)]

The volcano plot is then obtained by plotting the largest free energy difference in the cycle as a function of ΔG(4). In this particular case that is the barrier between 1 and 4 when ΔG(4) is small and the energy difference between 2 and 3 when  ΔG(4) is large. The optimum catalysts is the one with a ΔG(4) for which these two lines meet and one can screen for such catalyst by computing a single free energy difference.

One problem with thus approach is that the largest free energy difference in the cycle is not always directly related to the turn over frequency (TOF), which is what is measured experimentally. In principle, the TOF should be determined by microkinetic modeling for each value of ΔG(4) to find the maximum TOF. But in this work TOFs are efficiently estimated by the energy span model, which basically considers all energy differences in the cycle (e.g. also between 1 and 3).

Using the TOF plot different energy differences between important and the optimum ΔG(4) value decreases (Figure 1b). The points in Figure 1b show the corresponding TOFs computed without the LFESRs and demonstrate the accuracy of this approach.

Monday, April 29, 2019

Exploration of Chemical Compound, Conformer, and Reaction Space with Meta-Dynamics Simulations Based on Tight-Binding Quantum Chemical Calculations

Highlighted by Jan Jensen


The paper describes a new way to search for conformers, chemical reactions, and estimate barriers using the semiempirical GFNn-XTB method using meta-dynamics. A force term is included that scales exponentially with the Cartesian RMSD from previously found structures, thereby forcing the MD explore new areas of phase space. For simulations with more than one molecule it is necessary to add a constraining potential so that the RMSD cannot be increased simply by increasing the distance between molecules. Each individual MD can be relatively short and most of the CPU time is actually spend on energy minimising the snapshots that are saved.

The results depend on a few hyperparameters, so several MD simulations with different values are run in parallel. Because of the extra force the temperature is also a hyperparameters so the method doesn't necessarily tell you what reactions are most likely to occur at, say, 300K.

The conformational search is tested on 22 (mostly) organic molecules and includes the GFN2-xTB energies of the lowest energy conformer for each molecules. This is a valuable benchmark set for other conformational search algorithms designed to find the global minimum.


This work is licensed under a Creative Commons Attribution 4.0 International License.