Wednesday, March 29, 2023

eChem: A Notebook Exploration of Quantum Chemistry

Thomas Fransson, Mickael G. Delcey, Iulia Emilia Brumboiu, Manuel Hodecker, Xin Li, Zilvinas Rinkevicius, Andreas Dreuw, Young Min Rhee, and Patrick Norman (2023)
Highlighted by Jan Jensen

eChem is an e-book that mixes text and code to teach quantum chemistry. The code is based on VeloxChem, which is a Python-based open source quantum chemistry software package. 

While you can use VeloxChem to perform standard quantum chemical calculations, the really cool thing is that it gives you easy access to the basis set, integrals and orbitals, DFT grids and functionals, etc. This in turn allows you to write your own SCF or Kohn-Sham SCF procedure. It's sorta like Szabo and Ostlund updated and taken to the next level.

If you truly want to understand quantum chemistry this is the way to go! One of the co-authors, Xin Li, very kindly got it working on Google Colab, so it is very easy to start playing around with it yourself. 

This work is licensed under a Creative Commons Attribution 4.0 International License.

Monday, February 27, 2023

Prediction of High-Yielding Single-Step or Cascade Pericyclic Reactions for the Synthesis of Complex Synthetic Targets

Tsuyoshi Mita, Hideaki Takano, Hiroki Hayashi, Wataru Kanna, Yu Harabuchi, K. N. Houk, and Satoshi Maeda (2022)
Highlighted by Jan Jensen

This paper has been on my to-do list for a while, but Derek Lowe beat me to it (again). DFT-based reaction prediction has yet to make an impact on synthesis planning because there are many complexities we have yet to deal with efficiently, such as solvent effects in ionic mechanisms (very hard to predict accurately), catalysts and additives, chirality, and, well, just the sheer size of the reaction space.

While these things will be dealt with in good time, it makes sense to see if there is any low-hanging fruit that can be picked under the current limitations and that still has "real life" applications. This study did just that by choosing pericyclic reactions. These are very popular reactions in organic synthesis, but they require neither catalysts nor additives and have minimal solvent effects. Furthermore, some use cases of these reactions in natural product synthesis can be very hard to spot, even for seasoned synthetic chemists, and the authors show that their algorithm can predict them a priori. So this could potentially be a useful tool for specific types of synthesis planning.

Monday, January 30, 2023

Machine-Learning-Guided Discovery of Electrochemical Reactions

Andrew F. Zahrt, Yiming Mo, Kakasaheb Y. Nandiwale, Ron Shprints, Esther Heid, and Klavs F. Jensen (2022)
Highlighted by Jan Jensen

Derek Lowe has already highlighted the chemical aspects of this work, so here I focus on the machine learning, which is pretty interesting. The authors want to predict whether a molecule will react with 4-dicyanobenzene anion after it is oxidized at an anode. They have 141 data points, of which 42% show a reaction.

They tested several classification models using Morgan fingerprints as the molecular representation, but got an accuracy of only 60%. They then reasoned that the accuracy could be improved by using DFT features. However, rather than using molecular features they decided to use atomic features from an NBO analysis of the radical cation, the neutral molecule, and the radical anion. The feature vector was then tested on several data sets and shown to perform well.

The question is then how to combine the atomic feature vectors into a molecular representation for the reaction classification. The usual way is graph convolution, but that would require more than 141 data points to optimise. So instead they use graph2vec, which is an unsupervised learning method, so it is easy to create arbitrarily large training sets. Graph2vec is analogous to word2vec (or, more accurately, doc2vec), which creates vector representations of words by predicting context in text (i.e. words that often appear close to the word of interest). For graph2vec the context is subgraphs of the input graph.
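To make the "subgraphs as context" idea concrete, here is a minimal sketch of how a graph can be turned into a bag of subgraph "words". As far as I understand the graph2vec paper, the words are rooted subgraphs generated by Weisfeiler-Lehman relabelling, which is what the sketch does. The function name `wl_subgraph_tokens` and the toy ethanol graph are my own illustrations, the node labels are plain element symbols rather than the NBO-derived atomic features used in the paper, and the subsequent doc2vec-style embedding step is not shown.

```python
from collections import Counter


def wl_subgraph_tokens(adjacency, labels, iterations=2):
    """Return a bag of Weisfeiler-Lehman subgraph tokens for one graph.

    adjacency: dict mapping node -> list of neighbouring nodes
    labels:    dict mapping node -> initial label (here: element symbol)
    """
    current = dict(labels)
    tokens = Counter(current.values())  # depth-0 tokens are the raw labels
    for _ in range(iterations):
        updated = {}
        for node, neighbours in adjacency.items():
            # New label = own label plus the sorted labels of the neighbours,
            # so each round encodes a subgraph that is one bond larger.
            neighbour_labels = sorted(str(current[n]) for n in neighbours)
            updated[node] = f"{current[node]}({','.join(neighbour_labels)})"
        current = updated
        tokens.update(current.values())
    return tokens


# Toy heavy-atom graph of ethanol: C0-C1-O2 (hypothetical example input)
ethanol_adj = {0: [1], 1: [0, 2], 2: [1]}
ethanol_labels = {0: "C", 1: "C", 2: "O"}
ethanol_tokens = wl_subgraph_tokens(ethanol_adj, ethanol_labels, iterations=1)
```

Each graph then plays the role of a "document" and its tokens the role of "words", which is why no reaction labels are needed to train the embedding.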

The graph2vec embedder was then trained on 38k molecules (note that this requires 38k DFT calculations). Using this representation, the accuracy for the reaction classifier increased to 74%, which is a significant improvement compared to Morgan fingerprints. The classifier was then applied to the 38k molecules and 824 were predicted to be reactive. Twenty of these were selected for experimental validation and 16 (80%) were shown to be reactive. That's not a bad hit rate!
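To put these numbers in context, here is a small back-of-the-envelope sketch (mine, not from the paper). With 42% reactive molecules, a classifier that always predicts "no reaction" is already right 58% of the time, so 60% is barely better than that baseline while 74% is clearly above it. And with only 20 validation experiments the 80% hit rate has a wide uncertainty: a 95% Wilson score interval runs from roughly 58% to 92%. The function names are hypothetical, and treating the 20 experiments as independent draws is an assumption.

```python
import math


def majority_baseline(positive_fraction):
    """Accuracy of always predicting the more common class."""
    return max(positive_fraction, 1.0 - positive_fraction)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


baseline = majority_baseline(0.42)   # 42% of the 141 data points react
low, high = wilson_interval(16, 20)  # 16 of 20 predictions confirmed
print(f"baseline accuracy: {baseline:.2f}")
print(f"hit rate 95% interval: {low:.2f} to {high:.2f}")
```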

I was not aware of graph2vec before reading this paper and it seems like a very promising alternative to graph convolution, especially in the low data regime.

Friday, December 30, 2022

On the potentially transformative role of auxiliary-field quantum Monte Carlo in quantum chemistry: A highly accurate method for transition metals and beyond

James Shee, John L. Weber, David R. Reichman, Richard A. Friesner, and Shiwei Zhang (2022)
Highlighted by Jan Jensen

Figure 1 from this paper. (c) the authors

This paper highlights a big problem in the field of quantum chemistry and posits that a solution may be right around the corner. The problem is that we still can't routinely predict the thermochemistry of TM-containing compounds with the same degree of accuracy as we can for organic molecules. The main reason is that the former systems often have a high degree of non-dynamic correlation, which means that CCSD(T) often does not give reliable results. We can model the non-dynamic correlation with CASSCF, but there is no good way to compute the dynamic correlation based on a CASSCF wavefunction. So when different DFT functionals give wildly different predictions for your TM-compound there is no way to tell which method, if any, is the best.

This paper argues that phaseless auxiliary-field quantum Monte Carlo (ph-AFQMC) may be the solution to this problem. ph-AFQMC represents the ground state as a stochastic linear combination of Slater determinants mapped as open-ended random walks starting from a trial wavefunction. The method accounts for both non-dynamic and dynamic correlation and the paper argues that chemical accuracy can be achieved with a few hundred random walks, which can be run in parallel and on GPUs.

So what's missing? According to the authors some of the improvements needed include: more efficient ways of reaching the CBS limit, more efficient random walks and a general, automatable protocol to generate optimal trial wave functions. Let's hope these improvements will be made soon, so we can explore a much larger portion of chemical space with confidence.

Wednesday, November 30, 2022

Quantum Chemical Data Generation as Fill-In for Reliability Enhancement of Machine-Learning Reaction and Retrosynthesis Planning

Alessandra Toniato, Jan P. Unsleber, Alain C. Vaucher, Thomas Weymuth, Daniel Probst, Teodoro Laino, and Markus Reiher (2022)
Highlighted by Jan Jensen

Part of Figure 7 from the paper. (c) The authors 2022. Reproduced under the CC BY-NC-ND 4.0 license

This is the first paper I have seen on combining automated QM-reaction prediction with ML-based retrosynthesis prediction. The idea itself is simple: for ML-predictions with low confidence (i.e. few examples in the training data) can automated QM-reaction prediction be used to check whether the proposed reaction is feasible, i.e. whether it is the reaction path with the lowest barrier?  If so, it could also be used to augment the training data.

The paper considers two examples using the Chemoton 2.0 method: one where the reaction is an elementary reaction and one where there are two steps (the Friedel-Crafts reaction shown above). It works pretty well for the former, but runs into problems for the latter.

One problem for non-elementary reactions is that one can't predict which atoms are chemically active from the overall reaction. Chemoton therefore must consider reactions involving all atom pairs and preferably more pairs of atoms simultaneously. The number of required calculations quickly gets out of hand and the authors conclude that "For such multistep reactions, new methods to identify the individual elementary steps will have to be developed to maintain the exploration within tight bounds, and hence, within reasonable computing time." 
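To get a feel for how quickly this grows, here is a rough counting sketch. It simply counts all atom pairs and all combinations of several pairs; it is an upper-bound illustration under my own assumptions (no symmetry, no chemical filtering, pairs may share atoms) and not how Chemoton actually enumerates trial reactions. The function name `count_trial_reactions` is hypothetical.

```python
from math import comb


def count_trial_reactions(n_atoms, max_simultaneous_pairs=2):
    """Count ways to pick k atom pairs at once, for k = 1..max_simultaneous_pairs."""
    n_pairs = comb(n_atoms, 2)  # every unordered pair of atoms
    return {k: comb(n_pairs, k) for k in range(1, max_simultaneous_pairs + 1)}


# A 20-atom system: 190 single pairs, but already 17955 pairs of pairs.
counts = count_trial_reactions(20)
print(counts)
```

Since each trial typically means at least one constrained optimisation or path search, the quadratic growth in pairs and the much steeper growth in pair combinations is what makes an untargeted search expensive.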

However, even when they specify the two elementary steps for the Friedel-Crafts reaction, their method fails to find the second elementary step. The reason for this failure is not clear but could be due to the semiempirical xTB method used for efficiency.

So the paper presents an interesting and important challenge to the computational chemistry community. I wish more papers did this.

Monday, October 31, 2022

Semiempirical Hamiltonians learned from data can have accuracy comparable to Density Functional Theory

Frank Hu, Francis He, David J. Yaron (2022) 
Highlighted by Jan Jensen

Figure 7 from the paper. (c) The authors 2022. Reproduced under the CC BY-NC-ND licence

This paper uses ML techniques and tools (specifically PyTorch) to fit DFTB parameters, which results in a semiempirical quantum method (SQM) that has an accuracy similar to DFT. The advantage of such a physics-based method over a pure ML-based one is that it is likely to be more transferable and requires much less training data. This should make it much easier to extend to other elements and new molecular properties, such as barriers.

Parameterising SQMs is notoriously difficult as the molecular properties depend exponentially on many of the parameters. As a result, most SQMs used today have been parameterised by hand. The paper presents several methodological tricks to automate the fitting.

One is the use of high-order polynomial spline functions to describe how the Hamiltonian elements depend on the fitting parameters. The functions allow the computation not only of the first derivative needed for back propagation, but also of higher-order derivatives, which are used for regularisation to avoid overfitting and to keep the parameters physically reasonable. Finally, the SCF and training loops are inverted so that the charge fluctuations needed for the Fock operator are updated based on the current model parameters every 10 epochs. This enables computationally efficient back propagation during training, which is important because the training set is on the order of 100k.
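The inverted loop structure can be illustrated with a deliberately tiny toy problem. Nothing here is DFTB: the scalar parameter `p`, the "charge" `q`, the fixed-point equation and all function names are hypothetical stand-ins I made up to show the schedule. The self-consistent quantity is solved for only every 10 epochs and is treated as a constant in between, so the gradient step never has to differentiate through the self-consistent solve.

```python
def solve_self_consistent_charge(p, iterations=50):
    """Toy stand-in for an SCF solve: iterate q = 0.2 * p * (1 - q) to a fixed point."""
    q = 0.0
    for _ in range(iterations):
        q = 0.2 * p * (1.0 - q)
    return q


def train(target=1.0, epochs=200, refresh_every=10, learning_rate=0.1):
    """Fit p so that the toy prediction p + q matches the target."""
    p = 0.0
    q = 0.0
    for epoch in range(epochs):
        if epoch % refresh_every == 0:
            # Expensive self-consistent step, done only occasionally.
            q = solve_self_consistent_charge(p)
        residual = (p + q) - target          # q is held fixed here
        p -= learning_rate * 2.0 * residual  # gradient of residual**2 w.r.t. p
    return p, solve_self_consistent_charge(p)


p_fit, q_fit = train()
print(p_fit, q_fit, p_fit + q_fit)
```

In this toy case the frozen-charge updates still converge to the fully self-consistent answer, because each refresh pulls the charge back in line with the current parameter.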

Another neat feature is that the final model is simply a parameter file (SKF file), which can be read by most DFTB programs. So there is nothing new for the user to implement. However, currently the implementation is only for CNHO.

Friday, September 30, 2022

Active Learning for Small Molecule pKa Regression; a Long Way To Go

Paul G. Francoeur, Daniel Peñaherrera, and David R. Koes (2022)
Highlighted by Jan Jensen

Parts of Figures 5 and 6. (c) The authors 2022. Reproduced under the CC-BY licence

One approach to active learning is to grow the training set with molecules for which the current model has the highest uncertainties. However, according to this study, this approach does not seem to work for small-molecule pKa prediction, where active learning and random selection give the same results (within the relatively high standard deviations) for three different uncertainty estimates.
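The two selection strategies being compared can be sketched as follows. This is a minimal illustration under my own assumptions: the uncertainty is taken to be the spread of an ensemble's predictions, which is only one possible estimate and not necessarily one of the three used in the paper, and the function names and the toy ensemble are hypothetical.

```python
import random
from statistics import pstdev


def select_most_uncertain(pool, ensemble_predict, batch_size):
    """Return the batch_size pool items whose ensemble predictions disagree most."""
    scored = [(pstdev(ensemble_predict(x)), x) for x in pool]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [x for _, x in scored[:batch_size]]


def select_random(pool, batch_size, seed=0):
    """Baseline: pick batch_size pool items uniformly at random."""
    return random.Random(seed).sample(pool, batch_size)


def toy_ensemble(x):
    """Three toy models that agree near x = 0 and diverge further away."""
    return [0.9 * x, 1.0 * x, 1.1 * x]


pool = [-3.0, -0.5, 0.0, 1.0, 2.0]
picked = select_most_uncertain(pool, toy_ensemble, batch_size=2)
baseline_pick = select_random(pool, batch_size=2)
print(picked, baseline_pick)
```

Swapping the score in `select_most_uncertain` for the actual prediction error would give the idealised case where uncertainty and error correlate perfectly.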

The authors show that there are molecules in the pool that can increase the initial accuracy drastically, but that the uncertainties don't seem to help identify these molecules. The green curve above is obtained by exhaustively training a new model for every molecule in the pool during each step of the active learning loop and selecting the molecule that gives the largest increase in accuracy for the test set. Note that the accuracy decreases towards the end, meaning that including some molecules in the training set diminishes the performance.
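The exhaustive selection behind the green curve can be sketched like this. The sketch assumes a toy one-dimensional nearest-neighbour "model" and made-up data points purely so that it runs; the function names are hypothetical and the paper's model is of course a neural network trained on molecules. Note that the selection peeks at the test set, so it is an oracle that gives an upper bound rather than a usable strategy, and that it needs one model fit per pool molecule per step.

```python
def predict_1nn(train_set, x):
    """Predict y at x from the nearest training point; train_set is a list of (x, y)."""
    return min(train_set, key=lambda point: abs(point[0] - x))[1]


def mean_abs_error(train_set, eval_set):
    """Mean absolute error of the 1-NN toy model on eval_set."""
    return sum(abs(predict_1nn(train_set, x) - y) for x, y in eval_set) / len(eval_set)


def greedy_oracle_step(train_set, pool, eval_set):
    """Fit once per pool candidate and return the one giving the lowest error."""
    return min(pool, key=lambda cand: mean_abs_error(train_set + [cand], eval_set))


def greedy_oracle_curve(train_set, pool, eval_set, steps):
    """Error after each greedy addition, starting from the initial training set."""
    train_set, pool = list(train_set), list(pool)
    errors = [mean_abs_error(train_set, eval_set)]
    for _ in range(steps):
        best = greedy_oracle_step(train_set, pool, eval_set)
        pool.remove(best)
        train_set.append(best)
        errors.append(mean_abs_error(train_set, eval_set))
    return errors


# Toy data on y = x**2 (illustrative only)
start = [(0.0, 0.0)]
candidates = [(1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]
held_out = [(2.1, 4.41), (2.9, 8.41)]
errors = greedy_oracle_curve(start, candidates, held_out, steps=2)
print(errors)
```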

The authors offer the following explanation for their observations: "We propose that the reason active learning failed in this pKa prediction task is that all of the molecules are informative."

That's certainly not hard to imagine given the small size of the initial training set (50). It would have been very instructive to see the distribution of uncertainties for the initial models. Does every molecule have roughly the same (high) uncertainty? If so, the uncertainties would indeed not be informative.

Also, uncertainties only correlate with (random) errors on average. The authors did try adding molecules in batches, but the batch size was only 10. 

It would have been interesting to see the performance if one used the actual error, rather than the uncertainties, to select molecules. That would test the case where uncertainties correlate perfectly with the errors.
