Tuesday, May 30, 2023

Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability

Adapted from Figures 1 and 3 in the paper. (c) 2023 the authors 

While this fascinating paper is not about chemistry, it could easily be applied to chemical problems without further modification (except for graph convolution), so I feel justified in highlighting it here.

The paper introduces brain-inspired modular training (BIMT), which leads to relatively simple NNs that are easier to interpret. "Brain-inspired" comes from the fact that the brain is not fully connected like most NNs: it is a 3D entity with physical connections (axons), and longer axons mean slower communication between neurons. The idea is to enforce this modularity during training by assigning positions to individual nodes and introducing a length-dependent penalty in the loss function (in addition to conventional L1 regularisation). This is combined with a swap operation that can swap neurons to decrease the loss.
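
To make the length-dependent penalty concrete, here is a minimal sketch for a single layer. The functional form (each |weight| scaled by a constant L1 term plus a term proportional to the connection length), the 2D positions, and the names connection_cost, lam_l1 and lam_len are my own assumptions for illustration and may differ from what the paper actually uses. The swap is modelled simply as exchanging the positions of two neurons in the same layer.

```python
import math

def connection_cost(weights, pos_in, pos_out, lam_l1=1.0, lam_len=1.0):
    """Illustrative regularisation term for one layer (assumed form, not the paper's code).

    weights[j][i] is the weight from input neuron i to output neuron j;
    pos_in and pos_out hold the (x, y) positions assigned to the neurons.
    """
    cost = 0.0
    for j, row in enumerate(weights):
        for i, w in enumerate(row):
            length = math.dist(pos_in[i], pos_out[j])
            # L1 term plus a penalty that grows with the connection length
            cost += abs(w) * (lam_l1 + lam_len * length)
    return cost

# Two input and two output neurons on a unit grid; the only non-zero
# weights run along the diagonals, i.e. they are long connections.
pos_in = [(0.0, 0.0), (1.0, 0.0)]
pos_out = [(0.0, 1.0), (1.0, 1.0)]
weights = [[0.0, 2.0], [3.0, 0.0]]

cost_before = connection_cost(weights, pos_in, pos_out)

# Swap operation: exchange the positions of the two output neurons.
# The function computed by the layer is unchanged, only the layout differs.
pos_swapped = [pos_out[1], pos_out[0]]
cost_after = connection_cost(weights, pos_in, pos_swapped)

print(cost_before, cost_after)
```

In this toy layout the swap turns both diagonal connections into vertical ones, so the penalty drops while the weights themselves are untouched.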

The result is much simpler networks that, at least for relatively simple objectives, are intuitive and easier to interpret as you can see from the figure above. 

The code is available here (Google Colab version). It would be very interesting to apply this to chemical problems!

This work is licensed under a Creative Commons Attribution 4.0 International License.

Sunday, April 30, 2023

Virtual Ligand Strategy in Transition Metal Catalysis Toward Highly Efficient Elucidation of Reaction Mechanisms and Computational Catalyst Design

Wataru Matsuoka, Yu Harabuchi, and Satoshi Maeda (2023)
Highlighted by Jan Jensen

This perspective shows how an old computational tool can be adapted to serve a new purpose. When I started in compchem, changing, say, a few F atoms to H atoms in a molecule often made the difference between waiting a few days and a few weeks for the calculations to finish. People therefore developed pseudo H atoms that could mimic the electronic effect of larger atoms or even entire functional groups. Some of these methods were later adapted to serve as boundary atoms in QM/MM calculations, and now they have found a new use in screening for ligands in organometallic catalysts.

The use of pseudoatoms to model such ligands not only speeds up the individual calculations but also maps the chemical space onto just two dimensions, electronic and steric, which allows the space to be searched more efficiently. Once the desired combination of electronics and sterics is found, corresponding real ligands are identified by another, much faster, screen of commercially available or synthetically accessible ligands.

The authors use this approach to identify two phosphine ligands for a chemoselective Suzuki–Miyaura cross-coupling catalyst, complete with experimental verification.

The downside is that the parameterisation of these "virtual ligands" is a bit involved and very ligand-dependent. But it is an interesting approach nonetheless.


Wednesday, March 29, 2023

eChem: A Notebook Exploration of Quantum Chemistry

Thomas Fransson, Mickael G. Delcey, Iulia Emilia Brumboiu, Manuel Hodecker, Xin Li, Zilvinas Rinkevicius, Andreas Dreuw, Young Min Rhee, and Patrick Norman (2023)
Highlighted by Jan Jensen

eChem is an e-book that mixes text and code to teach quantum chemistry. The code is based on VeloxChem, which is a Python-based open source quantum chemistry software package. 

While you can use VeloxChem to perform standard quantum chemical calculations, the really cool thing is that it gives you easy access to the basis set, integrals and orbitals, DFT grids and functionals, etc. This in turn allows you to write your own SCF or Kohn-Sham SCF procedure. It's sorta like Szabo and Ostlund updated and taken to the next level.

If you truly want to understand quantum chemistry this is the way to go! One of the co-authors, Xin Li, very kindly got it working on Google Colab, so it is very easy to start playing around with it yourself. 


Monday, February 27, 2023

Prediction of High-Yielding Single-Step or Cascade Pericyclic Reactions for the Synthesis of Complex Synthetic Targets

Tsuyoshi Mita, Hideaki Takano, Hiroki Hayashi, Wataru Kanna, Yu Harabuchi, K. N. Houk, and Satoshi Maeda (2022)
Highlighted by Jan Jensen

This paper has been on my to-do list for a while, but Derek Lowe beat me to it (again). DFT-based reaction prediction has yet to make an impact on synthesis planning because there are many complexities we have yet to deal with efficiently, such as solvent effects in ionic mechanisms (very hard to predict accurately), catalysts and additives, chirality, and, well, just the sheer size of the reaction space.

While these things will be dealt with in good time, it makes sense to see if there is any low-hanging fruit that can be picked under the current limitations and that still has "real life" applications. This study did just that by choosing pericyclic reactions. These are very popular reactions in organic synthesis, but require neither catalysts nor additives and have minimal solvent effects. Furthermore, some use cases of these reactions in natural product synthesis can be very hard to spot, even for seasoned synthetic chemists, and the authors show that their algorithm can predict them a priori. So this could potentially be a useful tool for specific types of synthesis planning.


Monday, January 30, 2023

Machine-Learning-Guided Discovery of Electrochemical Reactions

Andrew F. Zahrt, Yiming Mo, Kakasaheb Y. Nandiwale, Ron Shprints, Esther Heid, and Klavs F. Jensen (2022)
Highlighted by Jan Jensen

Derek Lowe has highlighted the chemical aspects of this work already, so here I focus on the machine learning, which is pretty interesting. The authors want to predict whether a molecule will react with the 1,4-dicyanobenzene anion after the molecule is oxidized at the anode. They have 141 data points, of which 42% show a reaction.

They tested several classification models using Morgan fingerprints as the molecular representation, but got an accuracy of only 60%. They then reasoned that the accuracy could be improved by using DFT features. However, rather than using molecular features, they decided to use atomic features from an NBO analysis of the radical cation, the neutral molecule, and the radical anion. The feature vector was then tested on several data sets and shown to perform well.

The question is then how to combine the atomic feature vectors into a molecular representation for the reaction classification. The usual way is graph convolution, but that would require more than 141 data points to optimise. So instead they use graph2vec, which is an unsupervised learning method, so it is easy to create arbitrarily large training sets. Graph2vec is analogous to word2vec (or, more accurately, doc2vec), which creates vector representations of words by predicting context in text (i.e. words that often appear close to the word of interest). For graph2vec the context is subgraphs of the input graph.
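
To illustrate what "subgraphs as context" can look like in practice, here is a minimal sketch of a Weisfeiler-Lehman style relabelling, which is the kind of procedure graph2vec-type methods use to turn a graph into a bag of rooted-subgraph "words". The function name wl_subgraph_tokens and the use of plain element symbols as node labels are assumptions made for illustration. The paper uses NBO-derived atomic features, and how those are turned into discrete labels is not covered here.

```python
def wl_subgraph_tokens(adjacency, labels, iterations=2):
    """Return the bag of rooted-subgraph tokens for one graph (illustrative sketch).

    adjacency: dict mapping each node to a list of its neighbours
    labels:    dict mapping each node to an initial discrete label
    """
    current = {node: str(label) for node, label in labels.items()}
    tokens = list(current.values())  # depth-0 subgraphs: the nodes themselves
    for _ in range(iterations):
        updated = {}
        for node in adjacency:
            # A node's new label encodes its old label and its sorted neighbourhood
            neighbourhood = sorted(current[nbr] for nbr in adjacency[node])
            updated[node] = current[node] + "(" + ",".join(neighbourhood) + ")"
        current = updated
        tokens.extend(current.values())
    return tokens

# Heavy-atom graph of ethanol: C-C-O
adjacency = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}

print(wl_subgraph_tokens(adjacency, labels, iterations=1))
# ['C', 'C', 'O', 'C(C)', 'C(C,O)', 'O(C)']
```

Each molecule then plays the role of a "document" and its tokens the role of "words", so a doc2vec-style model can be trained on any number of unlabelled molecules to produce one embedding vector per molecule.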

The graph2vec embedder was then trained on 38k molecules (note that this requires 38k DFT calculations). Using this representation, the accuracy for the reaction classifier increased to 74%, which is a significant improvement compared to Morgan fingerprints. The classifier was then applied to the 38k molecules and 824 were predicted to be reactive. Twenty of these were selected for experimental validation and 16 (80%) were shown to be reactive. That's not a bad hit rate!

I was not aware of graph2vec before reading this paper, and it seems like a very promising alternative to graph convolution, especially in the low-data regime.


Friday, December 30, 2022

On the potentially transformative role of auxiliary-field quantum Monte Carlo in quantum chemistry: A highly accurate method for transition metals and beyond

James Shee, John L. Weber, David R. Reichman, Richard A. Friesner, and Shiwei Zhang (2022)
Highlighted by Jan Jensen

Figure 1 from this paper. (c) the authors

This paper highlights a big problem in the field of quantum chemistry and posits that a solution may be right around the corner. The problem is that we still can't routinely predict the thermochemistry of TM-containing compounds with the same degree of accuracy as we can for organic molecules. The main reason is that the former systems often have a high degree of non-dynamic correlation, which means that CCSD(T) often does not give reliable results. We can model the non-dynamic correlation with CASSCF, but there is no good way to compute the dynamic correlation based on a CASSCF wavefunction. So when different DFT functionals give wildly different predictions for your TM compound, there is no way to tell which method, if any, is the best.

This paper argues that phaseless auxiliary-field quantum Monte Carlo (ph-AFQMC) may be the solution to this problem. ph-AFQMC represents the ground state as a stochastic linear combination of Slater determinants mapped as open-ended random walks starting from a trial wavefunction. The method accounts for both non-dynamic and dynamic correlation and the paper argues that chemical accuracy can be achieved with a few hundred random walks, which can be run in parallel and on GPUs.

So what's missing? According to the authors some of the improvements needed include: more efficient ways of reaching the CBS limit, more efficient random walks and a general, automatable protocol to generate optimal trial wave functions. Let's hope these improvements will be made soon, so we can explore a much larger portion of chemical space with confidence.


Wednesday, November 30, 2022

Quantum Chemical Data Generation as Fill-In for Reliability Enhancement of Machine-Learning Reaction and Retrosynthesis Planning

Alessandra Toniato, Jan P. Unsleber, Alain C. Vaucher, Thomas Weymuth, Daniel Probst, Teodoro Laino, and Markus Reiher (2022)
Highlighted by Jan Jensen

Part of Figure 7 from the paper. (c) The authors 2022. Reproduced under the CC BY NC ND 4.0 license

This is the first paper I have seen on combining automated QM-reaction prediction with ML-based retrosynthesis prediction. The idea itself is simple: for ML predictions with low confidence (i.e. few examples in the training data), can automated QM-reaction prediction be used to check whether the proposed reaction is feasible, i.e. whether it is the reaction path with the lowest barrier? If so, it could also be used to augment the training data.

The paper considers two examples using the Chemoton 2.0 method: one where the reaction is an elementary reaction and one where there are two steps (the Friedel-Crafts reaction shown above). It works pretty well for the former, but runs into problems for the latter.

One problem for non-elementary reactions is that one can't predict which atoms are chemically active from the overall reaction. Chemoton therefore must consider reactions involving all atom pairs and preferably more pairs of atoms simultaneously. The number of required calculations quickly gets out of hand and the authors conclude that "For such multistep reactions, new methods to identify the individual elementary steps will have to be developed to maintain the exploration within tight bounds, and hence, within reasonable computing time." 
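
To get a feel for how quickly this grows, here is a back-of-the-envelope count. It is only an upper-bound sketch: it assumes every unordered atom pair is a candidate and that "more pairs simultaneously" means any two distinct pairs, which is not necessarily how Chemoton enumerates its trial reactions. The function name n_pair_candidates is made up for illustration.

```python
from math import comb

def n_pair_candidates(n_atoms):
    """Count candidate reactive sites under the simple assumptions stated above."""
    single_pairs = comb(n_atoms, 2)      # all unordered atom pairs
    double_pairs = comb(single_pairs, 2) # any two distinct pairs at once
    return single_pairs, double_pairs

for n_atoms in (10, 20, 50):
    print(n_atoms, n_pair_candidates(n_atoms))
# 10 (45, 990)
# 20 (190, 17955)
# 50 (1225, 749700)
```

Each candidate can require one or more reaction-path calculations, so even a modest 50-atom system is already far from routine under these assumptions.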

However, even when they specify the two elementary steps for the Friedel-Crafts reaction, their method fails to find the second elementary step. The reason for this failure is not clear, but it could be due to the semiempirical xTB method used for efficiency.

So the paper presents an interesting and important challenge to the computational chemistry community. I wish more papers did this.
