tag:blogger.com,1999:blog-3298010970074740972024-03-13T02:47:04.574+01:00Computational Chemistry HighlightsImportant recent papers in computational and theoretical chemistry
<br>A free resource for scientists run by scientistsComputational Chemistry Highlightshttp://www.blogger.com/profile/12737582958414627004noreply@blogger.comBlogger451125tag:blogger.com,1999:blog-329801097007474097.post-43657777621311764312024-02-28T13:36:00.001+01:002024-02-28T13:36:54.785+01:00AiZynth Impact on Medicinal Chemistry Practice at AstraZeneca<p><a href="http://doi.org/10.1039/D3MD00651D" target="_blank">Jason D. Shields, Rachel Howells, Gillian Lamont, Yin Leilei, Andrew Madin, Christopher E. Reimann, Hadi Rezaei, Tristan Reuillon, Bryony Smith, Clare Thomson, Yuting Zheng and Robert E. Ziegler (2024)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjckev66uIWcrAxZt-oD9muT-9WFGVo-JdlRcMbPjQWNju7zGBs7IRKGL9ygWxdb4mGmqSuYZynv228d7l8c_ROYBzArsK9g5tFhmdVfb-xOxvsctUALfXxadqiM4bCkxSvR_b0cBNOu8TX2jyxOSH-btdbUBhZ0MPt_UAzP4s9H9cbG8kGrhZ0NT6VsHjP/s1564/Screenshot%202024-02-28%20at%2010.02.26.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1268" data-original-width="1564" height="518" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjckev66uIWcrAxZt-oD9muT-9WFGVo-JdlRcMbPjQWNju7zGBs7IRKGL9ygWxdb4mGmqSuYZynv228d7l8c_ROYBzArsK9g5tFhmdVfb-xOxvsctUALfXxadqiM4bCkxSvR_b0cBNOu8TX2jyxOSH-btdbUBhZ0MPt_UAzP4s9H9cbG8kGrhZ0NT6VsHjP/w640-h518/Screenshot%202024-02-28%20at%2010.02.26.png" width="640" /></a></p><p style="text-align: center;">Figure 3 from <a href="http://doi.org/10.1186/s13321-020-00472-1" target="_blank">this paper</a> (c) the authors 2020. Reproduced under the CC-BY license</p><p>This is one of the rare papers where experimental chemists talk candidly about their experiences using ML models developed by others. 
In this case it is AiZynthFinder, which is developed at AstraZeneca Gothenburg and predicts retrosynthetic paths, while the users are mostly synthetic chemists at AstraZeneca in the UK, US, and China. The paper is really well written and well worth reading. I'll just include a few quotes below to whet your appetite. </p><p>"New users of AI tools in general are often disappointed by the failure of AI to live up to their expectations, and chemists' interaction with AiZynth is no exception. The first molecule that most new users test is one that they have personally synthesised recently, and AiZynthFinder rarely replicates their route exactly. Due in part to our self-imposed requirement to run fast searches, AiZynthFinder often gets close to a good route. Thus, experienced users seek inspiration from AiZynth rather than perfection."</p><p>"Common problems include proposals that would lead to undesired regioselectivity, functional group incompatibility, or overgeneralisation of precedented reactions to an inappropriate context."</p><p>"Early problems also included protection/deprotection cycles, which had to be intentionally penalised in order to focus AiZynth on productive chemistry. We have found that protecting group strategy is still best decided by the chemist. 
Thus, the AI proposals discussed in the case studies do not make heavy use of protecting groups, whereas several of the laboratory syntheses do."</p><p><br /></p><p><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a Creative Commons Attribution 4.0 International License.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-12808455011999870442024-01-31T13:09:00.002+01:002024-01-31T13:09:46.097+01:00TS-Tools: Rapid and Automated Localization of Transition States Based on a Textual Reaction SMILES Input<p><a href="https://doi.org/10.26434/chemrxiv-2024-st2tr" target="_blank">Thijs Stuyver (2024)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBBblV_ya1qvaQISP_8mgDfXCzmARZye9lVkntvGtWk4T0oWdiqsgn3gCxlcRd_SFzoI-clFOAtCdjsLp0dNX6cKZh2H1jW9hscq_n9qAZJmO1C7rdEEfmAUEi2GmexnTE3V1tLeKPYrhsoUrH7XY5KqsiJ1RyERQlwY1nlfcRkrECtOz5LKCjymf_ikoe/s1486/Screenshot%202024-01-31%20at%2012.56.35.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1222" data-original-width="1486" height="526" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBBblV_ya1qvaQISP_8mgDfXCzmARZye9lVkntvGtWk4T0oWdiqsgn3gCxlcRd_SFzoI-clFOAtCdjsLp0dNX6cKZh2H1jW9hscq_n9qAZJmO1C7rdEEfmAUEi2GmexnTE3V1tLeKPYrhsoUrH7XY5KqsiJ1RyERQlwY1nlfcRkrECtOz5LKCjymf_ikoe/w640-h526/Screenshot%202024-01-31%20at%2012.56.35.png" width="640" /></a><div style="text-align: center;">Figure 2 from the paper. (c) the author 2024 reproduced under the CC-BY-NC-ND licence</div><div><br /></div><div>This paper caught my eye for several reasons. 
It's an <a href="https://github.com/chimie-paristech-CTM/TS-tools" target="_blank">open source</a> implementation of Maeda's AFIR method, but modified for double-ended TS searches. The setup is completely automated and interfaced to xTB so it is fast. It's applied to really challenging problems such as solvent assisted bimolecular reactions and uncovers some important shortcomings of the xTB method. <br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a Creative Commons Attribution 4.0 International License.<div class="separator" style="clear: both; text-align: center;"><br /></div><br /></div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-13748795012492778682023-12-30T12:53:00.003+01:002023-12-30T12:53:54.953+01:00Accurate transition state generation with an object-aware equivariant elementary reaction diffusion model<p><a href="https://doi.org/10.1038/s43588-023-00563-7" target="_blank">Chenru Duan, Yuanqi Du, Haojun Jia, and Heather J. 
Kulik (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh77I4Jdkj63-wfz0VvQTbi31q-uR9GnTCp32KXwdtJXD1GWXPD14CqRdUrJgh3bVngP3wTzSuWCO0L_XebxulYBkHlEVq6ZLulCrOgRqybtHhLKjSgm7jWQ85kAkifUldXSFrxg__3UIkeKmYm1_k1qgsAx8cVZHa_dzg45j6JZOe_MfcFYU4g232jn5LU/s1872/Screenshot%202023-12-30%20at%2011.45.31.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="648" data-original-width="1872" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh77I4Jdkj63-wfz0VvQTbi31q-uR9GnTCp32KXwdtJXD1GWXPD14CqRdUrJgh3bVngP3wTzSuWCO0L_XebxulYBkHlEVq6ZLulCrOgRqybtHhLKjSgm7jWQ85kAkifUldXSFrxg__3UIkeKmYm1_k1qgsAx8cVZHa_dzg45j6JZOe_MfcFYU4g232jn5LU/w640-h222/Screenshot%202023-12-30%20at%2011.45.31.png" width="640" /></a></p><p style="text-align: center;">Part of Figure 1 from the paper. </p><p>As anyone who has tried it will know, finding TSs is one of the most difficult, fiddly, and frustrating tasks in computational chemistry. While there are several methods aimed at automating the process, they tend to have a mixed success rate or be computationally expensive and, often, both.</p><p>This paper looks to be an important first step in the right direction. The method produces a guess at a TS structure based on the coordinates of the reactants and products. Notably, the input structures need not be aligned or atom mapped! </p><p>The method achieves a median RMSD of 0.08 Å compared to the true TSs and is often so good that a single point energy evaluation gives a reliable barrier. The method also provides a confidence score for uncertainty quantification, which allows you to judge a priori whether such a single point is sufficient or whether a TS search is warranted. 
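As an aside, RMSD values like these are only meaningful after the two structures have been optimally superimposed. A minimal numpy sketch of the standard Kabsch procedure (my own illustration, not the authors' code):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two conformations (N x 3 coordinate arrays, same atom
    ordering) after removing the optimal translation and rotation."""
    P = P - P.mean(axis=0)                  # centre both structures
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)       # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against improper rotations
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1)))
```

A structure that is merely rotated and translated gives an RMSD of essentially zero, so only genuine geometric differences contribute.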
The approach allows for accurate reaction barrier estimation (2.6 kcal/mol) with DFT optimizations needed for only 14% of the most challenging reactions.</p><p>So, the method's not going to do away with manual TS searches entirely, but it is going to be invaluable for large scale screening studies. As the authors note, the method can likely also be adapted to the prediction of barrier heights, which could potentially be used to pre-screen reactions on a much, much bigger scale. </p><p>The paper is an important proof-of-concept study, but the method needs to be trained on much larger data sets (note that it is only trained on C, N, and O containing molecules), which are non-trivial to obtain. But the method could likely be used to obtain these data sets in an iterative fashion.<br /><br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a Creative Commons Attribution 4.0 International License.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-301682249830874172023-11-30T12:15:00.003+01:002023-11-30T12:15:54.901+01:00Growing strings in a chemical reaction space for searching retrosynthesis pathways<p><a href="https://doi.org/10.26434/chemrxiv-2023-rmkwg" target="_blank">Federico Zipoli, Carlo Baldassari, Matteo Manica, Jannis Born, and Teodoro Laino (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqj9Nyv1PYbqR21jS-I2YjGPeHf6ZwG_65EuhusK62yeDCxC6Z3kZASYg453gWfsW3EPR6aRGzhdFxGLNR_vV_B7IV3BsctXh-ZlCOCsmXmq2DwBPI4JTGDSgDgPPXr7OlSuquUKHFttf6xsM1-rdB33qRdVTskFKBDKPar6uvwGGj3fYZ3Y1Cp8tccYX5/s1432/Screenshot%202023-11-29%20at%2015.26.14.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="724" 
data-original-width="1432" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqj9Nyv1PYbqR21jS-I2YjGPeHf6ZwG_65EuhusK62yeDCxC6Z3kZASYg453gWfsW3EPR6aRGzhdFxGLNR_vV_B7IV3BsctXh-ZlCOCsmXmq2DwBPI4JTGDSgDgPPXr7OlSuquUKHFttf6xsM1-rdB33qRdVTskFKBDKPar6uvwGGj3fYZ3Y1Cp8tccYX5/w640-h324/Screenshot%202023-11-29%20at%2015.26.14.png" width="640" /></a><br /><div style="text-align: center;">Part of Figure 10 from the paper. (c) The authors 2023. Reproduced under the CC-NC-ND</div><div><br /></div><div>Prediction of retrosynthetic reaction trees is typically done by stringing together individual retrosynthetic steps that have the highest predicted confidences. The confidence is typically related to the frequency of the reaction in the training set. This approach has two main problems that this paper addresses. One problem is that "rare" reactions are seldom selected even if they might actually be the most appropriate for a particular problem. The other problem is that you only use local information and miss the "strategical decisions typical of a multi-step synthesis conceived by a human expert".</div><div><br /></div><div><div>This paper tries to address these problems by doing the selection of steps differently. The key is to convert the reactions (which are encoded as reaction SMILES) to <a href="https://doi.org/10.1038/s42256-020-00284-w" target="_blank">a fingerprint</a>, i.e. a numerical representation of the reaction SMILES, and use them to compute similarity scores.</div><div><br /></div><div>For example, in the first step you can use the fingerprint to ensure a diverse selection of reactions to start the synthesis with. In subsequent steps, you can concatenate the individual reaction fingerprints (i.e. the growing string) to compute similarities to reaction paths, rather than individual steps. 
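In code, the growing-string comparison is just fingerprint concatenation followed by a similarity measure. A toy numpy sketch, with random vectors standing in for the learned reaction fingerprints (everything here is illustrative, not the authors' implementation):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def path_similarity(path_a, path_b):
    """Similarity between two reaction paths of equal length, where each
    path is a list of per-step reaction fingerprints that gets
    concatenated into one "growing string" vector."""
    return cosine(np.concatenate(path_a), np.concatenate(path_b))
```

Two paths that share their early steps score higher than paths that diverge immediately, which is what lets the search reward human-like multi-step strategies.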
By selecting paths that are most similar to the training data you could incorporate the "strategical decisions typical of a multi-step synthesis conceived by a human expert". Very clever!</div><div><br /></div><div>The main problem is how to show that this approach produces better retrosynthetic predictions. One metric might be shorter paths, and the authors do note this, but I didn't see any data; it's also not necessarily the best metric since, for example, important protection/deprotection steps could be missing. The best approach is for synthetic experts to weigh in, but that's hard to do for enough reactions to get good statistics. Perhaps this <a href="https://arxiv.org/abs/2310.19796" target="_blank">recent approach</a> would work?</div><div><br /></div><div><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a Creative Commons Attribution 4.0 International License.</div></div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-7568004006476243552023-10-31T15:42:00.003+01:002023-10-31T15:42:54.561+01:00Few-Shot Learning for Low-Data Drug Discovery<p><a href="https://doi.org/10.1021/acs.jcim.2c00779" target="_blank">Daniel Vella and Jean-Paul Ebejer (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha_40ICsxJCDOtpnqQFj7WDTb02QIqWj9EwEc77QJGKnspnDDRrElJohXGLaUWsyRzQnca9Q29Aw4lngnPhVaHynSr1Gr2pARFsW2_6CQLbuun-MTx5uBR7iP8rR21j0aPyk70spQ092jMPFPc-UVpC3rJ8qh6mXcTsGY_xTv0c-OI9iljTD5poalcVoxP/s558/images_large_ci2c00779_0011.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="558" data-original-width="491" height="640" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha_40ICsxJCDOtpnqQFj7WDTb02QIqWj9EwEc77QJGKnspnDDRrElJohXGLaUWsyRzQnca9Q29Aw4lngnPhVaHynSr1Gr2pARFsW2_6CQLbuun-MTx5uBR7iP8rR21j0aPyk70spQ092jMPFPc-UVpC3rJ8qh6mXcTsGY_xTv0c-OI9iljTD5poalcVoxP/w564-h640/images_large_ci2c00779_0011.jpeg" width="564" /></a></div><div class="separator" style="clear: both; text-align: center;">TOC graphic from the article</div><p>This paper is an update and expansion to <a href="http://doi.org/10.1021/acscentsci.6b00367" target="_blank">this seminal paper</a> by Pande and co-workers (you should definitely read both). It compares the ability to distinguish active and inactive compounds for few-shot methods to more conventional approaches for very small datasets. It concludes that the former outperform the latter for some data sets and not for others, which is surprising given that few-shot methods are designed with very small data sets in mind.</p><p>Few shot methods learn a graph-based embedding that minimizes the distance between samples and their respective class prototypes while maximizing the distance between samples and other class prototypes (where prototypes often are the geometric center of a group of molecules). The training set, which is composed of a "query set" that you are trying to match to a "support set", is typically small and changes for each epoch (which is now called an episode) to avoid overfitting.</p><p>In this paper, the largest support set was composed of 20 molecules (10 actives and 10 inactives) sampled (together with the query set) from a set of 128 molecules with a 50/50 split of actives and inactives. The performance was then compared to RF and GNN models trained on 20 molecules.</p><p>My main takeaway from the paper was actually how well the conventional models performed. 
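As an aside, the nearest-prototype classification rule fits in a few lines of numpy (a toy illustration of the classification step only, not the paper's learned graph-based embedding):

```python
import numpy as np

def nearest_prototype(query, support_X, support_y):
    """Classify a query by the nearest class prototype, where each
    prototype is the mean (geometric centre) of the support samples
    belonging to that class."""
    classes = np.unique(support_y)
    prototypes = np.array([support_X[support_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(prototypes - query, axis=1)
    return classes[np.argmin(dists)]
```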
This is especially notable given that the conventional models actually had a smaller training set: the few-shot methods saw all 128 molecules over the course of training, whereas the conventional methods only saw a subset.</p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-24946239308064899652023-09-30T11:50:00.001+02:002023-09-30T11:50:12.015+02:00Ranking Pareto optimal solutions based on projection free energy<p><a href="https://doi.org/10.1103/PhysRevMaterials.7.093804" target="_blank">Ryo Tamura, Kei Terayama, Masato Sumita, and Koji Tsuda (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /></p><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhoB-B5GejFh3flJ7WHVJmQm7GTz5jXXoVkf5xroMWSxb1NSya5qTQogrUrmbgh9vSEFLJ2ikVNVAbKMrcFPgxddQAQ1V5pYn5-6QKsB4WjWJu51uGdzfUVYkjrqrGoIe8-NSZj61rYfpPhlI-zRUckWjlE4uQgmBSDUoIaZaLasOwfajdBKrTFxWLhFcL/s906/Screenshot%202023-09-30%20at%2011.31.21.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="822" data-original-width="906" height="363" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhoB-B5GejFh3flJ7WHVJmQm7GTz5jXXoVkf5xroMWSxb1NSya5qTQogrUrmbgh9vSEFLJ2ikVNVAbKMrcFPgxddQAQ1V5pYn5-6QKsB4WjWJu51uGdzfUVYkjrqrGoIe8-NSZj61rYfpPhlI-zRUckWjlE4uQgmBSDUoIaZaLasOwfajdBKrTFxWLhFcL/w400-h363/Screenshot%202023-09-30%20at%2011.31.21.png" width="400" /></a></div><p></p><p style="text-align: center;">Figure 1 from the paper. (c) APS 2023. 
Reproduced under the CC-BY license.</p><p>One of the main challenges in multi-objective optimisation is how to weigh the different objectives to get the desired results. Pareto optimisation can in principle solve this problem, but if you get too many solutions you have to select a subset for testing, which basically involves (manually) weighing the importance of each objective.</p><p>This paper proposes a new way to select the potentially most interesting candidates. The idea is basically to identify the most "novel" candidates to maximise the chances of finding "interesting" properties. They do this by identifying points on the Pareto front with the lowest "density of states" for each objective, i.e. points with few examples in property space.</p><p>The method is presented as a post hoc selection method, but could also be used as a search criterion to help focus the search on these areas of property space. </p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><br style="background-color: white; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.2px;" /></div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-25792817974174893882023-08-30T14:52:00.003+02:002023-08-30T14:52:48.333+02:00Accelerated dinuclear palladium catalyst identification through unsupervised machine learning<p><a href="http://doi.org/10.1126/science.abj0999" target="_blank">Julian A. Hueffel, Theresa Sperger, Ignacio Funes-Ardoiz, Jas S. 
Ward, Kari Rissanen, Franziska Schoenebeck (2021</a>)<br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHMHgqVFkTpH3rVn4t-gDGyV6eSWyExhI1AGTI5lluqsAzWIvgP0QEVrMBCOoo3n5WBRIu64ExkKMfBO3XxNkriOrvqyhaG76mb0qItHKgJtgVgeMheWT1ZTnhQti_sw5UcSa8Yi9Nf4iERQaEx3Smtc7RMuA5lADO0980mGMR_VCraei-ffE0UCktQjwR/s1388/Screenshot%202023-08-30%20at%2014.08.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1266" data-original-width="1388" height="584" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHMHgqVFkTpH3rVn4t-gDGyV6eSWyExhI1AGTI5lluqsAzWIvgP0QEVrMBCOoo3n5WBRIu64ExkKMfBO3XxNkriOrvqyhaG76mb0qItHKgJtgVgeMheWT1ZTnhQti_sw5UcSa8Yi9Nf4iERQaEx3Smtc7RMuA5lADO0980mGMR_VCraei-ffE0UCktQjwR/w640-h584/Screenshot%202023-08-30%20at%2014.08.10.png" width="640" /></a></p><p></p><div style="text-align: center;">Figure 1 from the paper. (c) 2021 the authors.</div><br />I've been meaning to highlight this paper for years but forgot. However, in the last week k-means clustering came up twice in two completely unrelated contexts, which reminded me of this beautiful paper where the authors managed to use ML to make successful predictions based on only five data points! <p></p><p>Pd catalysts can exist in either a dimer or monomer form depending on the ligands and there are no heuristic rules for predicting what form will be favoured by a particular ligand. Even DFT-computed dimerization energies fail to give consistent predictions.</p><p>The authors started with a database of 348 ligands each characterised with 28 different descriptors, which were divided into eight groups by <a href="https://youtu.be/4b5d3muPQmA?si=Hz6W8KLf7F-WxWlZ" target="_blank">k-means clustering</a> of the descriptors. The four ligands known to favour dimer formation were found in two clusters, with a combined size of 89 ligands. 
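This cluster-then-shortlist strategy is easy to reproduce in outline. A scikit-learn sketch with made-up 2D descriptors (the paper used 28 real ligand descriptors and eight clusters; everything below is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def shortlist_by_cluster(X, known_hit_idx, n_clusters=8, seed=0):
    """Cluster descriptor vectors with k-means and return the indices of
    all ligands that share a cluster with at least one known hit."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    hit_clusters = set(labels[known_hit_idx])
    return np.where(np.isin(labels, list(hit_clusters)))[0]
```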
The prediction is thus that these 89 ligands are more likely to favour dimer formation, compared to the other 256. </p><p>The authors decided to focus on the 66 ligands in the 89 subset that contain P-C bonds and computed 42 new DFT-based descriptors that explicitly address dimer formation, such as the dimerization energy. Based on these and the old descriptors the authors grouped the 66 ligands into six clusters, where two of the clusters, with a combined size of 25, contained the four known dimer-ligands. The prediction is thus that the other 21 ligands should also form dimers.</p><p>It's a little unclear, but from what I can tell the authors then experimentally tested nine of the 21 ligands, of which seven formed dimers. That's a very good hit rate starting from five data points!</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-85222798656279970972023-07-31T14:50:00.001+02:002023-07-31T15:49:59.367+02:00Real-World Molecular Out-Of-Distribution: Specification and Investigation<p><a href="https://doi.org/10.26434/chemrxiv-2023-q11q4-v2" target="_blank">Prudencio Tossou, Cas Wognum, Michael Craig, Hadrien Mary, Emmanuel Noutahi (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuY5B6JsFvFVNWR0P7ljE06ZLvzVYvwNBiIAlQNui3eoElSxuTLCbY2kYfkyjYpJ0z6VaI1EJAWbn8VCUBWbQqPRSL9RDwnOBnB-SFlS-pw35xZTTaj3q9G-qKb9LskARexEPP44XKvIT-bPUTaBQi289KCmIanu-jTwBOuFC9OdMU7kve3SHo8E64pAw3/s556/Screenshot%202023-07-31%20at%2012.43.35.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="389" data-original-width="556" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuY5B6JsFvFVNWR0P7ljE06ZLvzVYvwNBiIAlQNui3eoElSxuTLCbY2kYfkyjYpJ0z6VaI1EJAWbn8VCUBWbQqPRSL9RDwnOBnB-SFlS-pw35xZTTaj3q9G-qKb9LskARexEPP44XKvIT-bPUTaBQi289KCmIanu-jTwBOuFC9OdMU7kve3SHo8E64pAw3/s320/Screenshot%202023-07-31%20at%2012.43.35.png" width="320" /></a></div><div style="text-align: left;"></div><div style="text-align: center;">Part of Figure 1 from <a href="https://vectorinstitute.ai/wp-content/uploads/2021/08/ds_project_report_final_august9.pdf" target="_blank">this report</a></div><p></p><p>Why do ML models perform much worse on different test sets? There can be many reasons for such a shift in performance, but the main culprit is often a covariate shift, meaning that the training and test set are quite different. This study seeks to quantify this effect for different molecular representations, ML algorithms, and datasets (both regression and classification).</p><p>The authors find that the difference between the test and train error (from a random split) is mostly governed by the representation (as opposed to the ML algorithm). Furthermore, representations that result in shorter distances between molecules (specifically 5-NN distances) on average are the ones that give a smaller difference in error between training and test set. However, those representations do not necessarily result in lower test set errors. 
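The 5-NN distance diagnostic is simple to compute for your own representations; a scikit-learn sketch (illustrative, not the authors' exact protocol):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_knn_distance(train_X, test_X, k=5):
    """Mean distance from each test point to its k nearest training
    neighbours -- a simple proxy for train/test covariate shift."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_X)
    dists, _ = nn.kneighbors(test_X)
    return float(dists.mean())
```

A test set drawn from the same distribution as the training set gives a smaller mean 5-NN distance than a shifted one, so the number can be used to compare candidate splits.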
</p><p>So while you can't use representation distances to pick the representation, you can use them to pick the best splitting method for obtaining your training set. The best test set is the one with the shortest overall representation distance to the deployment set (i.e. the set you want to use your ML model on). The authors find that the best splitting method depends on the representation but is often scaffold splitting. </p><p>Thanks to Cas Wognum for a very helpful discussion.</p><p><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-77841528736269378132023-06-26T15:54:00.005+02:002023-06-26T15:54:51.120+02:00Evolutionary Multiobjective Optimization of Multiligand Metal Complexes in Diverse and Vast Chemical Spaces<p><a href="https://doi.org/10.26434/chemrxiv-2023-k3tf2" target="_blank">Hannes Kneiding, Ainara Nova, David Balcells (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGq7GnzlqLVEcAdtLp8fHXmOYLMOa1aJ01waWdxT6siSupLnOv5_MPkZfU0ULrJg0HlAmSP-2p1IevhZwEnOSWSxzKWY6o2mGCYKlWJIkqvBjbNOcHgyzAtVeikJIg3xtWnalHJumbYzn65u6jSW7e9ltD1d46rTtJ1dTT6N18LpPxjZ3C4ipSdDiitx1b/s933/Screenshot%202023-06-26%20at%2015.01.33.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="933" data-original-width="752" height="640" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGq7GnzlqLVEcAdtLp8fHXmOYLMOa1aJ01waWdxT6siSupLnOv5_MPkZfU0ULrJg0HlAmSP-2p1IevhZwEnOSWSxzKWY6o2mGCYKlWJIkqvBjbNOcHgyzAtVeikJIg3xtWnalHJumbYzn65u6jSW7e9ltD1d46rTtJ1dTT6N18LpPxjZ3C4ipSdDiitx1b/w516-h640/Screenshot%202023-06-26%20at%2015.01.33.png" width="516" /></a></div><div style="text-align: center;">Figure 5 from the paper. (c) 2023 the authors. Reproduced under the CC BY ND license</div><p>The authors show that an NBO analysis can be used to identify the charges (as well as their coordination mode) of individual ligands in TM-complexes. This is a key property needed to properly characterise the ligands and, thus, the complex as a whole. They have manually checked the approach for 500 compounds and find that it gives reasonable results in 95% of the cases. That number drops to 92% if coordination mode is also considered. They <a href="https://github.com/hkneiding/tmQMg-L" target="_blank">provide</a> these, and many other, properties of 30K ligands extracted from the CSD.</p><p>The NBO analysis is based on PBE/TZV//PBE/DZV calculations, which are a bit costly, but it will be interesting to see whether lower levels of theory (e.g. DZV//xTB) give similar results.</p><p>Based on this knowledge the authors build a data set of 1.37B square-planar Pd compounds and compute their polarizability and HOMO-LUMO gap. They then search this space for molecules with both large polarizabilities and HOMO-LUMO gaps using a genetic algorithm that optimises the Pareto front, and show that optimum solutions can be found by considering only 1% of the entire space. 
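For context, the Pareto front such a GA maintains is just the set of non-dominated solutions. A brute-force numpy sketch for the case where both objectives (e.g. polarizability and HOMO-LUMO gap) are maximised:

```python
import numpy as np

def pareto_front(points):
    """Indices of non-dominated points when maximising every objective.
    A point is dominated if some other point is at least as good in all
    objectives and strictly better in at least one."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    return keep
```

This O(n²) scan is fine for GA population sizes; dedicated non-dominated sorting is used when populations get large.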
The GA code is not available yet, but should be released soon.</p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-85532135976612566962023-05-30T15:53:00.001+02:002023-05-30T15:53:16.622+02:00Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability<div class="separator" style="clear: both; text-align: left;"><div style="text-align: left;"><a href="https://arxiv.org/abs/2305.08746" style="text-align: left;" target="_blank">Ziming Liu, Eric Gan, Max Tegmark (2023)</a></div><span style="text-align: left;"><div style="text-align: left;">Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></div></span></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUT9rJ3TWkDGjtB1APU7nRzfy2o5nu0GLaiD0UdZOuq9eBKUGXAP8oOOJ38y0BilAQRELbzZN909-tGIY9NC76VT34XLDUxMRC_glpcqpgs_9VDzWxkCTQfsrMAzcMECOfQYHTtTA69E5pgmh7vzIBQCxwDoc65soZJYWAGOMw2EzpAldDaZollPCCNQ/s1160/Screenshot%202023-05-30%20at%2015.21.56.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="777" data-original-width="1160" height="428" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUT9rJ3TWkDGjtB1APU7nRzfy2o5nu0GLaiD0UdZOuq9eBKUGXAP8oOOJ38y0BilAQRELbzZN909-tGIY9NC76VT34XLDUxMRC_glpcqpgs_9VDzWxkCTQfsrMAzcMECOfQYHTtTA69E5pgmh7vzIBQCxwDoc65soZJYWAGOMw2EzpAldDaZollPCCNQ/w640-h428/Screenshot%202023-05-30%20at%2015.21.56.png" width="640" /></a></div><p></p><div style="text-align: center;">Adapted from Figures 1 and 3 in the paper. 
(c) 2023 the authors </div><br />While this fascinating paper is not about chemistry, it could easily be applied to chemical problems without further modifications (except for graph convolution), so I feel justified in highlighting it here.<p></p><p>The paper introduces brain-inspired modular training (BIMT), which leads to relatively simple NNs that are easier to interpret. "Brain-inspired" comes from the fact that the brain is not fully connected like most NNs, since it is a 3D entity with physical connections (axons) and longer axons mean slower communication between neurons. The idea is to enforce this modularity during training by assigning positions to individual nodes and introducing a length-dependent penalty in the loss function (in addition to conventional L1 regularisation). This is combined with a swap operation that can swap neurons to decrease the loss.</p><p>The result is much simpler networks that, at least for relatively simple objectives, are intuitive and easier to interpret as you can see from the figure above. 
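A toy version of the penalty is easy to write down: give each neuron a 1D position and add a length-weighted term on top of plain L1 regularisation (a sketch of the idea only; the hyperparameter names are my own, not the authors'):

```python
import numpy as np

def bimt_penalty(W, pos_in, pos_out, l1=1e-3, length=1e-3):
    """Sketch of a BIMT-style penalty on a weight matrix W (out x in):
    conventional L1 plus an extra cost proportional to the geometric
    length of each connection, given 1D neuron positions."""
    d = np.abs(pos_out[:, None] - pos_in[None, :])  # pairwise connection lengths
    absW = np.abs(W)
    return l1 * absW.sum() + length * (d * absW).sum()
```

Long connections cost more than short ones of the same magnitude, so training is pushed towards spatially local, modular wiring.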
</p><p>The code is available <a href="https://github.com/KindXiaoming/BIMT" target="_blank">here</a> (<a href="https://colab.research.google.com/drive/1hggc5Tae97BORVNdesLcwp9og3SmPtM7?usp=sharing" target="_blank">Google Colab version</a>). It would be very interesting to apply this to chemical problems!</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-54620014134924006512023-04-30T12:29:00.000+02:002023-04-30T12:29:07.860+02:00Virtual Ligand Strategy in Transition Metal Catalysis Toward Highly Efficient Elucidation of Reaction Mechanisms and Computational Catalyst Design<p><a href="https://doi.org/10.1021/acscatal.3c00576" target="_blank">Wataru Matsuoka, Yu Harabuchi, and Satoshi Maeda (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0vQaSSZZ1Z0DNynzA5cJxhKClK6sGu2XIySkPpYjJqvBZyV0PekAF6-9sG4AArNpPh6jxdFuLW4oHC3wmQEWxbaGFXTmHbCxo8dyp_cmba1YE3OzPIJXJtjfd5rSdkJ-1i9Berw9O-wyUBxyNWlbtCx7lAIxO2yr1Odm_yVq9K_KZW2Li0JnTsPEqTQ/s500/cs3c00576_0012.webp" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="263" data-original-width="500" height="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0vQaSSZZ1Z0DNynzA5cJxhKClK6sGu2XIySkPpYjJqvBZyV0PekAF6-9sG4AArNpPh6jxdFuLW4oHC3wmQEWxbaGFXTmHbCxo8dyp_cmba1YE3OzPIJXJtjfd5rSdkJ-1i9Berw9O-wyUBxyNWlbtCx7lAIxO2yr1Odm_yVq9K_KZW2Li0JnTsPEqTQ/w640-h336/cs3c00576_0012.webp" 
width="640" /></a></p><p>This perspective shows how an old computational tool can be adapted to serve a new purpose. When I started in compchem, changing, say, a few F atoms to H atoms in a molecule often made the difference between waiting a few days and a few weeks for the calculations to finish. People therefore developed pseudo H atoms that could mimic the electronic effect of larger atoms or even entire functional groups. Some of these methods were later adapted to serve as boundary atoms in QM/MM calculations, and now they have found a new use in screening for ligands in organometallic catalysts.</p><p>The use of pseudoatoms to model such ligands not only speeds up the individual calculations but also maps the chemical space onto just two dimensions, electronic and steric, which allows the space to be searched more efficiently. Once the desired combination of electronics and sterics is found, the corresponding real ligands are identified by a second, much faster screen of commercially available or synthetically accessible ligands.</p><p>The authors use this approach to identify two phosphine ligands for a chemoselective Suzuki–Miyaura cross-coupling catalyst, complete with experimental verification.</p><p>The downside is that the parameterisation of these "virtual ligands" is a bit involved and very ligand-dependent. 
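The two-step workflow — optimise in the two-dimensional descriptor space, then match to a real ligand — can be sketched as follows. This is my own toy illustration: the descriptor scales, the objective function, and the ligand "library" are all invented for the example.

```python
import numpy as np

# Toy sketch of the virtual-ligand workflow (all numbers and ligand names
# are invented): scan a grid of (electronic, steric) descriptor values,
# pick the best point according to some computed figure of merit, then
# match it to the nearest real ligand in descriptor space.
electronic = np.linspace(-1.0, 1.0, 21)   # e.g. an electron-donation scale
steric = np.linspace(0.0, 1.0, 21)        # e.g. a buried-volume-like scale

def objective(e, s):
    # stand-in for a computed quantity such as a barrier difference;
    # here the optimum is placed at (0.3, 0.6) by construction
    return -(e - 0.3) ** 2 - (s - 0.6) ** 2

E, S = np.meshgrid(electronic, steric, indexing="ij")
scores = objective(E, S)
i, j = np.unravel_index(np.argmax(scores), scores.shape)
best = np.array([electronic[i], steric[j]])   # the optimal "virtual ligand"

# hypothetical library of real ligands with precomputed descriptors
library = {"L1": (0.9, 0.2), "L2": (0.25, 0.65), "L3": (-0.5, 0.9)}
match = min(library, key=lambda n: np.linalg.norm(np.array(library[n]) - best))
print(best, match)
```

Only the cheap pseudoatom calculations run on the grid; the expensive real-ligand calculations are reserved for the final match.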
But an interesting approach nonetheless.</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-57191339127297236552023-03-29T10:29:00.000+02:002023-03-29T10:29:17.169+02:00eChem: A Notebook Exploration of Quantum Chemistry<p><a href="https://doi.org/10.1021/acs.jchemed.2c01103" target="_blank">Thomas Fransson, Mickael G. Delcey, Iulia Emilia Brumboiu, Manuel Hodecker, Xin Li, Zilvinas Rinkevicius, Andreas Dreuw, Young Min Rhee, and Patrick Norman (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-ZyH5G-rE4qWmf_5tEGTOunsTts7EBsMXrY4zr9FruG-ZVSZqf9rCXmNrVJNhUTYSeqgTwGRjnfWQQ_hRB00z-VBpMGGS-LMmbngt8g4raQ6T1SN4nwKYAqQ02tMgjTb1mgjk83Tu9VVW6OCoZglWJ7VCckF-3XoC2X-PIJNhDaad9SFCDXo3BeOqfQ/s500/ed2c01103_0004.webp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="269" data-original-width="500" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-ZyH5G-rE4qWmf_5tEGTOunsTts7EBsMXrY4zr9FruG-ZVSZqf9rCXmNrVJNhUTYSeqgTwGRjnfWQQ_hRB00z-VBpMGGS-LMmbngt8g4raQ6T1SN4nwKYAqQ02tMgjTb1mgjk83Tu9VVW6OCoZglWJ7VCckF-3XoC2X-PIJNhDaad9SFCDXo3BeOqfQ/w640-h344/ed2c01103_0004.webp" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><a href="https://doi.org/10.30746/978-91-988114-0-7" target="_blank">eChem</a> is an e-book that mixes text and code to teach quantum 
chemistry. The code is based on <a href="https://veloxchem.org/docs/intro.html" target="_blank">VeloxChem</a>, which is a Python-based open source quantum chemistry software package. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">While you can use VeloxChem to perform standard quantum chemical calculations, the really cool thing is that it gives you easy access to the <a href="https://kthpanor.github.io/echem/docs/elec_struct/orbitals.html" target="_blank">basis set</a>, <a href="https://kthpanor.github.io/echem/docs/elec_struct/integrals.html" target="_blank">integrals and orbitals</a>, <a href="https://kthpanor.github.io/echem/docs/elec_struct/kernel_int.html" target="_blank">DFT grids and functionals</a>, etc. This in turn allows you to write your own <a href="https://kthpanor.github.io/echem/docs/elec_struct/hf_scf.html" target="_blank">SCF</a> or <a href="https://kthpanor.github.io/echem/docs/elec_struct/dft_scf.html" target="_blank">Kohn-Sham-SCF</a> procedure. It's sorta like <a href="https://www.amazon.com/Modern-Quantum-Chemistry-Introduction-Electronic/dp/0486691861" target="_blank">Szabo and Ostlund</a> updated and taken to the next level. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">If you truly want to understand quantum chemistry this is the way to go! One of the co-authors, Xin Li, very kindly got it <a href="https://colab.research.google.com/drive/1o1IfBPVa0e1VVs4TwqMM4qBECL3xsDFj?usp=sharing" target="_blank">working on Google Colab</a>, so it is very easy to start playing around with it yourself. 
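To give a flavour of what "writing your own SCF" looks like, here is a bare-bones restricted Hartree-Fock loop of my own. The integrals are invented numbers for a two-basis-function, two-electron toy system; in eChem they would of course come from VeloxChem.

```python
import numpy as np

# Bare-bones restricted Hartree-Fock SCF (my own sketch; toy integrals,
# not real ones - in eChem these matrices come from VeloxChem).
S = np.array([[1.0, 0.45], [0.45, 1.0]])            # overlap matrix
Hcore = np.array([[-1.12, -0.96], [-0.96, -1.12]])  # one-electron Hamiltonian

# two-electron integrals (pq|rs) with full 8-fold permutational symmetry
eri = np.zeros((2, 2, 2, 2))
for (p, q, r, s), v in {(0, 0, 0, 0): 0.77, (1, 1, 1, 1): 0.77,
                        (0, 0, 1, 1): 0.57, (0, 1, 0, 1): 0.44,
                        (0, 0, 0, 1): 0.30, (0, 1, 1, 1): 0.30}.items():
    for idx in {(p, q, r, s), (q, p, r, s), (p, q, s, r), (q, p, s, r),
                (r, s, p, q), (s, r, p, q), (r, s, q, p), (s, r, q, p)}:
        eri[idx] = v

w, U = np.linalg.eigh(S)
X = U @ np.diag(w ** -0.5) @ U.T      # symmetric orthogonalisation, S^(-1/2)

D, E = np.zeros((2, 2)), 0.0
for _ in range(50):
    J = np.einsum("pqrs,rs->pq", eri, D)   # Coulomb matrix
    K = np.einsum("prqs,rs->pq", eri, D)   # exchange matrix
    F = Hcore + J - 0.5 * K                # Fock matrix
    eps, Cp = np.linalg.eigh(X @ F @ X)    # solve FC = SCe in the orthogonal basis
    C = X @ Cp
    D_new = 2.0 * C[:, :1] @ C[:, :1].T    # one doubly occupied MO
    E = 0.5 * np.sum(D_new * (Hcore + F))  # electronic energy
    if np.linalg.norm(D_new - D) < 1e-8:   # converged?
        break
    D = D_new
print(E)
```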
</div><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-79945661845290535582023-02-27T15:34:00.001+01:002023-02-27T15:34:47.324+01:00Prediction of High-Yielding Single-Step or Cascade Pericyclic Reactions for the Synthesis of Complex Synthetic Targets<p><a href="https://doi.org/10.1021/jacs.2c09830" target="_blank">Tsuyoshi Mita, Hideaki Takano, Hiroki Hayashi, Wataru Kanna, Yu Harabuchi, K. N. Houk, and Satoshi Maeda (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><p><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw0hSwZVUWiFXJaZzn_ShtWvkGDNJXUpygBzmLajYma3SFj94L57ZUDk0LbCRqVfDKtBNowV5o4yqLdPUYFK4h4uzaPf2uBG4uipqR440lx6bJiEeOQMnsfCDo46IOnh09OxKTtFMyx8IwWcj2pirNUSuoJQ1KUugwV--9gX7aJRLtYCWAYJ2tXb11uA/s500/images_medium_ja2c09830_0015.gif" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="217" data-original-width="500" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw0hSwZVUWiFXJaZzn_ShtWvkGDNJXUpygBzmLajYma3SFj94L57ZUDk0LbCRqVfDKtBNowV5o4yqLdPUYFK4h4uzaPf2uBG4uipqR440lx6bJiEeOQMnsfCDo46IOnh09OxKTtFMyx8IwWcj2pirNUSuoJQ1KUugwV--9gX7aJRLtYCWAYJ2tXb11uA/w640-h278/images_medium_ja2c09830_0015.gif" width="640" /></a></p><p>This paper has been on my to-do list for a while, but <a href="https://www.science.org/content/blog-post/pericyclic-reactions-predicted" target="_blank">Derek Lowe beat me to it</a> (<a href="http://www.compchemhighlights.org/2023/01/machine-learning-guided-discovery-of.html" target="_blank">again</a>). 
DFT-based reaction prediction has yet to make an impact on synthesis planning due to the many complexities we still cannot handle efficiently, such as solvent effects in ionic mechanisms (very hard to predict accurately), catalysts and additives, chirality, and, well, just the sheer size of the reaction space. </p><p>While these things will be dealt with in good time, it makes sense to see if there are any low-hanging fruits that can be picked under the current limitations that still have "real life" applications. And this study did just that, by choosing pericyclic reactions. These are very popular reactions in organic synthesis that require neither catalysts nor additives and have minimal solvent effects. Furthermore, some uses of these reactions in natural product synthesis can be very hard to spot, even for seasoned synthetic chemists, and the authors show that their algorithm can predict them <i>a priori</i>. So this could potentially be a useful tool for specific types of synthesis planning.<br /><br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-20395501652381923812023-01-30T11:06:00.001+01:002023-01-30T11:06:42.991+01:00Machine-Learning-Guided Discovery of Electrochemical Reactions<p>Andrew F. Zahrt, Yiming Mo, Kakasaheb Y. Nandiwale, Ron Shprints, Esther Heid, and Klavs F. 
Jensen (2022)<br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfvNkdahV2tEYdkzwgh1yCALoReswx_WpkUTZMc52hbd7TRTl6Q_MAjlEz9g0BYgKpwzlqtZVo4VLqiZC449FA2aWEgFnZJNvZ7lsYV4DL6vselEbyIsBg4LF829KldPUL8LOy1ZjznuFX2yaSTfXP03U5svcSMnhG7ock5tsOrbaq34nM0zDgIYMQWg/s500/ja2c08997_0011.webp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="281" data-original-width="500" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfvNkdahV2tEYdkzwgh1yCALoReswx_WpkUTZMc52hbd7TRTl6Q_MAjlEz9g0BYgKpwzlqtZVo4VLqiZC449FA2aWEgFnZJNvZ7lsYV4DL6vselEbyIsBg4LF829KldPUL8LOy1ZjznuFX2yaSTfXP03U5svcSMnhG7ock5tsOrbaq34nM0zDgIYMQWg/w640-h360/ja2c08997_0011.webp" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><a href="https://www.science.org/content/blog-post/searching-wilderness-new-chemistry" target="_blank">Derek Lowe has highlighted the chemical aspects of this work already</a>, so here I focus on the machine learning, which is pretty interesting. The authors want to predict whether a molecule will react with 4-dicyanobenzene anion after it is oxidized at a cathode. They have 141 data points of which 42% show a reaction.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">They tested several classification models using Morgan fingerprints as the molecular representation, but got an accuracy of only 60%. They then reasoned that the accuracy could be improved by using DFT features. However, rather than using molecular features they decided to use atomic features from an NBO analysis on the radical cation, the neutral molecule, and the radical anion. 
The feature vector was then tested on several data sets and shown to perform well.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The question is then how to combine the atomic feature vectors to a molecular representation for the reaction classification. The usual way is graph convolution but that'll require more than 141 data points to optimise. So instead they use <a href="https://github.com/benedekrozemberczki/graph2vec" target="_blank">graph2vec</a>, which is an unsupervised learning method so it is easy to create arbitrarily large training sets. Graph2vec is analogous to word2vec (or, more accurately, doc2vec) which creates vector representations of words by predicting context in text (i.e. words that often appear close to the word of interest). For graph2vec the context is subgraphs of the input graph. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The graph2vec embedder was then trained on 38k molecules (note that this requires 38k DFT calculations). Using this representation, the accuracy for the reaction classifier increased to 74%, which is a significant improvement compared to Morgan fingerprints. The classifier was then applied to the 38k molecules and 824 were predicted to be reactive. Twenty of these were selected for experimental validation and 16 (80%) were shown to be reactive. 
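As an aside, the "words" that graph2vec builds its vocabulary from are rooted subgraphs obtained by iterative Weisfeiler-Lehman relabelling. A minimal sketch of that extraction step (my own illustration; the real implementation hashes the labels and feeds the resulting "documents" to a doc2vec model):

```python
# My own minimal sketch of graph2vec's "vocabulary": each graph becomes a
# document whose words are Weisfeiler-Lehman subtree labels (real graph2vec
# hashes these strings before training the doc2vec-style embedder).
def wl_words(adj, labels, iterations=2):
    """adj: node -> list of neighbours; labels: node -> initial label."""
    words = list(labels.values())      # depth-0 words: the node labels
    current = dict(labels)
    for _ in range(iterations):
        # relabel each node with its label plus the sorted neighbour labels
        current = {v: current[v] + "(" + ",".join(sorted(current[u] for u in adj[v])) + ")"
                   for v in adj}
        words.extend(current.values()) # deeper rooted-subtree words
    return words

# methane-like toy graph: a carbon bonded to four hydrogens
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
labels = {0: "C", 1: "H", 2: "H", 3: "H", 4: "H"}
print(wl_words(adj, labels, iterations=1))
```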
That's not a bad hit rate!</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">I was not aware of graph2vec before reading this paper and it seems like a very promising alternative to graph convolution, especially in the low data regime.</div><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-23058544272362176052022-12-30T12:30:00.002+01:002023-02-01T07:38:25.255+01:00On the potentially transformative role of auxiliary-field quantum Monte Carlo in quantum chemistry: A highly accurate method for transition metals and beyond<p><a href="https://doi.org/10.26434/chemrxiv-2022-cw19h" target="_blank">James Shee, John L. Weber, David R. Reichman, Richard A. 
Friesner, and Shiwei Zhang (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCiiJdnKbsfSzFWDz2nIu5AB3M5wq4HprsIYwe8DNsrLYURS9JgrLwmLjeRS391EISaErHYSeaC8ECEG09XDk9AxQRh-zHeWjBNOVhvkagmoMyIYpkfmBpeGKUpuqYgpaGO1B9x5znBvg_k7fiBrU5iy6jY81TtrTX3Xb_autFaGLfVVNqYT-AU2YJrA/s2124/Screenshot%202022-12-30%20at%2011.21.04.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="956" data-original-width="2124" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCiiJdnKbsfSzFWDz2nIu5AB3M5wq4HprsIYwe8DNsrLYURS9JgrLwmLjeRS391EISaErHYSeaC8ECEG09XDk9AxQRh-zHeWjBNOVhvkagmoMyIYpkfmBpeGKUpuqYgpaGO1B9x5znBvg_k7fiBrU5iy6jY81TtrTX3Xb_autFaGLfVVNqYT-AU2YJrA/w640-h288/Screenshot%202022-12-30%20at%2011.21.04.png" width="640" /></a></div><div style="text-align: center;">Figure 1 from <a href="https://arxiv.org/abs/1711.02242" target="_blank">this paper</a>. (c) the authors</div><p></p><div>This paper highlights a big problem in the field of quantum chemistry and posits that a solution may be right around the corner. The problem is that we still can't routinely predict the thermochemistry of TM-containing compounds with the same degree of accuracy as we can for organic molecules. The main reason is that the former systems often have a high-degree of non-dynamic correlation which means that our CCSD(T) often does not give reliable results. We can model the non-dynamic correlation with CASSCF, but there is no good way to compute the dynamic correlation based on a CASSCF wavefunction. 
So when different DFT functionals give wildly different predictions for your TM compound, there is no way to tell which method, if any, is the best.</div><div><br /></div><div>This paper argues that phaseless auxiliary-field quantum Monte Carlo (ph-AFQMC) may be the solution to this problem. ph-AFQMC represents the ground state as a stochastic linear combination of Slater determinants mapped as open-ended random walks starting from a trial wavefunction. The method accounts for both non-dynamic and dynamic correlation and the paper argues that chemical accuracy can be achieved with a few hundred random walks, which can be run in parallel and on GPUs.</div><div><br /></div><div>So what's missing? According to the authors some of the improvements needed include: more efficient ways of reaching the CBS limit, more efficient random walks, and a general, automatable protocol to generate optimal trial wave functions. Let's hope these improvements will be made soon, so we can explore a much larger portion of chemical space with confidence.</div><div><br /></div><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-42987183946678136472022-11-30T14:39:00.001+01:002022-11-30T14:39:43.956+01:00Quantum Chemical Data Generation as Fill-In for Reliability Enhancement of Machine-Learning Reaction and Retrosynthesis Planning<p><a href="https://doi.org/10.26434/chemrxiv-2022-gd0q9" target="_blank">Alessandra Toniato, Jan P. Unsleber, Alain C. 
Vaucher, Thomas Weymuth, Daniel Probst, Teodoro Laino, and Markus Reiher (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWSukLbGuOs0xqHQSjGmPI4tGHJIHST05noK5M88zJyDRx1ju2ArbR3rLKNq45gUEB3HftoOEhUDsekIWbJ5TgHzW5fm5D-y9iG6FNNa-d7PJG9W6zayVX5BEFVkdqfoAubt7ChyMYdmiRYtLX24rIG4Fh-Ql0R3TgwyRXK1p1tA8FrrNTAAOKyU7n3A/s1920/Screenshot%202022-11-30%20at%2014.09.13.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="902" data-original-width="1920" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWSukLbGuOs0xqHQSjGmPI4tGHJIHST05noK5M88zJyDRx1ju2ArbR3rLKNq45gUEB3HftoOEhUDsekIWbJ5TgHzW5fm5D-y9iG6FNNa-d7PJG9W6zayVX5BEFVkdqfoAubt7ChyMYdmiRYtLX24rIG4Fh-Ql0R3TgwyRXK1p1tA8FrrNTAAOKyU7n3A/w640-h300/Screenshot%202022-11-30%20at%2014.09.13.png" width="640" /></a></div><div style="text-align: center;">Part of Figure 7 from the paper. (c) The authors 2022. Reproduced under the CC BY NC ND 4.0 license</div><p></p><p>This is the first paper I have seen on combining automated QM-reaction prediction with ML-based retrosynthesis prediction. The idea itself is simple: for ML-predictions with low confidence (i.e. few examples in the training data) can automated QM-reaction prediction be used to check whether the proposed reaction is feasible, i.e. whether it is the reaction path with the lowest barrier? If so, it could also be used to augment the training data.</p><p>The paper considers two examples using the <a href="https://arxiv.org/abs/2202.13011" target="_blank">Chemoton 2.0 method</a>: one where the reaction is an elementary reaction and one where there are two steps (the Friedel-Crafts reaction shown above). 
It works pretty well for the former, but runs into problems for the latter.</p><p>One problem for non-elementary reactions is that one can't predict which atoms are chemically active from the overall reaction. Chemoton must therefore consider reactions involving all atom pairs, and possibly several pairs of atoms simultaneously. The number of required calculations quickly gets out of hand, and the authors conclude that "For such multistep reactions, new methods to identify the individual elementary steps will have to be developed to maintain the exploration within tight bounds, and hence, within reasonable computing time." </p><p>However, even when they specify the two elementary steps for the Friedel-Crafts reaction, their method fails to find the second elementary step. The reason for this failure is not clear but could be due to the semiempirical xTB method used for efficiency.</p><p>So the paper presents an interesting and important challenge to the computational chemistry community. I wish more papers did this.</p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-27185266137508626412022-10-31T15:13:00.001+01:002022-10-31T15:13:31.350+01:00Semiempirical Hamiltonians learned from data can have accuracy comparable to Density Functional Theory<p><a href="https://arxiv.org/abs/2210.11682" target="_blank">Frank Hu, Francis He, David J. 
Yaron (2022)</a> <br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1Ts6CLH06mzF3UvSzxtwAhfHW9Dc_HfbTqC90UkfyN_y3zV52fUSYWebLl2p6l_CRCQkPy6AK1UWY7nTnRU9_KoRQjPtYJCddxz8G92pVlgKWNraJNVOkKcTB6iYiXhQMn5F5PvIaLfKCjG7wRTxtXYV2qn8de-tenFK6Yo3IYoxGdOB5k2nQUQ5PLw/s1282/Screenshot%202022-10-31%20at%2014.37.29.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="450" data-original-width="1282" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1Ts6CLH06mzF3UvSzxtwAhfHW9Dc_HfbTqC90UkfyN_y3zV52fUSYWebLl2p6l_CRCQkPy6AK1UWY7nTnRU9_KoRQjPtYJCddxz8G92pVlgKWNraJNVOkKcTB6iYiXhQMn5F5PvIaLfKCjG7wRTxtXYV2qn8de-tenFK6Yo3IYoxGdOB5k2nQUQ5PLw/w640-h224/Screenshot%202022-10-31%20at%2014.37.29.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;">Figure 7 from the paper. (c) The authors 2022. Reproduced under the BY-NC-ND licence</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">This paper uses ML techniques and algorithms (specifically PyTorch) to fit DFTB parameters, which results in a semiempirical quantum method (SQM) that has an accuracy similar to DFT. The advantage of such a physics-based method over a pure ML-based is that it is likely to be more transferable and requires much less training data. 
This should make it much easier to extend to other elements and new molecular properties, such as barriers.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Parameterising SQMs is notoriously difficult as the molecular properties depend exponentially on many of the parameters. As a result, most SQMs used today have been parameterised by hand. The paper presents several methodological tricks to automate the fitting.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">One is the use of high-order polynomial spline functions to describe how the Hamiltonian elements depend on the fitting parameters. The functions allow the computation not only of the first derivative needed for back propagation, but also of high-order derivatives, which are used for regularisation to avoid overfitting and to keep the parameters physically reasonable. Finally, the SCF and training loops are inverted so that the charge fluctuations needed for the Fock operator are updated based on the current model parameters every 10 epochs. This enables computationally efficient back propagation during training, which is important because the training set is on the order of 100k.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Another neat feature is that the final model is simply a parameter file (SKF file), which can be read by most DFTB programs. So there is nothing new for the user to implement. 
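The derivative-based regularisation trick mentioned above is easy to sketch. This is my own toy illustration, using a plain polynomial instead of the paper's splines and made-up reference data:

```python
import numpy as np

# Toy sketch of the regularisation idea (my own illustration, not the
# paper's code): model a DFTB Hamiltonian element H(r) as a polynomial,
# fit it to reference data, and penalise large high-order derivatives to
# keep the curve smooth and physically reasonable.
rng = np.random.default_rng(0)
r = np.linspace(1.0, 4.0, 40)                         # interatomic distances
target = np.exp(-r) + 0.02 * rng.normal(size=r.size)  # noisy reference values

def loss(coeffs, lam=1e-3):
    poly = np.poly1d(coeffs)
    mse = np.mean((poly(r) - target) ** 2)  # fit to the reference data
    d3 = np.polyder(poly, 3)                # third derivative, in closed form
    return mse + lam * np.mean(d3(r) ** 2)  # high-order derivative penalty

coeffs = np.polyfit(r, target, 5)           # unregularised starting point
print(loss(coeffs, lam=0.0), loss(coeffs, lam=1e-3))
```

In the actual work the analogous penalty enters a PyTorch loss so the parameters themselves are optimised by back propagation; here the point is just that a polynomial's high-order derivatives are available in closed form.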
However, currently the implementation is only for CNHO.</div><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<p></p>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-33298346160935488832022-09-30T10:49:00.003+02:002022-09-30T10:49:54.732+02:00Active Learning for Small Molecule pKa Regression; a Long Way To Go<p><a href="https://doi.org/10.26434/chemrxiv-2022-8w1q0" target="_blank">Paul G. Francoeur, Daniel Peñaherrera, and David R. Koes (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYlrycYY61sJi1PeDM9M14YQx3fBE6_UvzqBVVgrRpR5FqatvBC12LQMAFu5MKH1NkdwCNK1npAnOc8suX_IMzwfzu007kc4NDStOVq6O0ohvZaptW5UdIyK4K35aVv6zFg7CCN1tKp3Ga0IlWJhk0yvCycVE6c3pULxkMHcHVIV5PjAX4M7UR02e2VA/s801/Screenshot%202022-09-29%20at%2014.27.58.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="477" data-original-width="801" height="382" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYlrycYY61sJi1PeDM9M14YQx3fBE6_UvzqBVVgrRpR5FqatvBC12LQMAFu5MKH1NkdwCNK1npAnOc8suX_IMzwfzu007kc4NDStOVq6O0ohvZaptW5UdIyK4K35aVv6zFg7CCN1tKp3Ga0IlWJhk0yvCycVE6c3pULxkMHcHVIV5PjAX4M7UR02e2VA/w640-h382/Screenshot%202022-09-29%20at%2014.27.58.png" width="640" /></a></div><p style="text-align: center;">Parts of Figures 5 and 6. (c) The authors 2022. Reproduced under the CC-BY licence</p><p>One approach to active learning is to grow the training set with molecules for which the current model has the highest uncertainties. 
However, according to this study, this approach does not seem to work for small-molecule pKa prediction, where active learning and random selection give the same results (within the relatively high standard deviations) for three different uncertainty estimates. </p><p>The authors show that there are molecules in the pool that can increase the initial accuracy drastically, but that the uncertainties don't seem to help identify these molecules. The green curve above is obtained by exhaustively training a new model for every molecule in the pool during each step of the active learning loop and selecting the molecule that gives the largest increase in accuracy for the test set. Note that the accuracy decreases towards the end, meaning that including some molecules in the training set diminishes the performance.</p><p>The authors offer the following explanation for their observations: "We propose that the reason active learning failed in this pKa prediction task is that all of the molecules are informative."</p><p>That's certainly not hard to imagine given the small size of the initial training set (50). It would have been very instructive to see the distribution of uncertainties for the initial models. Does every molecule have roughly the same (high) uncertainty? If so, the uncertainties would indeed not be informative. </p><p>Also, uncertainties only correlate with (random) errors on average. The authors did try adding molecules in batches, but the batch size was only 10. </p><p>It would have been interesting to see the performance if one used the actual error, rather than the uncertainties, to select molecules. 
That would test the case where uncertainties correlate perfectly with the errors.</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-34378537612586468882022-08-30T14:29:00.003+02:002022-09-12T09:49:33.188+02:00Is there evidence for exponential quantum advantage in quantum chemistry?<p><a href="https://arxiv.org/abs/2208.02199" target="_blank">Seunghoon Lee, Joonho Lee, Huanchen Zhai, Yu Tong, Alexander M. Dalzell, Ashutosh Kumar, Phillip Helms, Johnnie Gray, Zhi-Hao Cui, Wenyuan Liu, Michael Kastoryano, Ryan Babbush, John Preskill, David R. Reichman, Earl T. Campbell, Edward F. Valeev, Lin Lin, Garnet Kin-Lic Chan (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /></p><div style="text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh28nPfndd20bRDQjQ3jFrywFux_jeuZ4DqCysq4so9Ed6rqEWerRRIQhVn56NeXEqShwG5PGGFDkMJjgSpQ0B86wN151qn4oET-E1BvWZAPeGQ-crRzPwxwjxkNb3vQJlfnxYlpYmm8eDPfDcdMlF1Sle3J0pp4Dqu0_rnSmSJPaSVvwNjRucUe2HBYA/s652/Screenshot%202022-08-30%20at%2013.36.02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="349" data-original-width="652" height="342" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh28nPfndd20bRDQjQ3jFrywFux_jeuZ4DqCysq4so9Ed6rqEWerRRIQhVn56NeXEqShwG5PGGFDkMJjgSpQ0B86wN151qn4oET-E1BvWZAPeGQ-crRzPwxwjxkNb3vQJlfnxYlpYmm8eDPfDcdMlF1Sle3J0pp4Dqu0_rnSmSJPaSVvwNjRucUe2HBYA/w640-h342/Screenshot%202022-08-30%20at%2013.36.02.png" width="640" /></a></div><p></p><p style="text-align: center;">Figure 1 from the 
paper. (c) 2022 the authors. Reproduced under the CC-BY licence.</p><p><a href="https://cen.acs.org/articles/95/i43/Chemistry-quantum-computings-killer-app.html" target="_blank">Quantum chemical calculations are widely seen as one of quantum computing's killer apps.</a> This paper examines the available evidence for this assertion and doesn't find any. </p><p>The potential of quantum computing rests on two assumptions: that the cost of quantum computer calculations on chemical systems scales polynomially with system size, while the corresponding calculations on classical computers scale exponentially. </p><p>The former assumption is true for the actual quantum "computation" and the latter assumption is true for the Full CI solution. However, this paper suggests that preparing the state for the quantum "computation" may scale exponentially with system size, that we don't need Full CI accuracy, and that chemically accurate coupled-cluster-based methods scale polynomially with system size for a given desired accuracy.</p><p>The argument for the potential exponential scaling for system preparation is as follows: If you want the energy of the ground state you have to provide a guess at the ground state wavefunction that resembles the exact wavefunction as much as possible. More precisely, the probability of obtaining the ground state energy scales as $S^{-2}$, where S is the overlap between the trial and exact wavefunction. The authors show that $S$ scales exponentially with system size for a series of Fe-S clusters, which suggests an overall exponential dependence for the quantum computations.</p><p>The argument for polynomial scaling of chemically accurate quantum chemistry calculations has two parts: "normal" organic molecules and strongly correlated systems. 
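As a small aside, the $S^{-2}$ sampling cost is easy to illustrate numerically. This is my own toy sketch, not from the paper:

```python
import numpy as np

# Toy illustration (my own, not from the paper) of the S^-2 cost: measuring
# a trial state in the eigenbasis of H yields the ground-state energy with
# probability |<trial|ground>|^2 = S^2, so ~1/S^2 repetitions are needed.
rng = np.random.default_rng(1)
H = rng.normal(size=(8, 8))
H = (H + H.T) / 2                          # a random Hermitian "Hamiltonian"
evals, evecs = np.linalg.eigh(H)
ground = evecs[:, 0]                       # exact ground state

trial = ground + 0.5 * rng.normal(size=8)  # imperfect trial wavefunction
trial /= np.linalg.norm(trial)

S_overlap = abs(trial @ ground)            # overlap with the exact ground state
p_ground = (evecs.T @ trial)[0] ** 2       # probability of measuring E_0
print(p_ground, S_overlap ** 2, 1 / S_overlap ** 2)
```

If the overlap decays exponentially as the system grows, the expected number of repetitions grows exponentially, which is the paper's worry about state preparation.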
</p><p>The former is pretty straightforward: no one knowledgeable is really arguing that CCSD(T)-level accuracy is insufficient for ligand-protein binding energies, and CCSD(T) scales polynomially with system size. So the simple notion of accelerating drug discovery by computing this with quantum computers does not hold water.</p><p>However, CCSD(T) does not work for strongly correlated systems and we don't have any real practical alternative for which we can test the scaling. Instead the authors look at simpler models of strongly correlated systems and demonstrate polynomial scaling with system size. </p><p>As the authors are careful to point out, none of this represents a rigorous proof of anything. But it is far from obvious that quantum chemistry is the killer app for quantum computing that most people seem to think it is. </p><p><a href="https://www.youtube.com/watch?v=O-uSrQuxV68&t=992s" target="_blank">In addition to the paper you can find a very clear lecture on the topic here.</a><br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-42646023106962791272022-07-31T13:31:00.001+02:002022-07-31T13:31:24.554+02:00Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization<p><a href="https://arxiv.org/abs/2206.12411" target="_blank">Wenhao Gao, Tianfan Fu, Jimeng Sun, Connor W. 
Coley (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgys0RbjqHvj-iuPuINfUuuLNp6a79gfGQwG7A05vI6WB_2pAwFwNom5jaFbBY4ZQ6e-nxaVZk0XCXUVdD7NxL0ZPsMq9UxDM66opGtppUJf5Ir5pHktJ24vn9BuRzZR88TnEpZri1QInOODMCTh5LdQd5nY8AfvGCk4-QC4bReGxXg71GhPsntz87D_Q/s2126/Screenshot%202022-07-31%20at%2013.04.08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="656" data-original-width="2126" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgys0RbjqHvj-iuPuINfUuuLNp6a79gfGQwG7A05vI6WB_2pAwFwNom5jaFbBY4ZQ6e-nxaVZk0XCXUVdD7NxL0ZPsMq9UxDM66opGtppUJf5Ir5pHktJ24vn9BuRzZR88TnEpZri1QInOODMCTh5LdQd5nY8AfvGCk4-QC4bReGxXg71GhPsntz87D_Q/w640-h198/Screenshot%202022-07-31%20at%2013.04.08.png" width="640" /></a></p><p></p><div style="text-align: center;">Figure 1 from the paper. (c) The authors 2022. Reproduced under the CC-BY license.</div><br />The development of generative models that can find molecules with certain properties has become very popular, but there are very few studies that compare them, so it's hard to know what works best. This study compares the performance of 25 different generative models in 23 different optimisation tasks and draws some very interesting conclusions.<p></p><p>None of these methods find the optimum value given a "budget" of 10,000 oracle evaluations, and for some tasks the best performance is not exactly impressive. This doesn't bode well for some real-life applications where even a few hundred property evaluations are challenging. </p><p>Some methods are slower to converge than others, so you might draw completely different conclusions regarding efficiency if you allowed 100,000 oracle evaluations. Similarly, some methods have high variability in performance, so you might draw very different conclusions from 1 run compared to 10 runs.
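The run-to-run variability point can be made concrete with a toy simulation (all numbers invented, not taken from the benchmark): method A is better on average but noisy, method B is slightly worse but reproducible, and in a single head-to-head run B still "wins" a substantial fraction of the time.

```python
import random

def single_run_win_fraction(trials: int = 100_000) -> float:
    """Fraction of single runs in which the low-variance method B beats
    the higher-mean, high-variance method A (toy score distributions)."""
    random.seed(0)
    wins_b = 0
    for _ in range(trials):
        a = random.gauss(0.80, 0.15)  # method A: better mean, large spread
        b = random.gauss(0.75, 0.02)  # method B: slightly worse, reproducible
        if b > a:
            wins_b += 1
    return wins_b / trials

print(single_run_win_fraction())
```

With these made-up distributions B wins roughly a third of single runs, even though A is clearly better in expectation.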
This is especially important for problems where you can only afford one run. It might be better to choose a method that performs slightly worse on average but is less variable, rather than risk a bad run from a highly variable method that performs better on average.</p><p><a href="http://doi.org/10.1186/s13321-017-0235-x">The method that performed best overall</a> is one of the oldest methods, published in 2017! </p><p>Food for thought.</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-44515234192015828892022-06-29T12:50:00.000+02:002022-06-29T12:50:10.680+02:00Deep Learning Metal Complex Properties with Natural Quantum Graphs<p><a href="https://doi.org/10.26434/chemrxiv-2022-fd43k" target="_blank">Hannes Kneiding, Ruslan Lukin, David Balcells (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcX2tUCPWymuvdtuOJLbNJCMh9gWqz3vgpGhlT7jBLO4ILsOSBGWZmZOkdx5pXZPq9cwFquO8CfOo-MMOqrl-ZTs1qaNzeKIy8wpO9LLU4u0vMY_HQnMLuDEbeXgvndBguKElmYCjjN8kMgSEjxiiIGeNTaBybC5BzJ868BPR3TZXg_kSU30-RRWdExQ/s1634/Screenshot%202022-06-29%20at%2010.33.21.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1082" data-original-width="1634" height="424"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcX2tUCPWymuvdtuOJLbNJCMh9gWqz3vgpGhlT7jBLO4ILsOSBGWZmZOkdx5pXZPq9cwFquO8CfOo-MMOqrl-ZTs1qaNzeKIy8wpO9LLU4u0vMY_HQnMLuDEbeXgvndBguKElmYCjjN8kMgSEjxiiIGeNTaBybC5BzJ868BPR3TZXg_kSU30-RRWdExQ/w640-h424/Screenshot%202022-06-29%20at%2010.33.21.png" width="640" /></a><br /></p><div style="text-align: center;">Figure 2 from the paper (c) The authors. Reproduced under the CC-BY-NC-ND 4.0 license</div><div style="text-align: center;"><br /></div><div style="text-align: left;">While there's been a huge amount of ML work on organic molecules, there has been comparatively little on transition metal complexes (TMCs). One of the reasons is that many of the cheminformatics tools we take for granted are harder to apply to TMCs due to their more complex bonding situations. This makes bond perception and the computation of node features like formal atomic charges, and hence graph representations, quite tricky. This, in turn, makes standard ML tools like binary fingerprints or graph-convolutional NNs hard to apply to TMCs.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">This paper suggests using data from DFT/NBO calculations to create so-called "quantum graphs", where the edges are determined using both bonding orbitals and bond orders, while node and edge features are derived from other NBO properties.</div><p></p><p>This representation is combined with two graph-NN methods (MPNN and MXMNet) and trained against DFT properties such as the HOMO-LUMO gap. The results are quite good and generally better than radius graph methods such as SchNet. However, one should keep in mind that both the descriptors and properties are computed with DFT.</p><p>Given that the computational cost of the descriptors is basically the same as that of the property of interest, this is a proof-of-concept paper that shows the utility of the general idea. However, it remains to be seen whether cheaper descriptors (e.g.
based on semi-empirical calculations) result in similar performance. Still, given the current scarcity of ML tools for TMCs, this is a very welcome advance.</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><p><br /></p>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-16745874335981732452022-05-30T12:57:00.000+02:002022-05-30T12:57:18.146+02:00Computer-designed repurposing of chemical wastes into drugs<p><a href="https://doi.org/10.1038/s41586-022-04503-9" target="_blank">Agnieszka Wołos, Dominik Koszelewski, Rafał Roszak, Sara Szymkuć, Martyna Moskal, Ryszard Ostaszewski, Brenden T. Herrera, Josef M. Maier, Gordon Brezicki, Jonathon Samuel, Justin A. M. Lummiss, D. Tyler McQuade, Luke Rogers & Bartosz A.
Grzybowski (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /></p><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPQekWuWpc4UMPjReix_UtfeAuQ8ZjrRJLI6-lbIwHhhSbOtOEGP7IKFXINZ-N_oJ89yxoBlpADG7GPClX7QuKwlaZC7AfdsDTteArXUBAHc0W1Ntc_8WWvJULeC2M3AanFwnPG3VStIzMzLs7XbaCBA152fjUYU_8uOq7oAGnxvfr27WhkUtMpDL9mg/s517/Screenshot%202022-05-30%20at%2010.59.59.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="386" data-original-width="517" height="299" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPQekWuWpc4UMPjReix_UtfeAuQ8ZjrRJLI6-lbIwHhhSbOtOEGP7IKFXINZ-N_oJ89yxoBlpADG7GPClX7QuKwlaZC7AfdsDTteArXUBAHc0W1Ntc_8WWvJULeC2M3AanFwnPG3VStIzMzLs7XbaCBA152fjUYU_8uOq7oAGnxvfr27WhkUtMpDL9mg/w400-h299/Screenshot%202022-05-30%20at%2010.59.59.png" width="400" /></a></div><div style="text-align: center;">Figure 2a from the paper. (c) 2022 the authors</div><div style="text-align: left;"><br /></div><div>When I talk to people about retrosynthesis prediction, they often mention that synthetic chemists don't tend to use such programs. There are many reasons for that, including various shortcomings of the suggested routes, but also the fact that, from a time-saving perspective, retrosynthesis planning makes up a small part of the synthesis process. One common answer to this is "OK, but wait till the robots arrive", but there are several important applications already. </div><div><br /></div><div>For example, in my own research on de novo molecule discovery I'm often left with hundreds of promising molecules where the only remaining selection criterion is ease of synthesis. Here I routinely use retrosynthesis programs to rank the molecules in terms of the number of synthesis steps to make the shortlist of 10-20 molecules that can be presented to experimental collaborators.
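The triage step described above amounts to a simple sort. A minimal sketch (molecule names and step counts are invented; `predicted_steps` stands in for the output of whatever retrosynthesis tool is used):

```python
def make_shortlist(predicted_steps: dict, n: int = 3) -> list:
    """Return the n candidates with the fewest predicted synthesis steps."""
    return sorted(predicted_steps, key=predicted_steps.get)[:n]

# hypothetical per-molecule step counts from a retrosynthesis program
predicted_steps = {"mol_A": 7, "mol_B": 2, "mol_C": 4, "mol_D": 3, "mol_E": 9}
print(make_shortlist(predicted_steps))  # → ['mol_B', 'mol_D', 'mol_C']
```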
</div><div><br /></div><div>This paper presents another example of science that would be impossible without these computational tools. The authors search for reaction networks that connect 189 small-molecule waste by-products from the chemical industry to 4113 high-value molecules (approved drugs and agrochemicals). They use a reaction prediction algorithm called Allchemy to iteratively generate increasingly complicated molecules and, at each step, bias the search towards the target. Among the 300 million molecules that result from this process they were able to identify 167 target molecules, with an average of 216 synthetic paths per target. The synthetic paths are further ranked using a complicated scoring function that accounts for all sorts of practical considerations, since the aim is to produce large quantities of each target, and a few of the paths are experimentally verified on the kg scale.</div><div><br /></div><div>One interesting part of the approach is the prediction of reaction conditions, which is done in terms of categories: e.g. protic/aprotic and polar/nonpolar solvents, and very low, low, room temperature, high, and very high temperatures.
This makes a lot more sense to me than trying to predict the exact solvent or temperature.</div><br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<p></p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-27585571704250714252022-04-27T15:00:00.000+02:002022-04-27T15:00:14.106+02:00Pairwise Difference Regression: A Machine Learning Meta-algorithm for Improved Prediction and Uncertainty Quantification in Chemical Search<p><a href="https://doi.org/10.1021/acs.jcim.1c00670" target="_blank">Michael Tynes, Wenhao Gao, Daniel J. Burrill, Enrique R. Batista, Danny Perez, Ping Yang, and Nicholas Lubbers (2021)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyF4KydSb1MN8OZqIit-P-STOkAi048V86KDlwx6Twv9LaUxyNQi5QrwOc5Wm-UPzaI-FlUls8IrvO6S7smNf7jw46HY7jv9jjbrnTXGG6Pqcg0tM52wZT57znJsBr8WrUyOGFFeXxTyOFCon4UIhfWee6RYNM2Af3Iv2KHXoHKLZ8tRxEwLmfb1YR8g/s500/images_medium_ci1c00670_0010.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="277" data-original-width="500" height="354" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyF4KydSb1MN8OZqIit-P-STOkAi048V86KDlwx6Twv9LaUxyNQi5QrwOc5Wm-UPzaI-FlUls8IrvO6S7smNf7jw46HY7jv9jjbrnTXGG6Pqcg0tM52wZT57znJsBr8WrUyOGFFeXxTyOFCon4UIhfWee6RYNM2Af3Iv2KHXoHKLZ8tRxEwLmfb1YR8g/w640-h354/images_medium_ci1c00670_0010.gif" width="640" /></a></div><div style="text-align: center;">TOC picture from the paper (c) 2021 ACS</div><p></p>This paper tries to solve two problems at once: data augmentation
for small data sets and a method-independent uncertainty quantification (UQ). <div><br /></div><div>Data augmentation is quite common in areas like image classification, where images can be perturbed (e.g. rotated by a few degrees) and still be recognisable. However, this is difficult in chemistry, where small perturbations in structure can have a non-negligible effect on properties. For text-based molecular representations one can use non-canonical SMILES for augmentation, but there is no generally applicable method.</div><div><br /></div><div>Similarly, most UQ methods are specific to the machine learning model type, with the exception of ensemble methods, which require training and deploying several models and can therefore be expensive.</div><div><br /></div><div>The paper offers a simple solution to both. The method is trained to reproduce the ground truth <i>difference</i> for all $n^2$ molecule pairs, thereby increasing the training set size significantly. When making a prediction for a new molecule, the model predicts the differences relative to all training set molecules, with the standard deviation serving as a measure of prediction uncertainty. Pretty neat idea and easy to implement! The main change is to construct molecular representations for the molecule pairs, but the authors outline one easy-to-implement approach.</div><div><br /></div><div>Depending on the task and training set size, the data augmentation decreases the MAE by 3-40%.
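The scheme is simple enough to sketch in a few lines. What follows is my own minimal reconstruction of the idea, not the authors' code: linear least squares stands in for the random forest used in the paper, and the "molecular" features are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))  # toy "molecular" features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

# Build all n^2 pairs: features are [x_i, x_j], target is y_i - y_j
n = len(X)
pairs = np.array([np.concatenate([X[i], X[j]]) for i in range(n) for j in range(n)])
diffs = np.array([y[i] - y[j] for i in range(n) for j in range(n)])

# Fit a regressor on the pairwise differences (least squares as a stand-in)
w, *_ = np.linalg.lstsq(pairs, diffs, rcond=None)

def predict_with_uncertainty(x_new):
    """Predict the difference to every training molecule, add back the known
    value, and use the spread of the n estimates as the uncertainty."""
    feats = np.array([np.concatenate([x_new, X[j]]) for j in range(n)])
    estimates = feats @ w + y  # predicted difference + anchor value
    return estimates.mean(), estimates.std()

mean, std = predict_with_uncertainty(np.array([0.5, -0.5, 1.0]))
print(mean, std)
```

The training set grows from n to n² examples, and the prediction comes with a built-in uncertainty, regardless of which regressor is plugged in.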
UQ quality is notoriously difficult to quantify, but the method appears to give uncertainties similar to those obtained by a random forest method.<br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-58586881228176661182022-03-29T15:00:00.000+02:002022-03-29T15:00:17.449+02:00Machine Learning May Sometimes Simply Capture Literature Popularity Trends: A Case Study of Heterocyclic Suzuki−Miyaura Coupling<p><a href="https://doi.org/10.1021/jacs.1c12005" target="_blank">Wiktor Beker, Rafał Roszak, Agnieszka Wołos, Nicholas H. Angello, Vandana Rathore, Martin D. Burke, and Bartosz A. Grzybowski (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkbz3pSY3muRTd4FcJC7A8kv78mSZvn7ObN0r8TbnOuy2gqIeUKzWe_vT7_GXLCsdwYlg1xJqv0LczQU2TjOuhtalEf9Z_cTwssItp3nd85SJxoyJa3piH9Q9f78wjhRlQGkzIrjKuX2xcDnkn_AaOuBLK53IbOnyiIXi8VdFiR2URMlUcYaTmIMQOwA/s500/ja1c12005_0003.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="174" data-original-width="500" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkbz3pSY3muRTd4FcJC7A8kv78mSZvn7ObN0r8TbnOuy2gqIeUKzWe_vT7_GXLCsdwYlg1xJqv0LczQU2TjOuhtalEf9Z_cTwssItp3nd85SJxoyJa3piH9Q9f78wjhRlQGkzIrjKuX2xcDnkn_AaOuBLK53IbOnyiIXi8VdFiR2URMlUcYaTmIMQOwA/w640-h222/ja1c12005_0003.gif" width="640" /></a><br /><br /></p><p>What do you infer from this quote from the paper (emphasis added)?</p><p></p><blockquote>Another important problem, tackled herein, deals with the prediction of optimal conditions for a particular reaction in which there are
generally multiple viable choices of solvents or reagents. Several works[21−24] have attempted to use ML for the prediction of reaction conditions, and the overall message they seem to convey is that ML can, in fact, offer accurate predictions provided adequate numbers of literature examples on which to build the models (but see also critical ref 6). However, here, we demonstrate with a case study that this may have been an overoptimistic interpretation, and that even with large quantities of carefully curated literature data, ML approaches may not perform <i>considerably better </i>than estimates based on the popularity of reaction conditions reported in the literature. In other words, these ML models do not provide <i>significantly more</i> insights than just suggesting the most popular conditions which could be obtained by simple statistics over literature examples[25,26] and no “machine intelligence.”</blockquote>I can tell you what I inferred. References 21-24 used ML models to predict optimal reaction conditions, but failed to check whether they "provide significantly more insights than just suggesting the most popular conditions". I also inferred that the results from this study suggest that, had the authors checked, they would have found that not to be the case. <p></p><p>However, the four references refer to two papers (<a href="http://doi.org/10.1126/science.aar5169" target="_blank">21</a> and <a href="https://doi.org/10.1021/acs.accounts.0c00770" target="_blank">23</a>) by Doyle and co-workers on the prediction of reaction yields (<i>not conditions</i>) and two papers, one by Coley and co-workers and one by Reisman and co-workers (<a href="http://doi.org/10.1021/acscentsci.8b00357" target="_blank">22</a> and <a href="https://dx.doi.org/10.1021/acs.jcim.0c01234" target="_blank">24</a>, respectively), on the prediction of reaction conditions <i>with</i> <i>comparison to popularity baselines</i>.
</p><p>The paper looks at the prediction of solvent and base (and not catalysts and temperature, as implied by the TOC graphic above) for ca. 10,000 Suzuki coupling reactions from Reaxys. The best top-1 accuracies for base and solvent with ML are 80.6% and 51.7%, compared to popularity-baseline values of 76.8% and 29.8%. The authors use the term "significantly" (and related terms) without ever quantifying what they deem significant, but to me the ML solvent predictions seem significantly better than the popularity baseline. </p><p>Furthermore, as Coley and co-workers point out, the true metric is the accuracy of the combined prediction, e.g. correct solvent <i>and</i> base. For example, in the case of correct catalyst <i>and</i> solvent <i>and</i> reagent, Coley and co-workers found an accuracy of 57.3% compared to a popularity baseline of only 5.7%. However, I am not even certain whether Grzybowski and co-workers would deem that a significant improvement.</p><p>On a more constructive note, the topic of the paper does relate to an interesting fundamental question in ML: how to deal with imbalanced data, i.e. where there is a very popular single choice. One would perhaps naively suspect that this would be easier for a machine to learn, i.e. you just have to learn a few exceptions. But how do you typically learn exceptions? By memorising them, and we tend to employ many ML techniques to avoid just this.
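For reference, a popularity baseline is trivial to compute, which is part of the authors' point: any ML model has to clear this bar before it can claim to have learned anything. A toy example with an invented label distribution (loosely mimicking the base-prediction numbers above):

```python
from collections import Counter

def popularity_baseline_accuracy(labels):
    """Top-1 accuracy of always predicting the most common label."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

# invented distribution of base labels for illustration only
labels = ["K2CO3"] * 77 + ["Cs2CO3"] * 13 + ["K3PO4"] * 10
print(popularity_baseline_accuracy(labels))  # → 0.77
```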
</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-62425504245561242942022-02-28T13:42:00.000+01:002022-02-28T13:42:09.187+01:00Finding hits among billions of molecules<p><a href="https://doi.org/10.1038/s41586-021-04220-9" target="_blank">Assaf Alon, Jiankun Lyu, Joao M. Braz, Tia A. Tummino, Veronica Craik, Matthew J. O’Meara, Chase M. Webb, Dmytro S. Radchenko, Yurii S. Moroz, Xi-Ping Huang, Yongfeng Liu, Bryan L. Roth, John J. Irwin, Allan I. Basbaum, Brian K. Shoichet & Andrew C. Kruse. Structures of the σ2 receptor enable docking for bioactive ligand discovery (2021)</a></p><p><a href="https://doi.org/10.1038/s41586-021-04220-9" target="_blank">Arman A. Sadybekov, Anastasiia V. Sadybekov, Yongfeng Liu, Christos Iliopoulos-Tsoutsouvas, Xi-Ping Huang, Julie Pickett, Blake Houser, Nilkanth Patel, Ngan K. Tran, Fei Tong, Nikolai Zvonok, Manish K. Jain, Olena Savych, Dmytro S. Radchenko, Spyros P. Nikas, Nicos A. Petasis, Yurii S. Moroz, Bryan L.
Roth, Alexandros Makriyannis & Vsevolod Katritch. Synthon-based ligand discovery in virtual libraries of over 11 billion compounds (2021)</a></p>Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjIiJoUsdoh8D_OIjmWzZnD-LCA0LREaAObjFaouS5m1rmpvz4a_Vy0TltlAuCCjEy6EUPpAOd-OKmQkRmpZLvYbUdpZ5bdYJErWSx76ohhgNmyE6CCO1i-SBIjQ9ML_siPd1CO-MHiog6me-RgRhcqUemGiFaC2m9xWI884RLkFYmr53GZzeR337tFfQ=s1054" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="425" data-original-width="1054" height="258" src="https://blogger.googleusercontent.com/img/a/AVvXsEjIiJoUsdoh8D_OIjmWzZnD-LCA0LREaAObjFaouS5m1rmpvz4a_Vy0TltlAuCCjEy6EUPpAOd-OKmQkRmpZLvYbUdpZ5bdYJErWSx76ohhgNmyE6CCO1i-SBIjQ9ML_siPd1CO-MHiog6me-RgRhcqUemGiFaC2m9xWI884RLkFYmr53GZzeR337tFfQ=w640-h258" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;">Figure 2a and b from Alon <i>et al</i>. (c) 2021 Nature</div><p>The recent developments in make-on-demand molecular libraries present an interesting methodological challenge to virtual screening. Not too long ago, such a library would contain hundreds of millions or even a billion molecules, and there was still a chance to <a href="http://www.compchemhighlights.org/2019/02/ultra-large-library-docking-for.html" target="_blank">dock a significant portion of these libraries</a>. However, the sizes of the libraries have grown to well beyond 20 billion and show no sign of stopping. There is no way wholesale docking can keep up with this growth, so new approaches are needed. </p><p>One computational approach that has kept up with the growth of make-on-demand libraries is similarity searching. It is still possible to search these enormous libraries for similar molecules in just a few minutes. </p><p>Alon et al.
uses this general idea to select and dock 490 million molecules with properties that are similar to known binders to the target. Based on the docking scores they prioritised 577 molecules, of which 484 were successfully made and 127 showed good activity against the target. 20,000 analogues of the four best candidates are then extracted from among 28 billion molecules in the Enamine REAL Space make-on-demand library, and docked. The 105 best candidates were made and tested, leading to further improvement in the measured affinities.</p><p>Sadybekov et al. essentially docks the individual building blocks used in the make-on-demand library and then combines the best-scoring fragments into about 1 million molecules for a second round of docking. Using this approach they identified 80 promising candidates, of which 60 could be synthesised. Of these 60 molecules, 21 proved active. 920 analogues of the three best candidates are then extracted from among 11 billion molecules in the Enamine REAL Space make-on-demand library, and docked. The 121 best candidates were made and tested, leading to further improvement in the measured affinities.</p><p>There are several take-home messages here. </p><p>The percentage of active compounds against a particular target in a library is very small, so you don't get a lot of useful hits until you work with these enormous libraries.</p><p>Docking <i>does</i> help in identifying active compounds. Docking has a bad rep in certain circles, and I have seen several people refer to docking programs as "random number generators", but studies like these show that this is not the case. Sure, if one expects an excellent, or even respectable, correlation coefficient between docking scores and binding affinities, one will be sorely disappointed. However, as these studies show, molecules with good docking scores have a much higher chance of being active than molecules with bad docking scores. </p><p>The success rate seems to be about 30-50% depending on the target.
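To put such hit rates in perspective, a quick back-of-the-envelope calculation (assuming independent outcomes and the lower, 30% per-molecule hit rate): with only a handful of candidates there is a real chance of seeing no actives at all, while with dozens that chance essentially vanishes.

```python
def prob_no_actives(hit_rate: float, n_tested: int) -> float:
    """Probability that none of n_tested independent candidates is active."""
    return (1 - hit_rate) ** n_tested

print(prob_no_actives(0.30, 5))   # ~17% chance of zero actives in 5 tries
print(prob_no_actives(0.30, 30))  # essentially zero for 30 tries
```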
So if you are at the lower end and only able to make and test a handful of candidates (which is often the case for academic studies), there's a reasonable chance you won't find any actives and will conclude that docking is useless. It's only when you are able to make and test dozens of molecules that you see that docking is working for you. The make-on-demand libraries now make such numbers feasible for academics.</p><p>Finally, several of the co-authors on the two papers I highlight are Ukrainian and are, along with their families and friends, likely in grave danger right now as their country is being attacked by Putin and his ilk. </p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0