## Wednesday, November 30, 2022

### Quantum Chemical Data Generation as Fill-In for Reliability Enhancement of Machine-Learning Reaction and Retrosynthesis Planning

Part of Figure 7 from the paper. (c) The authors 2022. Reproduced under the CC BY NC ND 4.0 license

This is the first paper I have seen on combining automated QM-reaction prediction with ML-based retrosynthesis prediction. The idea itself is simple: for ML predictions with low confidence (i.e. few examples in the training data), can automated QM-reaction prediction be used to check whether the proposed reaction is feasible, i.e. whether it is the reaction path with the lowest barrier? If so, it could also be used to augment the training data.
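The idea can be sketched as a small triage routine. This is only my illustration of the logic, not the authors' workflow: the function names, the confidence threshold of 0.5, and the barrier values are all hypothetical.

```python
from typing import Dict


def needs_qm_check(ml_confidence: float, threshold: float = 0.5) -> bool:
    """Flag low-confidence ML predictions (threshold is an arbitrary assumption)."""
    return ml_confidence < threshold


def qm_confirms(barriers: Dict[str, float], proposed_product: str) -> bool:
    """True if the proposed product is the one reached via the lowest barrier."""
    return min(barriers, key=barriers.get) == proposed_product


def triage(ml_confidence: float, barriers: Dict[str, float], proposed_product: str) -> str:
    """Decide what to do with one ML-predicted reaction."""
    if not needs_qm_check(ml_confidence):
        return "accept ML prediction"
    if qm_confirms(barriers, proposed_product):
        return "confirmed by QM; candidate for training data"
    return "rejected by QM"


# Made-up barriers (kcal/mol) for two competing products of one reaction.
example_barriers = {"P1": 18.0, "P2": 25.0}
print(triage(0.2, example_barriers, "P1"))
print(triage(0.2, example_barriers, "P2"))
```

In practice the expensive part is producing the table of barriers, which is exactly where the exploration cost discussed in this post comes in.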

The paper considers two examples using the Chemoton 2.0 method: one where the reaction is an elementary reaction and one where there are two steps (the Friedel-Crafts reaction shown above). It works pretty well for the former, but runs into problems for the latter.

One problem for non-elementary reactions is that one can't predict which atoms are chemically active from the overall reaction. Chemoton therefore must consider reactions involving all atom pairs and preferably more pairs of atoms simultaneously. The number of required calculations quickly gets out of hand and the authors conclude that "For such multistep reactions, new methods to identify the individual elementary steps will have to be developed to maintain the exploration within tight bounds, and hence, within reasonable computing time."
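A back-of-the-envelope count gives a feel for how quickly this grows. It is only a sketch: it assumes every unordered atom pair is a candidate reactive site and that activating several pairs simultaneously means choosing k distinct pairs, which is a simplification and not how Chemoton actually enumerates trial reactions.

```python
from math import comb


def n_atom_pairs(n_atoms: int) -> int:
    """Number of unordered atom pairs among n_atoms atoms."""
    return comb(n_atoms, 2)


def n_pair_combinations(n_atoms: int, k: int) -> int:
    """Number of ways to pick k distinct atom pairs to activate at once."""
    return comb(n_atom_pairs(n_atoms), k)


# Single pairs grow quadratically; two pairs at once grow roughly as n^4.
for n_atoms in (10, 20, 40):
    print(n_atoms, n_atom_pairs(n_atoms), n_pair_combinations(n_atoms, 2))
```

Each of these combinations would be the starting point for at least one reaction-path search, so even modest molecules lead to a very large number of calculations.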

However, even when they specify the two elementary steps for the Friedel-Crafts reaction, their method fails to find the second elementary step. The reason for this failure is not clear but could be due to the semiempirical xTB used for efficiency.

So the paper presents an interesting and important challenge to the computational chemistry community. I wish more papers did this.

## Monday, October 31, 2022

### Semiempirical Hamiltonians learned from data can have accuracy comparable to Density Functional Theory

Frank Hu, Francis He, David J. Yaron (2022)
Highlighted by Jan Jensen

Figure 7 from the paper. (c) The authors 2022. Reproduced under the BY-NC-ND licence

This paper uses ML tools (specifically PyTorch) to fit DFTB parameters, which results in a semiempirical quantum method (SQM) with an accuracy similar to DFT. The advantage of such a physics-based method over a purely ML-based one is that it is likely to be more transferable and requires much less training data. This should make it much easier to extend to other elements and new molecular properties, such as barriers.

Parameterising SQMs is notoriously difficult as the molecular properties depend exponentially on many of the parameters. As a result, most SQMs used today have been parameterised by hand. The paper presents several methodological tricks to automate the fitting.

One is the use of high-order polynomial spline functions to describe how the Hamiltonian elements depend on the fitting parameters. The functions allow the computation not only of the first derivatives needed for back propagation, but also of higher-order derivatives, which are used for regularisation to avoid overfitting and to keep the parameters physically reasonable. Finally, the SCF and training loops are inverted so that the charge fluctuations needed for the Fock operator are updated based on the current model parameters every 10 epochs. This enables computationally efficient back propagation during training, which is important because the training set is on the order of 100k.
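To make the regularisation idea concrete, here is a minimal sketch of a loss that adds a penalty on the second derivative of a fitted function. It is not the authors' implementation: it uses a single polynomial in plain Python rather than splines in PyTorch, and the function names and the penalty weight are my own assumptions.

```python
from typing import List, Sequence


def poly(coeffs: Sequence[float], x: float) -> float:
    """Evaluate c0 + c1*x + c2*x^2 + ... using Horner's scheme."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def derivative(coeffs: Sequence[float]) -> List[float]:
    """Coefficients of the derivative polynomial."""
    return [i * c for i, c in enumerate(coeffs)][1:]


def curvature_penalty(coeffs: Sequence[float], grid: Sequence[float]) -> float:
    """Mean squared second derivative on a grid; large for wiggly functions."""
    second = derivative(derivative(coeffs))
    return sum(poly(second, x) ** 2 for x in grid) / len(grid)


def regularised_loss(coeffs, xs, ys, grid, weight: float) -> float:
    """Mean squared misfit to the data plus the weighted curvature penalty."""
    misfit = sum((poly(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return misfit + weight * curvature_penalty(coeffs, grid)


# y = x^2 fits these points exactly, so only the curvature term contributes.
print(regularised_loss([0.0, 0.0, 1.0], [0.0, 1.0, 2.0], [0.0, 1.0, 4.0], [0.0, 1.0], 0.1))
```

Because the penalty is itself a smooth function of the coefficients, it can be minimised together with the data misfit by gradient-based training.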

Another neat feature is that the final model is simply a parameter file (SKF file), which can be read by most DFTB programs. So there is nothing new for the user to implement. However, currently the implementation is only for CNHO.

## Friday, September 30, 2022

### Active Learning for Small Molecule pKa Regression; a Long Way To Go

Parts of Figures 5 and 6. (c) The authors 2022. Reproduced under the CC-BY licence

One approach to active learning is to grow the training set with molecules for which the current model has the highest uncertainties. However, according to this study, this approach does not seem to work for small-molecule pKa prediction, where active learning and random selection give the same results (within the relatively high standard deviations) for three different uncertainty estimates.
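The two selection strategies being compared can be sketched as follows. This assumes an ensemble-style uncertainty (the spread of predictions from several models), which is only one possible estimate and not necessarily one of the three used in the study; the molecule names and pKa values are made up.

```python
import random
from statistics import pstdev
from typing import Dict, List


def ensemble_uncertainty(predictions: List[float]) -> float:
    """Spread of the ensemble members' predictions for one molecule."""
    return pstdev(predictions)


def select_most_uncertain(pool: Dict[str, List[float]], batch_size: int) -> List[str]:
    """Active learning step: pick the molecules the ensemble disagrees on most."""
    ranked = sorted(pool, key=lambda name: ensemble_uncertainty(pool[name]), reverse=True)
    return ranked[:batch_size]


def select_random(pool: Dict[str, List[float]], batch_size: int, seed: int = 0) -> List[str]:
    """Baseline: pick molecules at random."""
    rng = random.Random(seed)
    return rng.sample(sorted(pool), batch_size)


# Made-up pKa predictions from a three-member ensemble.
pool = {
    "mol_a": [4.1, 4.2, 4.0],
    "mol_b": [7.0, 9.5, 5.5],
    "mol_c": [10.1, 10.4, 9.9],
}
print(select_most_uncertain(pool, 2))
print(select_random(pool, 2))
```

The selected molecules are then labelled, added to the training set, and the model is retrained before the next round.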

The authors show that there are molecules in the pool that can increase the initial accuracy drastically, but that the uncertainties don't seem to help identify these molecules. The green curve above is obtained by exhaustively training a new model for every molecule in the pool during each step of the active learning loop and selecting the molecule that gives the largest increase in accuracy for the test set. Note that the accuracy decreases towards the end, meaning that including some molecules in the training set diminishes the performance.

The authors offer the following explanation for their observations: "We propose that the reason active  learning failed in this pKa prediction task is that all of the molecules are informative."

That's certainly not hard to imagine given the small size of the initial training set (50). It would have been very instructive to see the distribution of uncertainties for the initial models. Does every molecule have roughly the same (high) uncertainty? If so, the uncertainties would indeed not be informative.

Also, uncertainties only correlate with (random) errors on average. The authors did try adding molecules in batches, but the batch size was only 10.

It would have been interesting to see the performance if one used the actual error, rather than the uncertainties, to select molecules. That would test the case where uncertainties correlate perfectly with the errors.

## Tuesday, August 30, 2022

### Is there evidence for exponential quantum advantage in quantum chemistry?

Figure 1 from the paper. (c) 2022 the authors. Reproduced under the CC-BY licence.

Quantum chemical calculations are widely seen as one of quantum computing's killer apps. This paper examines the available evidence for this assertion and doesn't find any.

The potential of quantum computing rests on two assumptions: that the cost of quantum computer calculations on chemical systems scales polynomially with system size, and that the corresponding calculations on classical computers scale exponentially.

The former assumption is true for the actual quantum "computation" and the latter is true for the Full CI solution. However, this paper suggests that preparing the state for the quantum "computation" may scale exponentially with system size. It also argues that we don't need Full CI accuracy, and that chemically accurate methods such as coupled-cluster based methods scale polynomially with system size for a given desired accuracy.

The argument for the potential exponential scaling of state preparation is as follows: if you want the energy of the ground state you have to provide a guess at the ground state wavefunction that resembles the exact wavefunction as much as possible. More precisely, the probability of obtaining the ground state energy scales as $S^2$, where $S$ is the overlap between the trial and exact wavefunction, so the number of repetitions needed scales as $S^{-2}$. The authors show that $S$ decreases exponentially with system size for a series of Fe-S clusters, which suggests an overall exponential dependence for the quantum computations.
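A toy calculation shows why an exponentially shrinking overlap matters. It assumes (my simplification, not the paper's model) that the overlap is a product of identical per-site overlaps, $S = s^N$, and that roughly $S^{-2}$ attempts are needed to observe the ground state energy once. The per-site value of 0.9 is arbitrary.

```python
def overlap(per_site_overlap: float, n_sites: int) -> float:
    """Total overlap S = s^N under the product-state assumption."""
    return per_site_overlap ** n_sites


def expected_repetitions(s: float) -> float:
    """Rough number of attempts needed if each succeeds with probability S^2."""
    return 1.0 / s ** 2


# Even a good per-site overlap leads to rapidly growing cost with system size.
for n_sites in (10, 50, 100):
    s = overlap(0.9, n_sites)
    print(n_sites, s, expected_repetitions(s))
```

Under these assumptions the cost grows as $s^{-2N}$, i.e. exponentially in the number of sites, no matter how cheap each individual attempt is.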

The argument for polynomial scaling of chemically accurate quantum chemistry calculations has two parts: "normal" organic molecules and strongly correlated systems.

The former is pretty straight-forward: no one knowledgeable is really arguing that CCSD(T)-level accuracy is insufficient for ligand-protein binding energies and CCSD(T) scales polynomially with system size. So the simple notion of accelerating drug discovery by computing this with quantum computers does not hold water.

However, CCSD(T) does not work for strongly correlated systems and we don't have any real practical alternative for which we can test the scaling. Instead the authors look at simpler models of strongly correlated systems and demonstrate polynomial scaling with system size.

As the authors are careful to point out, none of this represents a rigorous proof of anything. But it is far from obvious that quantum chemistry is the killer app for quantum computing that most people seem to think it is.

## Sunday, July 31, 2022

### Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization

Figure 1 from the paper. (c) The authors 2022. Reproduced under the CC-BY license.

The development of generative models that can find molecules with certain properties has become very popular but there are very few studies that compare them, so it's hard to know what works best. This study compares the performance of 25 different generative models in 23 different optimisation tasks and draws some very interesting conclusions.

None of these methods find the optimum value given a "budget" of 10,000 oracle evaluations and for some tasks the best performance is not exactly impressive. This doesn't bode well for some real-life applications where even a few hundred property evaluations are challenging.

Some methods are slower to converge than others, so you might draw completely different conclusions regarding efficiency if you allowed 100,000 oracle evaluations. Similarly, some methods have high variability in performance, so you might draw very different conclusions from 1 run compared to 10 runs. This is especially a consideration for problems where you can only afford one run. It might be better to choose a method that performs slightly worse on average but is less variable, rather than risk a bad run from a highly variable method that performs better on average.
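The trade-off between average performance and variability can be made explicit with a simple risk-adjusted score. The criterion below (mean minus one standard deviation) is just one hypothetical choice, and the scores are made up rather than taken from the benchmark.

```python
from statistics import mean, stdev
from typing import Dict, List


def best_by_mean(runs: Dict[str, List[float]]) -> str:
    """Method with the highest average score over its runs."""
    return max(runs, key=lambda name: mean(runs[name]))


def best_risk_adjusted(runs: Dict[str, List[float]], k: float = 1.0) -> str:
    """Method with the highest (mean - k * stdev); penalises erratic methods."""
    return max(runs, key=lambda name: mean(runs[name]) - k * stdev(runs[name]))


# Made-up scores from five independent runs of two methods.
runs = {
    "method_a": [0.90, 0.40, 0.95, 0.35, 0.90],  # better on average, erratic
    "method_b": [0.66, 0.68, 0.64, 0.67, 0.65],  # slightly worse, consistent
}
print(best_by_mean(runs))
print(best_risk_adjusted(runs))
```

If you can only afford a single run, the second ranking is arguably the more relevant one, since a single draw from the erratic method may well land on one of its bad runs.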

The method that performed best overall is one of the oldest methods, published in 2017!

Food for thought

## Wednesday, June 29, 2022

### Deep Learning Metal Complex Properties with Natural Quantum Graphs

Figure 2 from the paper (c) The authors. Reproduced under the CC-BY-NC-ND 4.0 license

While there's been a huge amount of ML work on organic molecules, there has been comparatively little on transition metal complexes (TMCs). One of the reasons is that many of the cheminformatics tools we take for granted are harder to apply to TMCs due to their more complex bonding situations. This makes bond perception and computing node features like formal atomic charges, and hence graph representations, quite tricky. This, in turn, makes standard ML tools like binary fingerprints or graph-convolution NNs tricky to apply to TMCs.

This paper suggests using data from DFT/NBO calculations to create so-called "quantum graphs", where the edges are determined using both bonding orbitals and bond orders, while node and edge features are derived from other NBO properties.
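A stripped-down version of the graph construction might look like the sketch below. It is only meant to illustrate the idea of deriving edges from computed bond orders rather than from cheminformatics bond perception: the threshold, the choice of features, and the numbers are my own assumptions, and the paper's actual criteria (which also involve bonding orbitals) are more involved.

```python
from typing import Dict, List, Tuple


def build_graph(
    natural_charges: List[float],
    bond_orders: Dict[Tuple[int, int], float],
    threshold: float = 0.1,
) -> Tuple[List[dict], List[dict]]:
    """Nodes carry atomic charges; an edge exists where the bond order is large enough."""
    nodes = [{"index": i, "charge": q} for i, q in enumerate(natural_charges)]
    edges = [
        {"atoms": pair, "bond_order": order}
        for pair, order in sorted(bond_orders.items())
        if order >= threshold
    ]
    return nodes, edges


# Made-up values for a metal centre (atom 0) and three neighbouring atoms.
charges = [0.85, -0.60, -0.55, 0.10]
orders = {(0, 1): 0.45, (0, 2): 0.30, (0, 3): 0.02, (1, 3): 0.95}
nodes, edges = build_graph(charges, orders)
print([edge["atoms"] for edge in edges])
```

The resulting node and edge lists are what a graph neural network would consume, with no need to assign formal charges or bond types by hand.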

This representation is combined with two graph-NN methods (MPNN and MXMNet) and trained against DFT properties such as the HOMO-LUMO gap. The results are quite good and generally better than radius graph methods such as SchNet. However, one should keep in mind that both the descriptors and properties are computed with DFT.

Given that the computational cost of the descriptors is basically the same as that of the property of interest, this is a proof-of-concept paper that shows the utility of the general idea. It remains to be seen whether cheaper descriptors (e.g. based on semi-empirical calculations) result in similar performance. Nevertheless, given the current sparsity of ML tools for TMCs, this is a very welcome advance.

## Monday, May 30, 2022

### Computer-designed repurposing of chemical wastes into drugs

Figure 2a from the paper. (c) 2022 the authors

When I talk to people about retrosynthesis prediction they often mention that synthetic chemists don't tend to use these tools. There are many reasons for that, including various shortcomings of the suggested routes, but also the fact that, from a time-saving perspective, retrosynthesis planning makes up a small part of the synthesis process. One common answer to this is "OK, but wait till the robots arrive", but there are several important applications that are applicable right now.

For example, in my own research in de novo molecule discovery I'm often left with hundreds of promising molecules where the only remaining selection criterion is ease of synthesis. Here I routinely use retrosynthesis programs to rank the molecules in terms of number of synthesis steps to make the shortlist of 10-20 molecules that can be presented to experimental collaborators.

This paper presents another example of science that would be impossible without these computational tools. The authors search for reaction networks that connect 189 small-molecule waste by-products from the chemical industry to 4113 high-value molecules (approved drugs and agrochemicals). They use a reaction prediction algorithm called Allchemy to iteratively generate increasingly complicated molecules and, at each step, bias the search towards the target. Among the 300 million molecules that result from this process they were able to identify 167 target molecules, with an average of 216 synthetic paths per target. The synthetic paths are further ranked using a complicated scoring function that accounts for all sorts of practical considerations, since the aim is to produce large quantities of each target, and a few of the paths are experimentally verified on the kg scale.

One interesting part of the approach is the prediction of reaction conditions, which is done in terms of categories: e.g. protic/aprotic and polar/nonpolar solvents, and very low, low, room temperature, high, and very high temperatures. This makes a lot more sense than trying to predict the exact solvent or temperature.
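Turning a continuous quantity into such categories is straightforward, as the sketch below shows for temperature. The category names follow the list above, but the cut-off values are purely illustrative guesses of mine and are not taken from the paper.

```python
def temperature_category(temp_celsius: float) -> str:
    """Map a temperature to a coarse category (cut-offs are hypothetical)."""
    if temp_celsius < -50:
        return "very low"
    if temp_celsius < 10:
        return "low"
    if temp_celsius <= 30:
        return "room temperature"
    if temp_celsius <= 100:
        return "high"
    return "very high"


for temp in (-78, 0, 25, 80, 150):
    print(temp, temperature_category(temp))
```

Predicting one of five labels is a far more forgiving target than predicting an exact number, and it is arguably closer to how a chemist would describe the conditions anyway.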