Saturday, December 30, 2023

Accurate transition state generation with an object-aware equivariant elementary reaction diffusion model

Chenru Duan, Yuanqi Du, Haojun Jia, and Heather J. Kulik (2023)
Highlighted by Jan Jensen

Part of Figure 1 from the paper. 

As anyone who has tried it will know, finding TSs is one of the most difficult, fiddly, and frustrating tasks in computational chemistry. While there are several methods aimed at automating the process, they tend to have a mixed success rate or be computationally expensive and, often, both.

This paper looks to be an important first step in the right direction. The method produces a guess at a TS structure based on the coordinates of the reactants and products. Notably, the input structures need not be aligned or atom mapped! 

The method achieves a median RMSD of 0.08 Å compared to the true TSs, and it is often so good that a single point energy evaluation gives a reliable barrier. The method also includes a confidence scoring model for uncertainty quantification, which allows you to judge a priori whether such a single point is sufficient or whether a TS search is warranted. The approach allows for accurate reaction barrier estimation (2.6 kcal/mol) with DFT optimizations needed for only 14% of the most challenging reactions.
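To make the RMSD figure concrete, here is a minimal sketch of how such a number is typically computed: the two structures are superimposed with the Kabsch algorithm and the root-mean-square deviation is taken over atom pairs. It assumes both structures list their atoms in the same order, which is my simplification for comparing a guess to a reference. The function name `kabsch_rmsd` and the toy coordinates are made up for illustration and are not taken from the paper.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal rigid alignment.

    Assumes row i of P and row i of Q describe the same atom.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centring both structures on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix (Kabsch algorithm).
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Flip the last axis if needed so that R is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))

# Toy usage: a rotated and translated copy of a structure should give RMSD ~ 0.
theta = np.pi / 6
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
ref = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [1.6, 1.2, 0.0], [0.2, 0.9, 0.8]])
guess = ref @ Rz.T + np.array([2.0, -1.0, 0.5])
print(f"RMSD = {kabsch_rmsd(guess, ref):.3f} Å")
```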

So, the method's not going to do away with manual TS searches entirely, but it is going to be invaluable for large scale screening studies. As the authors note, the method can likely also be adapted to the prediction of barrier heights, which could potentially be used to pre-screen  reactions on a much, much bigger scale. 

The paper is an important proof-of-concept study, but the model needs to be trained on much larger data sets (note that it is only trained on C, N, and O containing molecules), which are non-trivial to obtain. But the method could likely be used to obtain these data sets in an iterative fashion.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Thursday, November 30, 2023

Growing strings in a chemical reaction space for searching retrosynthesis pathways

Federico Zipoli, Carlo Baldassari, Matteo Manica, Jannis Born, and Teodoro Laino (2023)
Highlighted by Jan Jensen

Part of Figure 10 from the paper. (c) The authors 2023. Reproduced under the CC-NC-ND license.

Prediction of retrosynthetic reaction trees is typically done by stringing together individual retrosynthetic steps that have the highest predicted confidences. The confidence is typically related to the frequency of the reaction in the training set. This approach has two main problems that this paper addresses. One problem is that "rare" reactions are seldom selected even if they might actually be the most appropriate for a particular problem. The other problem is that you only use local information and therefore miss the "strategical decisions typical of a multi-step synthesis conceived by a human expert".

This paper tries to address these problems by doing the selection of steps differently. The key is to convert the reactions (which are encoded as reaction SMILES) to fingerprints, i.e. numerical representations of the reaction SMILES, and to use these to compute similarity scores.

For example, in the first step you can use the fingerprints to ensure a diverse selection of starting reactions. In subsequent steps, you can concatenate the individual reaction fingerprints (i.e. the growing string) to compute similarities to reaction paths, rather than individual steps. By selecting paths that are most similar to the training data you could incorporate the "strategical decisions typical of a multi-step synthesis conceived by a human expert". Very clever!
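The idea of comparing whole paths rather than single steps can be sketched in a few lines. The example below is only an illustration: the fingerprint is a toy one that I made up (hashed character trigrams of the reaction SMILES), not the learned reaction fingerprint used in the paper, and the names `toy_reaction_fp`, `path_fp` and `N_BITS` are hypothetical. The point is just that concatenating per-step vectors gives a single vector per path, which can then be compared with any standard similarity measure.

```python
import zlib
import numpy as np

N_BITS = 256  # hypothetical fingerprint length per reaction step

def toy_reaction_fp(rxn_smiles, n=3, n_bits=N_BITS):
    """Toy fingerprint: hash character n-grams of a reaction SMILES into counts."""
    fp = np.zeros(n_bits)
    for i in range(len(rxn_smiles) - n + 1):
        fp[zlib.crc32(rxn_smiles[i:i + n].encode()) % n_bits] += 1
    return fp

def path_fp(steps, max_steps=3):
    """Concatenate per-step fingerprints (the 'growing string'), zero-padded."""
    fps = [toy_reaction_fp(s) for s in steps[:max_steps]]
    fps += [np.zeros(N_BITS)] * (max_steps - len(fps))
    return np.concatenate(fps)

def cosine(a, b):
    """Cosine similarity, defined as 0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy usage: two esterifications versus a Suzuki-type coupling.
ester = "CC(=O)O.OCC>>CC(=O)OCC"
acyl_chloride = "CC(=O)Cl.OCC>>CC(=O)OCC"
suzuki = "c1ccccc1Br.OB(O)c1ccccc1>>c1ccc(-c2ccccc2)cc1"

sim_close = cosine(path_fp([ester]), path_fp([acyl_chloride]))
sim_far = cosine(path_fp([ester]), path_fp([suzuki]))
print(f"ester vs acyl chloride: {sim_close:.2f}, ester vs Suzuki: {sim_far:.2f}")
```

With a real reaction fingerprint the same concatenate-and-compare step would be applied to candidate paths and to paths from the training data.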

The main problem is how to show that this approach produces better retrosynthetic predictions. One metric might be shorter paths, and the authors do note this, but I didn't see any data and it's not necessarily the best metric since, for example, important protection/deprotection steps could be missing. The best approach is for synthetic experts to weigh in, but that's hard to do for enough reactions to get good statistics. Perhaps this recent approach would work?

Tuesday, October 31, 2023

Few-Shot Learning for Low-Data Drug Discovery

Daniel Vella and Jean-Paul Ebejer (2023)
Highlighted by Jan Jensen

TOC graphic from the article

This paper is an update and expansion to this seminal paper by Pande and co-workers (you should definitely read both). It compares the ability to distinguish active and inactive compounds for few-shot methods to more conventional approaches for very small datasets. It concludes that the former outperform the latter for some data sets and not for others, which is surprising given that few-shot methods are designed with very small data sets in mind.

Few-shot methods learn a graph-based embedding that minimizes the distance between samples and their respective class prototypes while maximizing the distance between samples and other class prototypes (where prototypes often are the geometric center of a group of molecules). The training set, which is composed of a "query set" that you are trying to match to a "support set", is typically small and changes for each epoch (now called an episode) to avoid overfitting.
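The prototype idea is easy to illustrate once the embeddings exist. The sketch below assumes the embeddings are already given as NumPy arrays (in the real methods they come from a trained graph network, which is not shown), takes each class prototype as the mean of its support embeddings, and assigns each query to the nearest prototype. The function names and the 2D toy numbers are mine.

```python
import numpy as np

def class_prototypes(support_emb, support_labels):
    """Prototype of each class = geometric centre of its support embeddings."""
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify_queries(query_emb, classes, protos):
    """Assign each query to the class whose prototype is nearest (Euclidean)."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]

# Toy usage with made-up 2D embeddings: label 1 = active, 0 = inactive.
support_emb = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
support_labels = np.array([0, 0, 1, 1])
query_emb = np.array([[0.2, 0.1], [4.0, 5.0]])

classes, protos = class_prototypes(support_emb, support_labels)
predicted = classify_queries(query_emb, classes, protos)
print(predicted)
```

During training, the embedding network would be updated so that queries end up close to their own class prototype and far from the other one.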

In this paper, the largest support set was composed of 20 molecules (10 actives and 10 inactives) sampled (together with the query set) from a set of 128 molecules with a 50/50 split of actives and inactives. The performance was then compared to RF and GNN models trained on 20 molecules.

My main takeaway from the paper was actually how well the conventional models performed, especially given that they actually had a smaller training set: the few-shot methods saw all 128 molecules over the course of training, whereas the conventional methods only saw a subset.
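The sampling scheme described above can be sketched as follows. The support set size (10 actives and 10 inactives) and the pool of 128 molecules are from the post; the query set size, the number of episodes and the function name `sample_episode` are assumptions of mine. The sketch also counts how many distinct molecules are seen across episodes, which is the point made about the effective training set size.

```python
import numpy as np

def sample_episode(labels, n_support=10, n_query=5, rng=None):
    """Sample disjoint support and query indices with equal numbers per class."""
    rng = np.random.default_rng() if rng is None else rng
    support, query = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:n_support])
        query.extend(idx[n_support:n_support + n_query])
    return np.array(support), np.array(query)

# Pool of 128 molecules with a 50/50 split of actives (1) and inactives (0).
pool_labels = np.array([1] * 64 + [0] * 64)
rng = np.random.default_rng(0)

seen = set()
for _ in range(200):  # hypothetical number of episodes
    support_idx, query_idx = sample_episode(pool_labels, rng=rng)
    seen.update(support_idx.tolist())
    seen.update(query_idx.tolist())

print(f"support set size: {len(support_idx)}, molecules seen overall: {len(seen)}")
```

A conventional model trained once on a single 20-molecule sample never gets this wider exposure to the pool.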

Saturday, September 30, 2023

Ranking Pareto optimal solutions based on projection free energy

Ryo Tamura, Kei Terayama, Masato Sumita, and Koji Tsuda (2023)
Highlighted by Jan Jensen

Figure 1 from the paper. (c) APS 2023. Reproduced under the CC-BY license.

One of the main challenges in multi-objective optimisation is how to weigh the different objectives to get the desired results. Pareto optimisation can in principle solve this problem, but if you get too many solutions you have to select a subset for testing, which basically involves (manually) weighing the importance of each objective.

This paper proposes a new way to select the potentially most interesting candidates. The idea is basically to identify the most "novel" candidates to maximise the chances of finding "interesting" properties. They do this by identifying points on the Pareto front with the lowest "density of states" for each objective, i.e. points with few examples in property space.
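A rough sketch of this kind of selection is shown below. It is a simplified stand-in and not the authors' projection free energy: I estimate the density along each objective with a plain histogram and score each point by the summed negative log of the fraction of points sharing its bin, so rarer values score higher. All objectives are assumed to be maximised, and the function names and bin count are hypothetical.

```python
import numpy as np

def pareto_mask(Y):
    """Boolean mask of non-dominated rows of Y (all objectives maximised)."""
    Y = np.asarray(Y, dtype=float)
    mask = np.ones(len(Y), dtype=bool)
    for i, y in enumerate(Y):
        dominates_i = np.all(Y >= y, axis=1) & np.any(Y > y, axis=1)
        mask[i] = not dominates_i.any()
    return mask

def rarity_scores(Y, n_bins=10):
    """Sum over objectives of -log(fraction of points in the same histogram bin)."""
    Y = np.asarray(Y, dtype=float)
    score = np.zeros(len(Y))
    for j in range(Y.shape[1]):
        edges = np.linspace(Y[:, j].min(), Y[:, j].max(), n_bins + 1)
        bins = np.digitize(Y[:, j], edges[1:-1])  # bin index 0 .. n_bins - 1
        counts = np.bincount(bins, minlength=n_bins)
        score += -np.log(counts[bins] / len(Y))
    return score

# Toy usage: rank the Pareto-optimal points of a random 2-objective set by rarity.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 2))
front = np.flatnonzero(pareto_mask(Y))
scores = rarity_scores(Y)
ranked_front = front[np.argsort(-scores[front])]
print("Pareto points, most 'novel' first:", ranked_front)
```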

The method is presented as a post hoc selection method, but could also be used as a search criterion to help focus the search on these areas of property space.

Wednesday, August 30, 2023

Accelerated dinuclear palladium catalyst identification through unsupervised machine learning

Julian A. Hueffel, Theresa Sperger, Ignacio Funes-Ardoiz, Jas S. Ward, Kari Rissanen, Franziska Schoenebeck (2021)
Highlighted by Jan Jensen

Figure 1 from the paper. (c) 2021 the authors.

I've been meaning to highlight this paper for years but forgot. However, in the last week k-means clustering came up twice in two completely unrelated contexts, which reminded me of this beautiful paper where the authors managed to use ML to make successful predictions based on only five data points!

Pd catalysts can exist in either a dimer or a monomer form depending on the ligands, and there are no heuristic rules for predicting which form will be favoured by a particular ligand. Even DFT-computed dimerization energies fail to give consistent predictions.

The authors started with a database of 348 ligands, each characterised with 28 different descriptors, which were divided into eight groups by k-means clustering of the descriptors. The four ligands known to favour dimer formation were found in two clusters, with a combined size of 89 ligands. The prediction is thus that these 89 ligands are more likely to favour dimer formation, compared to the other 256.
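The workflow in this paragraph (cluster everything, then keep the clusters that contain the known positives) can be sketched as below. This is a bare-bones Lloyd's k-means written with NumPy, not the authors' code, and it uses a tiny made-up 2D data set instead of 348 ligands with 28 descriptors. In practice you would also standardise the descriptors first; the function names are mine.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster ends up empty.
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

def candidates_from_known(labels, known_idx):
    """Indices sharing a cluster with any known positive, excluding the known ones."""
    hit_clusters = np.unique(labels[known_idx])
    in_hit = np.isin(labels, hit_clusters)
    in_hit[known_idx] = False
    return np.flatnonzero(in_hit)

# Toy usage: two well-separated groups of "ligands", one known dimer former (index 0).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])
labels, _ = kmeans(X, k=2)
print("predicted to behave like ligand 0:", candidates_from_known(labels, [0]))
```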

The authors decided to focus on the 66 ligands in the 89 subset that contain P-C bonds and computed 42 new DFT-computed descriptors that explicitly address dimer formation, such as the dimerization energy. Based on these and the old descriptors the authors grouped the 66 ligands into six clusters, where two of the clusters, with a combined size of 25, contained the four known dimer-ligands. The prediction is thus that the other 21 ligands also should form dimers.

It's a little unclear, but from what I can tell the authors then experimentally tested nine of the 21 ligands, of which seven formed dimers. That's a very good hit rate starting from five data points!

Monday, July 31, 2023

Real-World Molecular Out-Of-Distribution: Specification and Investigation

Prudencio Tossou, Cas Wognum, Michael Craig, Hadrien Mary, Emmanuel Noutahi (2023)
Highlighted by Jan Jensen

Part of Figure 1 from this report

Why do ML models perform much worse on different test sets? There can be many reasons for such a shift in performance, but the main culprit is often a covariate shift, meaning that the training and test sets are quite different. This study seeks to quantify this effect for different molecular representations, ML algorithms, and datasets (both regression and classification).

The authors find that the difference between the test and train error (from a random split) is mostly governed by the representation (as opposed to the ML algorithm). Furthermore, representations that result in shorter distances between molecules (specifically 5-NN distances) on average are the ones that give a smaller difference in error between training and test set. However, those representations do not necessarily result in lower test set errors.

So while you can't use representation distances to pick the representation, you can use them to pick the best splitting method for obtaining your training set. The best test set is the one with the shortest overall representation distance to the deployment set (i.e. the set you want to use your ML model on). The authors find that the best splitting method depends on the representation but is often scaffold splitting.
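As I read it, the recipe above boils down to computing nearest-neighbour distances between sets of molecules in a given representation. Here is a minimal sketch under assumptions of mine: molecules are plain feature vectors, the distance is Euclidean (for fingerprints a Tanimoto-type distance would be more usual), and the candidate test sets and their names are made up. It is not the code or the exact protocol from the report.

```python
import numpy as np

def mean_knn_distance(A, B, k=5):
    """Mean distance from each row of A to its k nearest rows of B."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))
    k = min(k, B.shape[0])
    return float(np.sort(d, axis=1)[:, :k].mean())

def closest_test_set(test_sets, deployment, k=5):
    """Return the name of the test set closest to the deployment set, plus all scores."""
    scores = {name: mean_knn_distance(X, deployment, k) for name, X in test_sets.items()}
    return min(scores, key=scores.get), scores

# Toy usage with synthetic 8-dimensional "representations".
rng = np.random.default_rng(0)
deployment = rng.normal(loc=3.0, size=(50, 8))
test_sets = {
    "random split": rng.normal(loc=0.0, size=(50, 8)),
    "scaffold split": rng.normal(loc=2.5, size=(50, 8)),
}
best, scores = closest_test_set(test_sets, deployment)
print(best, {name: round(s, 2) for name, s in scores.items()})
```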

Thanks to Cas Wognum for a very helpful discussion.

Monday, June 26, 2023

Evolutionary Multiobjective Optimization of Multiligand Metal Complexes in Diverse and Vast Chemical Spaces

Hannes Kneiding, Ainara Nova, David Balcells (2023)
Highlighted by Jan Jensen

Figure 5 from the paper. (c) 2023 the authors. Reproduced under the CC BY ND license

The authors show that an NBO analysis can be used to identify the charges (as well as the coordination modes) of individual ligands in TM-complexes. This is a key property needed to properly characterise the ligands and, thus, the complex as a whole. They have manually checked the approach for 500 compounds and find that it gives reasonable results in 95% of the cases. That number drops to 92% if coordination mode is also considered. They provide these, and many other, properties of 30K ligands extracted from the CSD.

The NBO analysis is based on PBE/TZV//PBE/DZV calculations, which are a bit costly, but it will be interesting to see whether lower levels of theory (e.g. DZV//xTB) give similar results.

Based on this knowledge the authors build a data set of 1.37B square-planar Pd compounds and compute their polarizability and HOMO-LUMO gap. They then search this space for molecules with both large polarizabilities and HOMO-LUMO gaps using a genetic algorithm that optimises the Pareto front, and show that optimum solutions can be found by considering only 1% of the entire space. The GA code is not available yet, but should be released soon.

Tuesday, May 30, 2023

Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability

Adapted from Figures 1 and 3 in the paper. (c) 2023 the authors 

While this fascinating paper is not about chemistry it could easily be applied to chemical problems without further modifications (except for graph convolution), so I feel justified in highlighting it here.

The paper introduces brain-inspired modular training (BIMT), which leads to relatively simple NNs that are easier to interpret. "Brain-inspired" comes from the fact that the brain is not fully connected like most NNs, since it is a 3D entity with physical connections (axons) and longer axons mean slower communication between neurons. The idea is to enforce this modularity during training by assigning positions to individual nodes and introducing a length-dependent penalty in the loss function (in addition to conventional L1 regularisation). This is combined with a swap operation that can swap neurons to decrease the loss.
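The length-dependent penalty is simple to write down. The sketch below is my own minimal version for a single weight matrix: neurons get 2D positions, and each weight is penalised by its absolute value times the distance between the two neurons it connects. The layout (neurons evenly spaced on a line, layers one unit apart), the function names and the value of `lam` are assumptions, and the exact form used in the paper may differ. The toy example also shows why swapping neurons helps: permuting the output neurons leaves the connectivity pattern the same but shortens the "wires".

```python
import numpy as np

def layer_positions(n, y):
    """Place n neurons evenly on x in [0, 1] at height y (a hypothetical layout)."""
    x = np.linspace(0.0, 1.0, n) if n > 1 else np.array([0.5])
    return np.column_stack([x, np.full(n, y)])

def locality_penalty(W, pos_in, pos_out, lam=1e-3):
    """lam * sum_ij |W_ij| * d_ij, where W has shape (n_out, n_in)."""
    d = np.linalg.norm(pos_out[:, None, :] - pos_in[None, :, :], axis=-1)
    return lam * float(np.sum(np.abs(W) * d))

# Toy usage: three inputs wired to three outputs "crosswise".
pos_in = layer_positions(3, y=0.0)
pos_out = layer_positions(3, y=1.0)
W_crossed = np.fliplr(np.eye(3))          # output 0 <- input 2, output 2 <- input 0
W_swapped = W_crossed[[2, 1, 0], :]       # swap output neurons 0 and 2

before = locality_penalty(W_crossed, pos_in, pos_out, lam=1.0)
after = locality_penalty(W_swapped, pos_in, pos_out, lam=1.0)
print(f"penalty before swap: {before:.3f}, after swap: {after:.3f}")
```

In a multi-layer network a swap would have to permute both the incoming and the outgoing weights of the neuron so that the function computed by the network is unchanged.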

The result is much simpler networks that, at least for relatively simple objectives, are intuitive and easier to interpret as you can see from the figure above. 

The code is available here (Google Colab version). It would be very interesting to apply this to chemical problems!

Sunday, April 30, 2023

Virtual Ligand Strategy in Transition Metal Catalysis Toward Highly Efficient Elucidation of Reaction Mechanisms and Computational Catalyst Design

Wataru Matsuoka, Yu Harabuchi, and Satoshi Maeda (2023)
Highlighted by Jan Jensen

This perspective shows how an old computational tool can be adapted to serve a new purpose. When I started in compchem, changing, say, a few F atoms to H atoms in a molecule often made the difference between waiting a few days and a few weeks for the calculations to finish. People therefore developed pseudo H atoms that could mimic the electronic effect of larger atoms or even entire functional groups. Some of these methods were later adapted to serve as boundary atoms in QM/MM calculations and now they have found a new use in screening for ligands in organometallic catalysts.

The use of pseudoatoms to model such ligands not only speeds up the individual calculations but also maps the chemical space onto just two dimensions, electronic and steric, which allows the space to be searched more efficiently. Once the desired combination of electronics and sterics is found, corresponding real ligands are found by another, much faster, screen of commercially available or synthetically accessible ligands.

The authors use this approach to identify two phosphine ligands for a chemoselective Suzuki–Miyaura cross-coupling catalyst, complete with experimental verification.

The downside is that the parameterisation of these "virtual ligands" is a bit involved and very ligand-dependent. But an interesting approach nonetheless.

Wednesday, March 29, 2023

eChem: A Notebook Exploration of Quantum Chemistry

Thomas Fransson, Mickael G. Delcey, Iulia Emilia Brumboiu, Manuel Hodecker, Xin Li, Zilvinas Rinkevicius, Andreas Dreuw, Young Min Rhee, and Patrick Norman (2023)
Highlighted by Jan Jensen

eChem is an e-book that mixes text and code to teach quantum chemistry. The code is based on VeloxChem, which is a Python-based open source quantum chemistry software package. 

While you can use VeloxChem to perform standard quantum chemical calculations, the really cool thing is that it gives you easy access to the basis set integrals and orbitals, DFT grids and functionals, etc. This in turn allows you to write your own SCF or Kohn-Sham-SCF procedure. It's sorta like Szabo and Ostlund updated and taken to the next level.
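To give a flavour of what "write your own SCF" means, here is a minimal closed-shell Hartree-Fock loop in plain NumPy. It deliberately does not call VeloxChem: the core Hamiltonian `h`, overlap `S` and two-electron integrals `eri` (chemists' notation, (pq|rs)) are assumed to be supplied as arrays by whatever integral code you use, and the function names are mine. There is no convergence acceleration (DIIS, damping) and the nuclear repulsion energy is not included, so treat it as a teaching sketch.

```python
import numpy as np

def fock(h, eri, D):
    """Closed-shell Fock matrix F = h + J - K/2 for density matrix D."""
    J = np.einsum("pqrs,rs->pq", eri, D)
    K = np.einsum("prqs,rs->pq", eri, D)
    return h + J - 0.5 * K

def rhf_scf(h, S, eri, n_occ, max_iter=50, tol=1e-10):
    """Return (electronic energy, orbital energies, MO coefficients)."""
    # Symmetric (Loewdin) orthogonalisation: X = S^(-1/2).
    s, U = np.linalg.eigh(S)
    X = U @ np.diag(s ** -0.5) @ U.T
    D = np.zeros_like(h)   # core-Hamiltonian guess: the first Fock matrix is just h
    E_old = np.inf
    for _ in range(max_iter):
        # Solve FC = SCe in the orthogonalised basis.
        eps, C_prime = np.linalg.eigh(X.T @ fock(h, eri, D) @ X)
        C = X @ C_prime
        D = 2.0 * C[:, :n_occ] @ C[:, :n_occ].T
        E = 0.5 * np.sum(D * (h + fock(h, eri, D)))
        if abs(E - E_old) < tol:
            break
        E_old = E
    return float(E), eps, C

# Toy usage: two basis functions, two electrons, electron repulsion switched off,
# so the answer is just twice the lowest eigenvalue of the core Hamiltonian.
h = np.array([[-1.0, -0.5], [-0.5, -1.0]])
S = np.array([[1.0, 0.25], [0.25, 1.0]])
eri = np.zeros((2, 2, 2, 2))
E, eps, C = rhf_scf(h, S, eri, n_occ=1)
print(f"E = {E:.4f}, lowest orbital energy = {eps[0]:.4f}")
```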

If you truly want to understand quantum chemistry this is the way to go! One of the co-authors, Xin Li, very kindly got it working on Google Colab, so it is very easy to start playing around with it yourself. 

Monday, February 27, 2023

Prediction of High-Yielding Single-Step or Cascade Pericyclic Reactions for the Synthesis of Complex Synthetic Targets

Tsuyoshi Mita, Hideaki Takano, Hiroki Hayashi, Wataru Kanna, Yu Harabuchi, K. N. Houk, and Satoshi Maeda (2022)
Highlighted by Jan Jensen

This paper has been on my to-do list for a while, but Derek Lowe beat me to it (again). DFT-based reaction prediction has yet to make an impact on synthesis planning because there are many complexities we have yet to deal with efficiently, such as solvent effects in ionic mechanisms (very hard to predict accurately), catalysts and additives, chirality, and, well, just the sheer size of the reaction space.

While these things will be dealt with in good time, it makes sense to see if there are any low-hanging fruits that can be picked under the current limitations and that still have "real life" applications. And this study did just that, by choosing pericyclic reactions. These are very popular reactions in organic synthesis, but require neither catalysts nor additives and have minimal solvent effects. Furthermore, some use cases of this reaction in natural product synthesis can be very hard to spot, even for seasoned synthetic chemists, and the authors show that their algorithm can predict them a priori. So this could potentially be a useful tool for specific types of synthesis planning.

Monday, January 30, 2023

Machine-Learning-Guided Discovery of Electrochemical Reactions

Andrew F. Zahrt, Yiming Mo, Kakasaheb Y. Nandiwale, Ron Shprints, Esther Heid, and Klavs F. Jensen (2022)
Highlighted by Jan Jensen

Derek Lowe has highlighted the chemical aspects of this work already, so here I focus on the machine learning, which is pretty interesting. The authors want to predict whether a molecule will react with 4-dicyanobenzene anion after it is oxidized at a cathode. They have 141 data points of which 42% show a reaction.

They tested several classification models using Morgan fingerprints as the molecular representation, but got an accuracy of only 60%. They then reasoned that the accuracy could be improved by using DFT features. However, rather than using molecular features they decided to use atomic features from an NBO analysis on the radical cation, neutral, and radical anion forms. The feature vector was then tested on several data sets and shown to perform well.

The question is then how to combine the atomic feature vectors into a molecular representation for the reaction classification. The usual way is graph convolution, but that would require more than 141 data points to optimise. So instead they use graph2vec, which is an unsupervised learning method, so it is easy to create arbitrarily large training sets. Graph2vec is analogous to word2vec (or, more accurately, doc2vec), which creates vector representations of words by predicting context in text (i.e. words that often appear close to the word of interest). For graph2vec the context is subgraphs of the input graph.
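To show what "subgraphs as context" can look like in practice, here is a small sketch of Weisfeiler-Lehman relabelling, which, as far as I understand it, is the kind of procedure graph2vec uses to turn a graph into a bag of rooted-subgraph "words". The sketch only does this extraction step; training the doc2vec-style embedding on the words is not shown. Using element symbols as node labels is my simplification (the paper uses NBO-derived atomic features), and the function name is hypothetical.

```python
def wl_subgraph_words(adjacency, labels, n_iter=2):
    """Return Weisfeiler-Lehman 'words' for a graph.

    adjacency: dict mapping node -> list of neighbouring nodes
    labels:    dict mapping node -> initial string label
    Each iteration extends every node label with the sorted labels of its
    neighbours, so the label describes a rooted subgraph of growing radius.
    """
    current = dict(labels)
    words = list(current.values())
    for _ in range(n_iter):
        updated = {}
        for node, neighbours in adjacency.items():
            neigh = sorted(current[n] for n in neighbours)
            updated[node] = f"{current[node]}({','.join(neigh)})"
        current = updated
        words.extend(current.values())
    return words

# Toy usage: the heavy-atom graph of ethanol, C-C-O.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
for word in wl_subgraph_words(adjacency, labels, n_iter=2):
    print(word)
```

Molecules that share many of these words end up with similar embeddings, in the same way that documents sharing many words do in doc2vec.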

The graph2vec embedder was then trained on 38k molecules (note that this requires 38k DFT calculations). Using this representation, the accuracy for the reaction classifier increased to 74%, which is a significant improvement compared to Morgan fingerprints. The classifier was then applied to the 38k molecules and 824 were predicted to be reactive. Twenty of these were selected for experimental validation and 16 (80%) were shown to be reactive. That's not a bad hit rate!

I was not aware of graph2vec before reading this paper and it seems like a very promising alternative to graph convolution, especially in the low data regime.
