Sunday, March 31, 2024

An evolutionary algorithm for interpretable molecular representations

Philipp M. Pflüger, Marius Kühnemund, Felix Katzenburg, Herbert Kuchen, and Frank Glorius (2024)
Highlighted by Jan Jensen

Parts of Figures 2 and 6 combined. (c) 2024 Elsevier, Inc

This paper presents a novel approach to XAI that allows for direct comparison with chemical intuition. Molecular fingerprints (either binary or count) are defined using randomly generated SMARTS patterns, and a genetic algorithm (GA) is then used to find the optimum fingerprint of a given length. Here the optimum is defined as the one giving the lowest error when used with CatBoost. The GA search requires many thousands of models, so the approach is not practical for more computationally expensive ML models.

Nevertheless, the authors show that CatBoost is competitive with more sophisticated ML models even when using FP lengths as low as 256 (or even 32 in some cases). One can then analyse the SMARTS patterns to gain chemical insights. 
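To make the idea concrete, here is a minimal sketch of a GA searching for a short count fingerprint. It is not the authors' implementation: plain substrings stand in for SMARTS patterns, a leave-one-out nearest-neighbour model stands in for CatBoost, the target is synthetic, and all names (POOL, MOLECULES, loo_error, evolve) are hypothetical.

```python
import random

# Stand-ins for randomly generated SMARTS patterns (assumption: plain
# substrings counted in the SMILES text, not real substructure matching).
POOL = ["C", "N", "O", "F", "Cl", "Br", "=", "#", "(", "c", "n", "1"]

MOLECULES = ["CCO", "CCN", "CC(=O)O", "c1ccccc1", "c1ccncc1", "CC#N",
             "OCCO", "NCCN", "CC(=O)N", "ClCCl", "CCBr", "OC(=O)CN",
             "c1ccccc1O", "c1ccccc1N", "FC(F)F", "CCOCC"]

# Synthetic target (not real data): depends only on the O and N counts.
TARGETS = [s.count("O") + 2.0 * s.count("N") for s in MOLECULES]


def fingerprint(smiles, genes):
    """Count fingerprint: one entry per selected pattern."""
    return [smiles.count(POOL[g]) for g in genes]


def loo_error(genes):
    """Leave-one-out mean absolute error of a 1-nearest-neighbour model
    (a cheap stand-in for the CatBoost model used in the paper)."""
    fps = [fingerprint(s, genes) for s in MOLECULES]
    total = 0.0
    for i, fp_i in enumerate(fps):
        nearest = min(
            (k for k in range(len(fps)) if k != i),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(fp_i, fps[k])),
        )
        total += abs(TARGETS[i] - TARGETS[nearest])
    return total / len(fps)


def evolve(fp_length=3, pop_size=20, generations=30, seed=0):
    """Return the best fingerprint found and the best error per generation."""
    rng = random.Random(seed)
    population = [rng.sample(range(len(POOL)), fp_length) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=loo_error)
        history.append(loo_error(population[0]))
        parents = population[: pop_size // 2]  # survivors are kept unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw the child's patterns from both parents.
            child = rng.sample(list(set(a) | set(b)), fp_length)
            # Mutation: occasionally swap in a pattern the child lacks.
            if rng.random() < 0.3:
                unused = [g for g in range(len(POOL)) if g not in child]
                child[rng.randrange(fp_length)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    population.sort(key=loo_error)
    history.append(loo_error(population[0]))
    return population[0], history


best, history = evolve()
print([POOL[g] for g in best], history[0], history[-1])
```

Because the surviving half of each generation is carried over unchanged, the best error can never get worse from one generation to the next. The selected patterns can then be read off directly, which is the interpretability argument of the paper.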

Even more interestingly, one can use the approach to directly compare to chemical intuition. The authors did this by asking five groups of chemists to come up with the 16 most important structural features that explain the Doyle-Dreher dataset of 3,960 Buchwald-Hartwig cross-coupling yields. ML models based on the corresponding FPs tended to perform worse than the 16-bit FPs found by the GA. However, there were also many similarities between the FPs, indicating that the method can extract features that are in agreement with chemical intuition.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Wednesday, February 28, 2024

AiZynth Impact on Medicinal Chemistry Practice at AstraZeneca

Jason D. Shields, Rachel Howells, Gillian Lamont, Yin Leilei, Andrew Madin, Christopher E. Reimann, Hadi Rezaei, Tristan Reuillon, Bryony Smith, Clare Thomson, Yuting Zheng, and Robert E. Ziegler (2024)
Highlighted by Jan Jensen

Figure 3 from this paper (c) the authors 2020. Reproduced under the CC-BY license

This is one of the rare papers where experimental chemists talk candidly about their experiences using ML models developed by others. In this case it is AiZynthFinder, which is developed at AstraZeneca Gothenburg and predicts retrosynthetic paths, while the users are mostly synthetic chemists at AstraZeneca in the UK, US, and China. The paper is really well written and well worth reading. I'll just include a few quotes below to whet your appetite.

"New users of AI tools in general are often disappointed by the failure of AI to live up to their expectations, and chemists' interaction with AiZynth is no exception. The first molecule that most new users test is one that they have personally synthesised recently, and AiZynthFinder rarely replicates their route exactly. Due in part to our self-imposed requirement to run fast searches, AiZynthFinder often gets close to a good route. Thus, experienced users seek inspiration from AiZynth rather than perfection."

"Common problems include proposals that would lead to undesired regioselectivity, functional group incompatibility, or overgeneralisation of precedented reactions to an inappropriate context."

"Early problems also included protection/deprotection cycles, which had to be intentionally penalised in order to focus AiZynth on productive chemistry. We have found that protecting group strategy is still best decided by the chemist. Thus, the AI proposals discussed in the case studies do not make heavy use of protecting groups, whereas several of the laboratory syntheses do."

Wednesday, January 31, 2024

TS-Tools: Rapid and Automated Localization of Transition States Based on a Textual Reaction SMILES Input

Thijs Stuyver (2024)
Highlighted by Jan Jensen

Figure 2 from the paper. (c) the author 2024 reproduced under the CC-BY-NC-ND licence

This paper caught my eye for several reasons. It's an open-source implementation of Maeda's AFIR method, but modified for double-ended TS searches. The setup is completely automated and interfaced to xTB, so it is fast. It's applied to really challenging problems such as solvent-assisted bimolecular reactions and uncovers some important shortcomings of the xTB method.

Saturday, December 30, 2023

Accurate transition state generation with an object-aware equivariant elementary reaction diffusion model

Chenru Duan, Yuanqi Du, Haojun Jia, and Heather J. Kulik (2023)
Highlighted by Jan Jensen

Part of Figure 1 from the paper. 

As anyone who has tried it will know, finding TSs is one of the most difficult, fiddly, and frustrating tasks in computational chemistry. While there are several methods aimed at automating the process, they tend to have a mixed success rate or be computationally expensive and, often, both.

This paper looks to be an important first step in the right direction. The method produces a guess at a TS structure based on the coordinates of the reactants and products. Notably, the input structures need not be aligned or atom mapped! 

The method achieves a median RMSD of 0.08 Å compared to the true TSs, and it is often so good that a single-point energy evaluation gives a reliable barrier. The method also includes a confidence scoring model for uncertainty quantification, which allows you to judge a priori whether such a single point is sufficient or whether a TS search is warranted. The approach allows for accurate reaction barrier estimation (2.6 kcal/mol) with DFT optimizations needed for only 14% of the most challenging reactions.

So, the method's not going to do away with manual TS searches entirely, but it is going to be invaluable for large scale screening studies. As the authors note, the method can likely also be adapted to the prediction of barrier heights, which could potentially be used to pre-screen reactions on a much, much bigger scale.

The paper is an important proof-of-concept study, but the model needs to be trained on much larger data sets (note that it is only trained on C, N, and O containing molecules), which are non-trivial to obtain. But the method could likely be used to obtain these data sets in an iterative fashion.

Thursday, November 30, 2023

Growing strings in a chemical reaction space for searching retrosynthesis pathways

Federico Zipoli, Carlo Baldassari, Matteo Manica, Jannis Born, and Teodoro Laino (2023)
Highlighted by Jan Jensen

Part of Figure 10 from the paper. (c) The authors 2023. Reproduced under the CC-BY-NC-ND license

Prediction of retrosynthetic reaction trees is typically done by stringing together individual retrosynthetic steps that have the highest predicted confidences. The confidence is typically related to the frequency of the reaction in the training set. This approach has two main problems that this paper addresses. One problem is that "rare" reactions are seldom selected even if they might actually be the most appropriate for a particular problem. The other problem is that you only use local information and so miss the "strategical decisions typical of a multi-step synthesis conceived by a human expert".

This paper tries to address these problems by doing the selection of steps differently. The key is to convert the reactions (which are encoded as reaction SMILES) to fingerprints, i.e. numerical representations of the reaction SMILES, and to use these to compute similarity scores.

For example, in the first step you can use the fingerprints to ensure a diverse selection of reactions with which to start the synthesis. In subsequent steps, you can concatenate the individual reaction fingerprints (i.e. the growing string) to compute similarities to reaction paths, rather than individual steps. By selecting paths that are most similar to the training data you could incorporate the "strategical decisions typical of a multi-step synthesis conceived by a human expert". Very clever!
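Here is a rough sketch of the path-similarity idea. The paper uses its own reaction fingerprints; as a stand-in I hash character trigrams of the reaction SMILES into a short count vector and compare with cosine similarity. Both choices, the prefix comparison for paths of different length, and all function names are my assumptions.

```python
import math
import zlib

N_BITS = 64  # fingerprint length per reaction step (arbitrary choice)


def reaction_fingerprint(rxn_smiles, n=3):
    """Hashed count vector of character n-grams (a stand-in fingerprint)."""
    vec = [0.0] * N_BITS
    for i in range(len(rxn_smiles) - n + 1):
        ngram = rxn_smiles[i:i + n].encode()
        vec[zlib.crc32(ngram) % N_BITS] += 1.0  # crc32 is stable across runs
    return vec


def path_fingerprint(steps):
    """Concatenate the per-step fingerprints: the 'growing string'."""
    out = []
    for step in steps:
        out.extend(reaction_fingerprint(step))
    return out


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def path_similarity(partial, reference):
    """Compare a growing path with the same-length prefix of a reference path."""
    if len(reference) < len(partial):
        return 0.0
    prefix = reference[:len(partial)]
    return cosine(path_fingerprint(partial), path_fingerprint(prefix))


# Two toy single-step reactions written as reaction SMILES.
amide = "CC(=O)O.NCC>>CC(=O)NCC"
ester = "CC(=O)O.OCC>>CC(=O)OCC"

print(path_similarity([amide], [amide, ester]))         # identical first step
print(path_similarity([amide, amide], [amide, ester]))  # paths diverge at step 2
```

Ranking candidate next steps by the similarity of the whole extended path to known paths, rather than by the confidence of the single step, is what brings in the non-local information.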

The main problem is how to show that this approach produces better retrosynthetic predictions. One metric might be shorter paths, and the authors do note this, but I didn't see any data, and it's not necessarily the best metric since, for example, important protection/deprotection steps could be missing. The best approach is for synthetic experts to weigh in, but that's hard to do for enough reactions to get good statistics. Perhaps this recent approach would work?

Tuesday, October 31, 2023

Few-Shot Learning for Low-Data Drug Discovery

Daniel Vella and Jean-Paul Ebejer (2023)
Highlighted by Jan Jensen

TOC graphic from the article

This paper is an update and expansion to this seminal paper by Pande and co-workers (you should definitely read both). It compares the ability of few-shot methods and more conventional approaches to distinguish active and inactive compounds for very small datasets. It concludes that the former outperform the latter for some data sets and not for others, which is surprising given that few-shot methods are designed with very small data sets in mind.

Few-shot methods learn a graph-based embedding that minimizes the distance between samples and their respective class prototypes while maximizing the distance between samples and other class prototypes (where prototypes are often the geometric center of a group of molecules). The training set, which is composed of a "query set" that you are trying to match to a "support set", is typically small and changes for each epoch (here called an episode) to avoid overfitting.
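The prototype step can be sketched in a few lines. This is only the classification part for one episode: in the real methods the embedding comes from a trained graph neural network, whereas here I simply draw synthetic 2D "embeddings" for the two classes. The pool size of 64 per class and the support size of 10 per class mirror the numbers in the paper, but the function names and the toy data are my assumptions.

```python
import random


def prototype(vectors):
    """Geometric centre of a group of embeddings."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(query, prototypes):
    """Assign the label of the nearest class prototype."""
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))


def sample_episode(pool, n_support, n_query, rng):
    """Draw a fresh support set and query set for one episode."""
    support, queries = {}, []
    for label, embeddings in pool.items():
        chosen = rng.sample(embeddings, n_support + n_query)
        support[label] = chosen[:n_support]
        queries += [(x, label) for x in chosen[n_support:]]
    return support, queries


def episode_accuracy(pool, rng, n_support=10, n_query=5):
    support, queries = sample_episode(pool, n_support, n_query, rng)
    prototypes = {label: prototype(vs) for label, vs in support.items()}
    hits = sum(classify(x, prototypes) == label for x, label in queries)
    return hits / len(queries)


def make_toy_pool(rng, n_per_class=64):
    """Synthetic, well-separated embeddings (not real molecular data)."""
    centres = {"active": (1.0, 1.0), "inactive": (-1.0, -1.0)}
    return {
        label: [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)] for _ in range(n_per_class)]
        for label, (cx, cy) in centres.items()
    }


rng = random.Random(0)
pool = make_toy_pool(rng)
accuracies = [episode_accuracy(pool, rng) for _ in range(20)]
print(sum(accuracies) / len(accuracies))
```

During training, the loss computed on the query set would be back-propagated into the embedding network; that part is left out here.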

In this paper, the largest support set was composed of 20 molecules (10 actives and 10 inactives) sampled (together with the query set) from a set of 128 molecules with a 50/50 split of actives and inactives. The performance was then compared to RF and GNN models trained on 20 molecules.

My main takeaway from the paper was actually how well the conventional models performed, especially given that they had a smaller training set: the few-shot methods saw all 128 molecules over the course of training, whereas the conventional methods only saw a subset.

Saturday, September 30, 2023

Ranking Pareto optimal solutions based on projection free energy

Ryo Tamura, Kei Terayama, Masato Sumita, and Koji Tsuda (2023)
Highlighted by Jan Jensen

Figure 1 from the paper. (c) APS 2023. Reproduced under the CC-BY license.

One of the main challenges in multi-objective optimisation is how to weigh the different objectives to get the desired results. Pareto optimisation can in principle solve this problem, but if you get too many solutions you have to select a subset for testing, which basically involves (manually) weighing the importance of each objective.

This paper proposes a new way to select the potentially most interesting candidates. The idea is basically to identify the most "novel" candidates to maximise the chances of finding "interesting" properties. They do this by identifying points on the Pareto front with the lowest "density of states" for each objective, i.e. points with few examples in property space.
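As a rough illustration of the selection idea (not the paper's exact definition of the projection free energy), the sketch below finds the Pareto front for objectives that are all maximised, estimates a density for each objective with a simple histogram over all candidates, and ranks the front by the summed negative log density. The binning, the scoring formula, and the toy points are my assumptions.

```python
import math


def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]


def bin_index(value, lo, hi, n_bins):
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def rank_by_rarity(points, n_bins=5):
    """Rank Pareto-optimal points, rarest (lowest density) first."""
    scores = {p: 0.0 for p in pareto_front(points)}
    for k in range(len(points[0])):
        values = [p[k] for p in points]
        lo, hi = min(values), max(values)
        counts = [0] * n_bins
        for v in values:
            counts[bin_index(v, lo, hi, n_bins)] += 1
        for p in scores:
            density = counts[bin_index(p[k], lo, hi, n_bins)] / len(points)
            scores[p] += -math.log(density)  # low density gives a high score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy candidates with two objectives; most cluster near (0, 10).
candidates = [(10.0, 0.0), (0.0, 10.0), (0.5, 9.5), (1.0, 9.0),
              (0.2, 9.0), (0.4, 8.5), (0.3, 8.0)]

for point, score in rank_by_rarity(candidates):
    print(point, round(score, 3))
```

In this toy set the isolated point (10.0, 0.0) comes out on top, because it sits alone in its histogram bin for both objectives, while the other three Pareto-optimal points share crowded bins.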

The method is presented as a post hoc selection method, but could also be used as a search criterion to help focus the search on these areas of property space.
