Tuesday, April 30, 2024

Invalid SMILES are beneficial rather than detrimental to chemical language models

Michael A. Skinnider (2024)
Highlighted by Jan Jensen

Figure 3c from the paper. (c) The author. Reproduced under the CC-BY License

Language models (LMs) don't always produce valid SMILES, and while for modern methods the percentage of invalid SMILES tends to be relatively small, much effort has been expended on making it as small as possible. SELFIES was invented to make this percentage zero, since SELFIES is designed to always decode to a valid molecule.

However, several studies have shown that SMILES-based LMs tend to produce molecular distributions that are closer to the training set than SELFIES-based LMs do. This paper has figured out the reason, and it turns out to be both trivial and profound at the same time.

It turns out that the main difference between the molecules produced using SMILES and SELFIES is that the former have a much larger proportion of aromatic atoms. Furthermore, this difference goes away if the SELFIES-based method is allowed to make molecules with pentavalent carbons, which are subsequently discarded when the SELFIES are converted to SMILES.

The reason is that in order to generate a valid SMILES or SELFIES string for an aromatic molecule you have to get the sequence of letters exactly right. If it goes wrong for SMILES the string is discarded, but if it goes wrong for SELFIES it is usually decoded into a valid non-aromatic molecule, i.e. the mistake is kept rather than discarded.

For example, the correct SMILES string for benzene is "c1ccccc1", and generated strings with one more or one less "c" character ("c1cccccc1" and "c1cccc1") are invalid and will be removed. The corresponding SELFIES string for benzene is "[C][=C][C][=C][C][=C][Ring1][=Branch1]", but generated strings with one more or one less [C] character will result in non-aromatic molecules with SMILES strings like "C=C1C=CC=CC1" and "C1=CC=CC=1".
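The SMILES side of this is easy to reproduce; a minimal sketch, assuming the RDKit package is installed, checking which of the three strings above parse to a valid molecule:

```python
from rdkit import Chem

# benzene, benzene with one extra "c", and benzene with one "c" missing
for smi in ["c1ccccc1", "c1cccccc1", "c1cccc1"]:
    mol = Chem.MolFromSmiles(smi)  # returns None for invalid SMILES
    print(smi, "valid" if mol is not None else "invalid")
```

The two mutated strings fail because the odd-sized aromatic carbon rings cannot be kekulized, so the parser rejects them outright.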

There are a lot of ML papers that simply observe what works best, but very few that determine why. This is one of the few, and it is very refreshing!

This work is licensed under a Creative Commons Attribution 4.0 International License.

Sunday, March 31, 2024

An evolutionary algorithm for interpretable molecular representations

Philipp M. Pflüger, Marius Kühnemund, Felix Katzenburg, Herbert Kuchen, and Frank Glorius (2024)
Highlighted by Jan Jensen

Parts of Figures 2 and 6 combined. (c) 2024 Elsevier, Inc

This paper presents a novel approach to XAI that allows for direct comparison with chemical intuition. Molecular fingerprints (either binary or count) are defined using randomly generated SMARTS patterns, and a genetic algorithm (GA) is then used to find the optimal fingerprint of a given length. Here the optimum is defined as the one giving the lowest error when used with CatBoost. The GA search requires many thousands of models, so the approach is not practical for more computationally expensive ML models.
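The fingerprint itself is simple to construct; a minimal RDKit sketch (the SMARTS patterns here are made up for illustration, not taken from the paper):

```python
from rdkit import Chem

# illustrative SMARTS patterns (hypothetical, not from the paper):
# benzene ring, hydroxyl oxygen, trivalent nitrogen
patterns = [Chem.MolFromSmarts(s) for s in ["c1ccccc1", "[OX2H]", "[NX3]"]]

def smarts_fingerprint(smiles, patterns, count=False):
    """Binary or count fingerprint over a list of SMARTS patterns."""
    mol = Chem.MolFromSmiles(smiles)
    if count:
        return [len(mol.GetSubstructMatches(p)) for p in patterns]
    return [int(mol.HasSubstructMatch(p)) for p in patterns]

print(smarts_fingerprint("Oc1ccccc1", patterns))  # phenol: [1, 1, 0]
```

The GA then evolves which patterns make up the fingerprint, and since every candidate fingerprint means retraining the model, a fast learner like CatBoost is essential.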

Nevertheless, the authors show that CatBoost is competitive with more sophisticated ML models even when using FP lengths as low as 256 (or even 32 in some cases). One can then analyse the SMARTS patterns to gain chemical insights. 

Even more interestingly, one can use the approach to compare directly with chemical intuition. The authors did this by asking five groups of chemists to come up with the 16 structural features that best explain the Doyle-Dreher dataset of 3,960 Buchwald-Hartwig cross-coupling yields. ML models based on the corresponding FPs tended to perform worse than the 16-bit FPs found by the GA. However, there were also many similarities between the FPs, indicating that the method can extract features that are in agreement with chemical intuition.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Wednesday, February 28, 2024

AiZynth Impact on Medicinal Chemistry Practice at AstraZeneca

Jason D. Shields, Rachel Howells, Gillian Lamont, Yin Leilei, Andrew Madin, Christopher E. Reimann, Hadi Rezaei, Tristan Reuillon, Bryony Smith, Clare Thomson, Yuting Zheng and Robert E. Ziegler (2024)
Highlighted by Jan Jensen

Figure 3 from this paper (c) the authors 2020. Reproduced under the CC-BY license

This is one of the rare papers where experimental chemists talk candidly about their experiences using ML models developed by others. In this case the model is AiZynthFinder, which is developed at AstraZeneca Gothenburg and predicts retrosynthetic paths, while the users are mostly synthetic chemists at AstraZeneca in the UK, US, and China. The paper is really well written and well worth reading. I'll just include a few quotes below to whet your appetite.

"New users of AI tools in general are often disappointed by the failure of AI to live up to their expectations, and chemists' interaction with AiZynth is no exception. The first molecule that most new users test is one that they have personally synthesised recently, and AiZynthFinder rarely replicates their route exactly. Due in part to our self-imposed requirement to run fast searches, AiZynthFinder often gets close to a good route. Thus, experienced users seek inspiration from AiZynth rather than perfection."

"Common problems include proposals that would lead to undesired regioselectivity, functional group incompatibility, or overgeneralisation of precedented reactions to an inappropriate context."

"Early problems also included protection/deprotection cycles, which had to be intentionally penalised in order to focus AiZynth on productive chemistry. We have found that protecting group strategy is still best decided by the chemist. Thus, the AI proposals discussed in the case studies do not make heavy use of protecting groups, whereas several of the laboratory syntheses do."

This work is licensed under a Creative Commons Attribution 4.0 International License.

Wednesday, January 31, 2024

TS-Tools: Rapid and Automated Localization of Transition States Based on a Textual Reaction SMILES Input

Thijs Stuyver (2024)
Highlighted by Jan Jensen

Figure 2 from the paper. (c) the author 2024 reproduced under the CC-BY-NC-ND licence

This paper caught my eye for several reasons. It's an open source implementation of Maeda's AFIR method, but modified for double-ended TS searches. The setup is completely automated and interfaced to xTB, so it is fast. It's applied to really challenging problems, such as solvent-assisted bimolecular reactions, and uncovers some important shortcomings of the xTB method.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Saturday, December 30, 2023

Accurate transition state generation with an object-aware equivariant elementary reaction diffusion model

Chenru Duan, Yuanqi Du, Haojun Jia, and Heather J. Kulik (2023)
Highlighted by Jan Jensen

Part of Figure 1 from the paper. 

As anyone who has tried it will know, finding TSs is one of the most difficult, fiddly, and frustrating tasks in computational chemistry. While there are several methods aimed at automating the process, they tend to have a mixed success rate or be computationally expensive and, often, both.

This paper looks to be an important first step in the right direction. The method produces a guess at a TS structure based on the coordinates of the reactants and products. Notably, the input structures need not be aligned or atom mapped! 

The method achieves a median RMSD of 0.08 Å compared to the true TSs and is often so good that a single point energy evaluation gives a reliable barrier. The method also includes a confidence scoring model for uncertainty quantification, which allows you to judge a priori whether such a single point is sufficient or whether a TS search is warranted. The approach allows for accurate reaction barrier estimation (2.6 kcal/mol) with DFT optimizations needed for only 14% of the most challenging reactions.

So, the method's not going to do away with manual TS searches entirely, but it is going to be invaluable for large-scale screening studies. As the authors note, the method can likely also be adapted to the prediction of barrier heights, which could potentially be used to pre-screen reactions on a much, much bigger scale.

The paper is an important proof-of-concept study, but the method needs to be trained on much larger data sets (note that it is only trained on molecules containing C, N, and O), which are non-trivial to obtain. However, the method itself could likely be used to obtain these data sets in an iterative fashion.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Thursday, November 30, 2023

Growing strings in a chemical reaction space for searching retrosynthesis pathways

Federico Zipoli, Carlo Baldassari, Matteo Manica, Jannis Born, and Teodoro Laino (2023)
Highlighted by Jan Jensen

Part of Figure 10 from the paper. (c) The authors 2023. Reproduced under the CC BY-NC-ND license

Prediction of retrosynthetic reaction trees is typically done by stringing together the individual retrosynthetic steps that have the highest predicted confidences. The confidence is typically related to the frequency of the reaction in the training set. This approach has two main problems that this paper addresses. One problem is that "rare" reactions are seldom selected even if they might actually be the most appropriate for a particular problem. The other problem is that you only use local information and miss the "strategical decisions typical of a multi-step synthesis conceived by a human expert".

This paper tries to address these problems by doing the selection of steps differently. The key is to convert the reactions (which are encoded as reaction SMILES) to fingerprints, i.e. numerical representations of the reaction SMILES, and use them to compute similarity scores.

For example, in the first step you can use the fingerprint to ensure a diverse selection of reactions with which to start the retrosynthesis. In subsequent steps, you can concatenate the individual reaction fingerprints (i.e. the growing string) to compute similarities to entire reaction paths, rather than individual steps. By selecting paths that are most similar to the training data you can incorporate the "strategical decisions typical of a multi-step synthesis conceived by a human expert". Very clever!
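The concatenate-and-compare idea can be sketched in a few lines; the four-bit step fingerprints below are placeholders, as real reaction fingerprints are far longer:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either

def path_fingerprint(step_fps):
    """Concatenate per-step reaction fingerprints into one path fingerprint."""
    return [bit for fp in step_fps for bit in fp]

# two two-step routes sharing the first step (toy 4-bit step fingerprints)
route_a = path_fingerprint([[1, 0, 1, 0], [0, 1, 1, 0]])
route_b = path_fingerprint([[1, 0, 1, 0], [0, 1, 0, 1]])
print(tanimoto(route_a, route_b))  # 3 shared bits / 5 set bits = 0.6
```

Because the whole path is compared at once, a route is rewarded for resembling a known multi-step strategy, not just for containing individually popular reactions.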

The main problem is how to show that this approach produces better retrosynthetic predictions. One metric might be shorter paths, and the authors do note this, but I didn't see any data, and it's not necessarily the best metric since, for example, important protection/deprotection steps could be missing. The best approach is for synthetic experts to weigh in, but that's hard to do for enough reactions to get good statistics. Perhaps this recent approach would work?

This work is licensed under a Creative Commons Attribution 4.0 International License.

Tuesday, October 31, 2023

Few-Shot Learning for Low-Data Drug Discovery

Daniel Vella and Jean-Paul Ebejer (2023)
Highlighted by Jan Jensen

TOC graphic from the article

This paper is an update and expansion of this seminal paper by Pande and co-workers (you should definitely read both). It compares the ability of few-shot methods to distinguish active from inactive compounds with that of more conventional approaches for very small datasets. It concludes that the former outperform the latter for some data sets but not for others, which is surprising given that few-shot methods are designed with very small data sets in mind.

Few-shot methods learn a graph-based embedding that minimizes the distance between samples and their respective class prototypes while maximizing the distance between samples and the other classes' prototypes (where a prototype is often the geometric center of a group of molecules). The training data are split into a "query set" that you are trying to match to a "support set"; both are typically small and change for each epoch (now called an episode) to avoid overfitting.
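Once the embedding is learned, classification reduces to nearest-prototype assignment; a minimal numpy sketch with toy 2-D embeddings (a real model would embed molecular graphs into a much higher-dimensional space):

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Prototype = mean embedding of each class's support molecules."""
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(queries, protos):
    """Assign each query embedding to its nearest class prototype."""
    classes = sorted(protos)
    dists = np.array([[np.linalg.norm(q - protos[c]) for c in classes]
                      for q in queries])
    return [int(classes[i]) for i in dists.argmin(axis=1)]

# toy embeddings: class 0 (inactive) near the origin, class 1 (active) far away
support = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
labels = np.array([0, 0, 1, 1])
protos = class_prototypes(support, labels)
print(classify(np.array([[0.5, 0.5], [5.0, 5.0]]), protos))  # [0, 1]
```

The learned part is the embedding function itself; the prototype arithmetic above is the easy bit.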

In this paper, the largest support set was composed of 20 molecules (10 actives and 10 inactives) sampled (together with the query set) from a set of 128 molecules with a 50/50 split of actives and inactives. The performance was then compared to RF and GNN models trained on 20 molecules.
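The episodic sampling described above can be sketched as follows (the molecule IDs are placeholders standing in for the 128-molecule pool):

```python
import random

def sample_episode(actives, inactives, k=10, n_query=10):
    """Draw one episode: a balanced 2k support set plus a disjoint query set."""
    random.shuffle(actives)   # note: shuffles the caller's lists in place
    random.shuffle(inactives)
    support = actives[:k] + inactives[:k]
    query = actives[k:k + n_query] + inactives[k:k + n_query]
    return support, query

# placeholder IDs for a 128-molecule pool with a 50/50 active/inactive split
actives = [f"A{i}" for i in range(64)]
inactives = [f"I{i}" for i in range(64)]
support, query = sample_episode(actives, inactives)
print(len(support), len(query))  # 20 20
```

Each episode redraws both sets, which is how the few-shot methods end up seeing the whole pool over the course of training.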

My main takeaway from the paper was actually how well the conventional models performed, especially given that they had a smaller effective training set: the few-shot methods saw all 128 molecules over the course of training, whereas the conventional methods only saw a subset.

This work is licensed under a Creative Commons Attribution 4.0 International License.