Thursday, November 30, 2023

Growing strings in a chemical reaction space for searching retrosynthesis pathways

Federico Zipoli, Carlo Baldassari, Matteo Manica, Jannis Born, and Teodoro Laino (2023)
Highlighted by Jan Jensen


Part of Figure 10 from the paper. (c) The authors 2023. Reproduced under the CC-NC-ND

Prediction of retrosynthetic reaction trees are typically done by stringing together individual retrosynthetic steps that have the highest predicted confidences. The confidence is typically related to the frequency of the reaction in the training set. This approach has two main problems that this paper addresses. One problem is that "rare" reactions are seldom selected even if they might actually be the most appropriate for a particular problem. The other problem is that you only use local information and "strategical decisions typical of a multi-step synthesis conceived by a human expert".

This paper tries to address these problems by doing the selection of steps differently. The key is to convert the reaction (which are encoded as reaction SMILES) to a fingerprint, i.e. a numerical representation of the reaction SMILES, and using them to compute similarity scores.

For example, in the first step you can use the fingerprint to ensure a diverse selection of reactions to start the synthesis of. In subsequent steps, you can concatenate the individual reaction fingerprints (i.e. the growing string) to compute similarities to reaction paths, rather than individual steps. By selecting paths that are most similar to the training data you could incorporate the "strategical decisions typical of a multi-step synthesis conceived by a human expert". Very clever!

The main problem is how to show that this approach produces better retrosynthetic predictions. Once metric might be shorter paths and the authors to note this but I didn't see any data and it's not necessarily the best metric since, for example important protection/deprotection steps could be missing. The best approach is for synthetic experts to weigh in, but that's hard to do for enough reactions to get good statistics. Perhaps this recent approach would work?


This work is licensed under a Creative Commons Attribution 4.0 International License.