tag:blogger.com,1999:blog-3298010970074740972024-03-13T02:47:04.574+01:00Computational Chemistry HighlightsImportant recent papers in computational and theoretical chemistry
<br>A free resource for scientists run by scientistsComputational Chemistry Highlightshttp://www.blogger.com/profile/12737582958414627004noreply@blogger.comBlogger451125tag:blogger.com,1999:blog-329801097007474097.post-43657777621311764312024-02-28T13:36:00.001+01:002024-02-28T13:36:54.785+01:00AiZynth Impact on Medicinal Chemistry Practice at AstraZeneca<p><a href="http://doi.org/10.1039/D3MD00651D" target="_blank">Jason D. Shields, Rachel Howells, Gillian Lamont, Yin Leilei, Andrew Madin, Christopher E. Reimann, Hadi Rezaei, Tristan Reuillon, Bryony Smith, Clare Thomson, Yuting Zheng and Robert E. Ziegler (2024)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjckev66uIWcrAxZt-oD9muT-9WFGVo-JdlRcMbPjQWNju7zGBs7IRKGL9ygWxdb4mGmqSuYZynv228d7l8c_ROYBzArsK9g5tFhmdVfb-xOxvsctUALfXxadqiM4bCkxSvR_b0cBNOu8TX2jyxOSH-btdbUBhZ0MPt_UAzP4s9H9cbG8kGrhZ0NT6VsHjP/s1564/Screenshot%202024-02-28%20at%2010.02.26.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1268" data-original-width="1564" height="518" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjckev66uIWcrAxZt-oD9muT-9WFGVo-JdlRcMbPjQWNju7zGBs7IRKGL9ygWxdb4mGmqSuYZynv228d7l8c_ROYBzArsK9g5tFhmdVfb-xOxvsctUALfXxadqiM4bCkxSvR_b0cBNOu8TX2jyxOSH-btdbUBhZ0MPt_UAzP4s9H9cbG8kGrhZ0NT6VsHjP/w640-h518/Screenshot%202024-02-28%20at%2010.02.26.png" width="640" /></a></p><p style="text-align: center;">Figure 3 from <a href="http://doi.org/10.1186/s13321-020-00472-1" target="_blank">this paper</a> (c) the authors 2020. Reproduced under the CC-BY license</p><p>This is one of the rare papers where experimental chemists talk candidly about their experiences using ML models developed by others. 
In this case it is AiZynthFinder, which is developed at AstraZeneca Gothenburg and predicts retrosynthetic paths, while the users are mostly synthetic chemists at AstraZeneca in the UK, US, and China. The paper is really well written and well worth reading. I'll just include a few quotes below to whet your appetite. </p><p>"New users of AI tools in general are often disappointed by the failure of AI to live up to their expectations, and chemists' interaction with AiZynth is no exception. The first molecule that most new users test is one that they have personally synthesised recently, and AiZynthFinder rarely replicates their route exactly. Due in part to our self-imposed requirement to run fast searches, AiZynthFinder often gets close to a good route. Thus, experienced users seek inspiration from AiZynth rather than perfection."</p><p>"Common problems include proposals that would lead to undesired regioselectivity, functional group incompatibility, or overgeneralisation of precedented reactions to an inappropriate context."</p><p>"Early problems also included protection/deprotection cycles, which had to be intentionally penalised in order to focus AiZynth on productive chemistry. We have found that protecting group strategy is still best decided by the chemist. 
Thus, the AI proposals discussed in the case studies do not make heavy use of protecting groups, whereas several of the laboratory syntheses do."</p><p><br /></p><p><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a Creative Commons Attribution 4.0 International License.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-12808455011999870442024-01-31T13:09:00.002+01:002024-01-31T13:09:46.097+01:00TS-Tools: Rapid and Automated Localization of Transition States Based on a Textual Reaction SMILES Input<p><a href="https://doi.org/10.26434/chemrxiv-2024-st2tr" target="_blank">Thijs Stuyver (2024)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBBblV_ya1qvaQISP_8mgDfXCzmARZye9lVkntvGtWk4T0oWdiqsgn3gCxlcRd_SFzoI-clFOAtCdjsLp0dNX6cKZh2H1jW9hscq_n9qAZJmO1C7rdEEfmAUEi2GmexnTE3V1tLeKPYrhsoUrH7XY5KqsiJ1RyERQlwY1nlfcRkrECtOz5LKCjymf_ikoe/s1486/Screenshot%202024-01-31%20at%2012.56.35.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1222" data-original-width="1486" height="526" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBBblV_ya1qvaQISP_8mgDfXCzmARZye9lVkntvGtWk4T0oWdiqsgn3gCxlcRd_SFzoI-clFOAtCdjsLp0dNX6cKZh2H1jW9hscq_n9qAZJmO1C7rdEEfmAUEi2GmexnTE3V1tLeKPYrhsoUrH7XY5KqsiJ1RyERQlwY1nlfcRkrECtOz5LKCjymf_ikoe/w640-h526/Screenshot%202024-01-31%20at%2012.56.35.png" width="640" /></a><div style="text-align: center;">Figure 2 from the paper. (c) the author 2024 reproduced under the CC-BY-NC-ND licence</div><div><br /></div><div>This paper caught my eye for several reasons. 
It's an <a href="https://github.com/chimie-paristech-CTM/TS-tools" target="_blank">open source</a> implementation of Maeda's AFIR method, but modified for double-ended TS searches. The setup is completely automated and interfaced to xTB so it is fast. It's applied to really challenging problems such as solvent assisted bimolecular reactions and uncovers some important shortcomings of the xTB method. <br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a Creative Commons Attribution 4.0 International License.<div class="separator" style="clear: both; text-align: center;"><br /></div><br /></div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-13748795012492778682023-12-30T12:53:00.003+01:002023-12-30T12:53:54.953+01:00Accurate transition state generation with an object-aware equivariant elementary reaction diffusion model<p><a href="https://doi.org/10.1038/s43588-023-00563-7" target="_blank">Chenru Duan, Yuanqi Du, Haojun Jia, and Heather J. 
Kulik (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh77I4Jdkj63-wfz0VvQTbi31q-uR9GnTCp32KXwdtJXD1GWXPD14CqRdUrJgh3bVngP3wTzSuWCO0L_XebxulYBkHlEVq6ZLulCrOgRqybtHhLKjSgm7jWQ85kAkifUldXSFrxg__3UIkeKmYm1_k1qgsAx8cVZHa_dzg45j6JZOe_MfcFYU4g232jn5LU/s1872/Screenshot%202023-12-30%20at%2011.45.31.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="648" data-original-width="1872" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh77I4Jdkj63-wfz0VvQTbi31q-uR9GnTCp32KXwdtJXD1GWXPD14CqRdUrJgh3bVngP3wTzSuWCO0L_XebxulYBkHlEVq6ZLulCrOgRqybtHhLKjSgm7jWQ85kAkifUldXSFrxg__3UIkeKmYm1_k1qgsAx8cVZHa_dzg45j6JZOe_MfcFYU4g232jn5LU/w640-h222/Screenshot%202023-12-30%20at%2011.45.31.png" width="640" /></a></p><p style="text-align: center;">Part of Figure 1 from the paper. </p><p>As anyone who has tried it will know, finding TSs is one of the most difficult, fiddly, and frustrating tasks in computational chemistry. While there are several methods aimed at automating the process, they tend to have a mixed success rate or be computationally expensive and, often, both.</p><p>This paper looks to be an important first step in the right direction. The method produces a guess at a TS structure based on the coordinates of the reactants and products. Notably, the input structures need not be aligned or atom mapped! </p><p>The method achieves a median RMSD of 0.08 Å compared to the true TSs and is often so good that a single point energy evaluation gives a reliable barrier. The method also provides a confidence score for uncertainty quantification, which allows you to judge a priori whether such a single point is sufficient or whether a TS search is warranted. 
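As an aside, RMSD values like these are only meaningful after the two structures have been optimally superimposed. A minimal numpy sketch of the standard Kabsch procedure (my own illustration, not the authors' code):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two conformations (N x 3 coordinate arrays, same atom
    ordering) after removing the optimal translation and rotation."""
    P = P - P.mean(axis=0)                  # centre both structures
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)       # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against improper rotations
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1)))
```

A structure that is merely rotated and translated gives an RMSD of essentially zero, so only genuine geometric differences contribute.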
The approach allows for accurate reaction barrier estimation (2.6 kcal/mol) with DFT optimizations needed for only 14% of the most challenging reactions.</p><p>So, the method's not going to do away with manual TS searches entirely, but it is going to be invaluable for large scale screening studies. As the authors note, the method can likely also be adapted to the prediction of barrier heights, which could potentially be used to pre-screen reactions on a much, much bigger scale. </p><p>The paper is an important proof-of-concept study, but the method needs to be trained on much larger data sets (note that it is only trained on C, N, and O containing molecules), which are non-trivial to obtain. But the method could likely be used to obtain these data sets in an iterative fashion.<br /><br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a Creative Commons Attribution 4.0 International License.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-301682249830874172023-11-30T12:15:00.003+01:002023-11-30T12:15:54.901+01:00Growing strings in a chemical reaction space for searching retrosynthesis pathways<p><a href="https://doi.org/10.26434/chemrxiv-2023-rmkwg" target="_blank">Federico Zipoli, Carlo Baldassari, Matteo Manica, Jannis Born, and Teodoro Laino (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqj9Nyv1PYbqR21jS-I2YjGPeHf6ZwG_65EuhusK62yeDCxC6Z3kZASYg453gWfsW3EPR6aRGzhdFxGLNR_vV_B7IV3BsctXh-ZlCOCsmXmq2DwBPI4JTGDSgDgPPXr7OlSuquUKHFttf6xsM1-rdB33qRdVTskFKBDKPar6uvwGGj3fYZ3Y1Cp8tccYX5/s1432/Screenshot%202023-11-29%20at%2015.26.14.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="724" 
data-original-width="1432" height="324" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqj9Nyv1PYbqR21jS-I2YjGPeHf6ZwG_65EuhusK62yeDCxC6Z3kZASYg453gWfsW3EPR6aRGzhdFxGLNR_vV_B7IV3BsctXh-ZlCOCsmXmq2DwBPI4JTGDSgDgPPXr7OlSuquUKHFttf6xsM1-rdB33qRdVTskFKBDKPar6uvwGGj3fYZ3Y1Cp8tccYX5/w640-h324/Screenshot%202023-11-29%20at%2015.26.14.png" width="640" /></a><br /><div style="text-align: center;">Part of Figure 10 from the paper. (c) The authors 2023. Reproduced under the CC-NC-ND</div><div><br /></div><div>Prediction of retrosynthetic reaction trees is typically done by stringing together individual retrosynthetic steps that have the highest predicted confidences. The confidence is typically related to the frequency of the reaction in the training set. This approach has two main problems that this paper addresses. One problem is that "rare" reactions are seldom selected even if they might actually be the most appropriate for a particular problem. The other problem is that you only use local information and miss the "strategical decisions typical of a multi-step synthesis conceived by a human expert".</div><div><br /></div><div><div>This paper tries to address these problems by doing the selection of steps differently. The key is to convert the reactions (which are encoded as reaction SMILES) to <a href="https://doi.org/10.1038/s42256-020-00284-w" target="_blank">a fingerprint</a>, i.e. a numerical representation of the reaction SMILES, and use them to compute similarity scores.</div><div><br /></div><div>For example, in the first step you can use the fingerprint to ensure a diverse selection of reactions to start the synthesis with. In subsequent steps, you can concatenate the individual reaction fingerprints (i.e. the growing string) to compute similarities to reaction paths, rather than individual steps. 
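In code, the growing-string comparison is just fingerprint concatenation followed by a similarity measure. A toy numpy sketch, with random vectors standing in for the learned reaction fingerprints (everything here is illustrative, not the authors' implementation):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def path_similarity(path_a, path_b):
    """Similarity between two reaction paths of equal length, where each
    path is a list of per-step reaction fingerprints that gets
    concatenated into one "growing string" vector."""
    return cosine(np.concatenate(path_a), np.concatenate(path_b))
```

Two paths that share their early steps score higher than paths that diverge immediately, which is what lets the search reward human-like multi-step strategies.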
By selecting paths that are most similar to the training data you could incorporate the "strategical decisions typical of a multi-step synthesis conceived by a human expert". Very clever!</div><div><br /></div><div>The main problem is how to show that this approach produces better retrosynthetic predictions. One metric might be shorter paths, and the authors do note this, but I didn't see any data; it's also not necessarily the best metric since, for example, important protection/deprotection steps could be missing. The best approach is for synthetic experts to weigh in, but that's hard to do for enough reactions to get good statistics. Perhaps this <a href="https://arxiv.org/abs/2310.19796" target="_blank">recent approach</a> would work?</div><div><br /></div><div><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a Creative Commons Attribution 4.0 International License.</div></div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-7568004006476243552023-10-31T15:42:00.003+01:002023-10-31T15:42:54.561+01:00Few-Shot Learning for Low-Data Drug Discovery<p><a href="https://doi.org/10.1021/acs.jcim.2c00779" target="_blank">Daniel Vella and Jean-Paul Ebejer (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha_40ICsxJCDOtpnqQFj7WDTb02QIqWj9EwEc77QJGKnspnDDRrElJohXGLaUWsyRzQnca9Q29Aw4lngnPhVaHynSr1Gr2pARFsW2_6CQLbuun-MTx5uBR7iP8rR21j0aPyk70spQ092jMPFPc-UVpC3rJ8qh6mXcTsGY_xTv0c-OI9iljTD5poalcVoxP/s558/images_large_ci2c00779_0011.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="558" data-original-width="491" height="640" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEha_40ICsxJCDOtpnqQFj7WDTb02QIqWj9EwEc77QJGKnspnDDRrElJohXGLaUWsyRzQnca9Q29Aw4lngnPhVaHynSr1Gr2pARFsW2_6CQLbuun-MTx5uBR7iP8rR21j0aPyk70spQ092jMPFPc-UVpC3rJ8qh6mXcTsGY_xTv0c-OI9iljTD5poalcVoxP/w564-h640/images_large_ci2c00779_0011.jpeg" width="564" /></a></div><div class="separator" style="clear: both; text-align: center;">TOC graphic from the article</div><p>This paper is an update and expansion to <a href="http://doi.org/10.1021/acscentsci.6b00367" target="_blank">this seminal paper</a> by Pande and co-workers (you should definitely read both). It compares the ability to distinguish active and inactive compounds for few-shot methods to more conventional approaches for very small datasets. It concludes that the former outperform the latter for some data sets and not for others, which is surprising given that few-shot methods are designed with very small data sets in mind.</p><p>Few shot methods learn a graph-based embedding that minimizes the distance between samples and their respective class prototypes while maximizing the distance between samples and other class prototypes (where prototypes often are the geometric center of a group of molecules). The training set, which is composed of a "query set" that you are trying to match to a "support set", is typically small and changes for each epoch (which is now called an episode) to avoid overfitting.</p><p>In this paper, the largest support set was composed of 20 molecules (10 actives and 10 inactives) sampled (together with the query set) from a set of 128 molecules with a 50/50 split of actives and inactives. The performance was then compared to RF and GNN models trained on 20 molecules.</p><p>My main takeaway from the paper was actually how well the conventional models performed. 
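As an aside, the nearest-prototype classification rule fits in a few lines of numpy (a toy illustration of the classification step only, not the paper's learned graph-based embedding):

```python
import numpy as np

def nearest_prototype(query, support_X, support_y):
    """Classify a query by the nearest class prototype, where each
    prototype is the mean (geometric centre) of the support samples
    belonging to that class."""
    classes = np.unique(support_y)
    prototypes = np.array([support_X[support_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(prototypes - query, axis=1)
    return classes[np.argmin(dists)]
```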
This is especially notable given that the conventional models actually had a smaller training set: the few-shot methods saw all 128 molecules over the course of training, whereas the conventional methods only saw a subset.</p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-24946239308064899652023-09-30T11:50:00.001+02:002023-09-30T11:50:12.015+02:00Ranking Pareto optimal solutions based on projection free energy<p><a href="https://doi.org/10.1103/PhysRevMaterials.7.093804" target="_blank">Ryo Tamura, Kei Terayama, Masato Sumita, and Koji Tsuda (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /></p><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhoB-B5GejFh3flJ7WHVJmQm7GTz5jXXoVkf5xroMWSxb1NSya5qTQogrUrmbgh9vSEFLJ2ikVNVAbKMrcFPgxddQAQ1V5pYn5-6QKsB4WjWJu51uGdzfUVYkjrqrGoIe8-NSZj61rYfpPhlI-zRUckWjlE4uQgmBSDUoIaZaLasOwfajdBKrTFxWLhFcL/s906/Screenshot%202023-09-30%20at%2011.31.21.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="822" data-original-width="906" height="363" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhoB-B5GejFh3flJ7WHVJmQm7GTz5jXXoVkf5xroMWSxb1NSya5qTQogrUrmbgh9vSEFLJ2ikVNVAbKMrcFPgxddQAQ1V5pYn5-6QKsB4WjWJu51uGdzfUVYkjrqrGoIe8-NSZj61rYfpPhlI-zRUckWjlE4uQgmBSDUoIaZaLasOwfajdBKrTFxWLhFcL/w400-h363/Screenshot%202023-09-30%20at%2011.31.21.png" width="400" /></a></div><p></p><p style="text-align: center;">Figure 1 from the paper. (c) APS 2023. 
Reproduced under the CC-BY license.</p><p>One of the main challenges in multi-objective optimisation is how to weigh the different objectives to get the desired results. Pareto optimisation can in principle solve this problem, but if you get too many solutions you have to select a subset for testing, which basically involves (manually) weighing the importance of each objective.</p><p>This paper proposes a new way to select the potentially most interesting candidates. The idea is basically to identify the most "novel" candidates to maximise the chances of finding "interesting" properties. They do this by identifying points on the Pareto front with the lowest "density of states" for each objective, i.e. points with few examples in property space.</p><p>The method is presented as a post hoc selection method, but could also be used as a search criterion to help focus the search on these areas of property space. </p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><br style="background-color: white; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13.2px;" /></div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-25792817974174893882023-08-30T14:52:00.003+02:002023-08-30T14:52:48.333+02:00Accelerated dinuclear palladium catalyst identification through unsupervised machine learning<p><a href="http://doi.org/10.1126/science.abj0999" target="_blank">Julian A. Hueffel, Theresa Sperger, Ignacio Funes-Ardoiz, Jas S. 
Ward, Kari Rissanen, Franziska Schoenebeck (2021</a>)<br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHMHgqVFkTpH3rVn4t-gDGyV6eSWyExhI1AGTI5lluqsAzWIvgP0QEVrMBCOoo3n5WBRIu64ExkKMfBO3XxNkriOrvqyhaG76mb0qItHKgJtgVgeMheWT1ZTnhQti_sw5UcSa8Yi9Nf4iERQaEx3Smtc7RMuA5lADO0980mGMR_VCraei-ffE0UCktQjwR/s1388/Screenshot%202023-08-30%20at%2014.08.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1266" data-original-width="1388" height="584" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHMHgqVFkTpH3rVn4t-gDGyV6eSWyExhI1AGTI5lluqsAzWIvgP0QEVrMBCOoo3n5WBRIu64ExkKMfBO3XxNkriOrvqyhaG76mb0qItHKgJtgVgeMheWT1ZTnhQti_sw5UcSa8Yi9Nf4iERQaEx3Smtc7RMuA5lADO0980mGMR_VCraei-ffE0UCktQjwR/w640-h584/Screenshot%202023-08-30%20at%2014.08.10.png" width="640" /></a></p><p></p><div style="text-align: center;">Figure 1 from the paper. (c) 2021 the authors.</div><br />I've been meaning to highlight this paper for years but forgot. However, in the last week k-means clustering came up twice in two completely unrelated contexts, which reminded me of this beautiful paper where the authors managed to use ML to make successful predictions based on only five data points! <p></p><p>Pd catalysts can exist in either a dimer or monomer form depending on the ligands and there are no heuristic rules for predicting what form will be favoured by a particular ligand. Even DFT-computed dimerization energies fail to give consistent predictions.</p><p>The authors started with a database of 348 ligands each characterised with 28 different descriptors, which were divided into eight groups by <a href="https://youtu.be/4b5d3muPQmA?si=Hz6W8KLf7F-WxWlZ" target="_blank">k-means clustering</a> of the descriptors. The four ligands known to favour dimer formation were found in two clusters, with a combined size of 89 ligands. 
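This cluster-then-shortlist strategy is easy to reproduce in outline. A scikit-learn sketch with made-up 2D descriptors (the paper used 28 real ligand descriptors and eight clusters; everything below is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def shortlist_by_cluster(X, known_hit_idx, n_clusters=8, seed=0):
    """Cluster descriptor vectors with k-means and return the indices of
    all ligands that share a cluster with at least one known hit."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    hit_clusters = set(labels[known_hit_idx])
    return np.where(np.isin(labels, list(hit_clusters)))[0]
```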
The prediction is thus that these 89 ligands are more likely to favour dimer formation, compared to the other 256. </p><p>The authors decided to focus on the 66 ligands in the 89 subset that contain P-C bonds and computed 42 new DFT-based descriptors that explicitly address dimer formation, such as the dimerization energy. Based on these and the old descriptors the authors grouped the 66 ligands into six clusters, where two of the clusters, with a combined size of 25, contained the four known dimer-ligands. The prediction is thus that the other 21 ligands should also form dimers.</p><p>It's a little unclear, but from what I can tell the authors then experimentally tested nine of the 21 ligands, of which seven formed dimers. That's a very good hit rate starting from five data points!</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-85222798656279970972023-07-31T14:50:00.001+02:002023-07-31T15:49:59.367+02:00Real-World Molecular Out-Of-Distribution: Specification and Investigation<p><a href="https://doi.org/10.26434/chemrxiv-2023-q11q4-v2" target="_blank">Prudencio Tossou, Cas Wognum, Michael Craig, Hadrien Mary, Emmanuel Noutahi (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuY5B6JsFvFVNWR0P7ljE06ZLvzVYvwNBiIAlQNui3eoElSxuTLCbY2kYfkyjYpJ0z6VaI1EJAWbn8VCUBWbQqPRSL9RDwnOBnB-SFlS-pw35xZTTaj3q9G-qKb9LskARexEPP44XKvIT-bPUTaBQi289KCmIanu-jTwBOuFC9OdMU7kve3SHo8E64pAw3/s556/Screenshot%202023-07-31%20at%2012.43.35.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="389" data-original-width="556" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuY5B6JsFvFVNWR0P7ljE06ZLvzVYvwNBiIAlQNui3eoElSxuTLCbY2kYfkyjYpJ0z6VaI1EJAWbn8VCUBWbQqPRSL9RDwnOBnB-SFlS-pw35xZTTaj3q9G-qKb9LskARexEPP44XKvIT-bPUTaBQi289KCmIanu-jTwBOuFC9OdMU7kve3SHo8E64pAw3/s320/Screenshot%202023-07-31%20at%2012.43.35.png" width="320" /></a></div><div style="text-align: left;"></div><div style="text-align: center;">Part of Figure 1 from <a href="https://vectorinstitute.ai/wp-content/uploads/2021/08/ds_project_report_final_august9.pdf" target="_blank">this report</a></div><p></p><p>Why do ML models perform much worse on different test sets? There can be many reasons for such a shift in performance, but the main culprit is often a covariate shift, meaning that the training and test set are quite different. This study seeks to quantify this effect for different molecular representations, ML algorithms, and datasets (both regression and classification).</p><p>The authors find that the difference between the test and train error (from a random split) is mostly governed by the representation (as opposed to the ML algorithm). Furthermore, representations that result in shorter distances between molecules (specifically 5-NN distances) on average are the ones that give a smaller difference in error between training and test set. However, those representations do not necessarily result in lower test set errors. 
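The 5-NN distance diagnostic is simple to compute for your own representations; a scikit-learn sketch (illustrative, not the authors' exact protocol):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_knn_distance(train_X, test_X, k=5):
    """Mean distance from each test point to its k nearest training
    neighbours -- a simple proxy for train/test covariate shift."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_X)
    dists, _ = nn.kneighbors(test_X)
    return float(dists.mean())
```

A test set drawn from the same distribution as the training set gives a smaller mean 5-NN distance than a shifted one, so the number can be used to compare candidate splits.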
</p><p>So while you can't use representation distances to pick the representation, you can use them to pick the best splitting method for obtaining your training set. The best test set is the one with the shortest overall representation distance to the deployment set (i.e. the set you want to use your ML model on). The authors find that the best splitting method depends on the representation but is often scaffold splitting. </p><p>Thanks to Cas Wognum for a very helpful discussion.</p><p><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-77841528736269378132023-06-26T15:54:00.005+02:002023-06-26T15:54:51.120+02:00Evolutionary Multiobjective Optimization of Multiligand Metal Complexes in Diverse and Vast Chemical Spaces<p><a href="https://doi.org/10.26434/chemrxiv-2023-k3tf2" target="_blank">Hannes Kneiding, Ainara Nova, David Balcells (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGq7GnzlqLVEcAdtLp8fHXmOYLMOa1aJ01waWdxT6siSupLnOv5_MPkZfU0ULrJg0HlAmSP-2p1IevhZwEnOSWSxzKWY6o2mGCYKlWJIkqvBjbNOcHgyzAtVeikJIg3xtWnalHJumbYzn65u6jSW7e9ltD1d46rTtJ1dTT6N18LpPxjZ3C4ipSdDiitx1b/s933/Screenshot%202023-06-26%20at%2015.01.33.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="933" data-original-width="752" height="640" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGq7GnzlqLVEcAdtLp8fHXmOYLMOa1aJ01waWdxT6siSupLnOv5_MPkZfU0ULrJg0HlAmSP-2p1IevhZwEnOSWSxzKWY6o2mGCYKlWJIkqvBjbNOcHgyzAtVeikJIg3xtWnalHJumbYzn65u6jSW7e9ltD1d46rTtJ1dTT6N18LpPxjZ3C4ipSdDiitx1b/w516-h640/Screenshot%202023-06-26%20at%2015.01.33.png" width="516" /></a></div><div style="text-align: center;">Figure 5 from the paper. (c) 2023 the authors. Reproduced under the CC BY ND license</div><p>The authors show that an NBO analysis can be used to identify the charges (as well as their coordination mode) of individual ligands in TM-complexes. This is a key property needed to properly characterise the ligands and, thus, the complex as a whole. They have manually checked the approach for 500 compounds and find that it gives reasonable results in 95% of the cases. That number drops to 92% if coordination mode is also considered. They <a href="https://github.com/hkneiding/tmQMg-L" target="_blank">provide</a> these, and many other, properties of 30K ligands extracted from the CSD.</p><p>The NBO analysis is based on PBE/TZV//PBE/DZV calculations, which are a bit costly, but it will be interesting to see whether lower levels of theory (e.g. DZV//xTB) give similar results.</p><p>Based on this knowledge the authors build a data set of 1.37B square-planar Pd compounds and compute their polarizability and HOMO-LUMO gap. They then search this space for molecules with both large polarizabilities and HOMO-LUMO gaps using a genetic algorithm that optimises the Pareto front, and show that optimum solutions can be found by considering only 1% of the entire space. 
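For context, the Pareto front such a GA maintains is just the set of non-dominated solutions. A brute-force numpy sketch for the case where both objectives (e.g. polarizability and HOMO-LUMO gap) are maximised:

```python
import numpy as np

def pareto_front(points):
    """Indices of non-dominated points when maximising every objective.
    A point is dominated if some other point is at least as good in all
    objectives and strictly better in at least one."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    return keep
```

This O(n²) scan is fine for GA population sizes; dedicated non-dominated sorting is used when populations get large.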
The GA code is not available yet, but should be released soon.</p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-85532135976612566962023-05-30T15:53:00.001+02:002023-05-30T15:53:16.622+02:00Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability<div class="separator" style="clear: both; text-align: left;"><div style="text-align: left;"><a href="https://arxiv.org/abs/2305.08746" style="text-align: left;" target="_blank">Ziming Liu, Eric Gan, Max Tegmark (2023)</a></div><span style="text-align: left;"><div style="text-align: left;">Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></div></span></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUT9rJ3TWkDGjtB1APU7nRzfy2o5nu0GLaiD0UdZOuq9eBKUGXAP8oOOJ38y0BilAQRELbzZN909-tGIY9NC76VT34XLDUxMRC_glpcqpgs_9VDzWxkCTQfsrMAzcMECOfQYHTtTA69E5pgmh7vzIBQCxwDoc65soZJYWAGOMw2EzpAldDaZollPCCNQ/s1160/Screenshot%202023-05-30%20at%2015.21.56.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="777" data-original-width="1160" height="428" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUT9rJ3TWkDGjtB1APU7nRzfy2o5nu0GLaiD0UdZOuq9eBKUGXAP8oOOJ38y0BilAQRELbzZN909-tGIY9NC76VT34XLDUxMRC_glpcqpgs_9VDzWxkCTQfsrMAzcMECOfQYHTtTA69E5pgmh7vzIBQCxwDoc65soZJYWAGOMw2EzpAldDaZollPCCNQ/w640-h428/Screenshot%202023-05-30%20at%2015.21.56.png" width="640" /></a></div><p></p><div style="text-align: center;">Adapted from Figures 1 and 3 in the paper. 
(c) 2023 the authors </div><br />While this fascinating paper is not about chemistry, it could easily be applied to chemical problems without further modifications (except for graph convolution), so I feel justified in highlighting it here.<p></p><p>The paper introduces brain-inspired modular training (BIMT), which leads to relatively simple NNs that are easier to interpret. "Brain-inspired" comes from the fact that the brain is not fully connected like most NNs, since it is a 3D entity with physical connections (axons) and longer axons mean slower communication between neurons. The idea is to enforce this modularity during training by assigning positions to individual nodes and introducing a length-dependent penalty in the loss function (in addition to conventional L1 regularisation). This is combined with a swap operation that can swap neurons to decrease the loss.</p><p>The result is much simpler networks that, at least for relatively simple objectives, are intuitive and easier to interpret as you can see from the figure above. 
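A toy version of the penalty is easy to write down: give each neuron a 1D position and add a length-weighted term on top of plain L1 regularisation (a sketch of the idea only; the hyperparameter names are my own, not the authors'):

```python
import numpy as np

def bimt_penalty(W, pos_in, pos_out, l1=1e-3, length=1e-3):
    """Sketch of a BIMT-style penalty on a weight matrix W (out x in):
    conventional L1 plus an extra cost proportional to the geometric
    length of each connection, given 1D neuron positions."""
    d = np.abs(pos_out[:, None] - pos_in[None, :])  # pairwise connection lengths
    absW = np.abs(W)
    return l1 * absW.sum() + length * (d * absW).sum()
```

Long connections cost more than short ones of the same magnitude, so training is pushed towards spatially local, modular wiring.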
</p><p>The code is available <a href="https://github.com/KindXiaoming/BIMT" target="_blank">here</a> (<a href="https://colab.research.google.com/drive/1hggc5Tae97BORVNdesLcwp9og3SmPtM7?usp=sharing" target="_blank">Google Colab version</a>). It would be very interesting to apply this to chemical problems!</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-54620014134924006512023-04-30T12:29:00.000+02:002023-04-30T12:29:07.860+02:00Virtual Ligand Strategy in Transition Metal Catalysis Toward Highly Efficient Elucidation of Reaction Mechanisms and Computational Catalyst Design<p><a href="https://doi.org/10.1021/acscatal.3c00576" target="_blank">Wataru Matsuoka, Yu Harabuchi, and Satoshi Maeda (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0vQaSSZZ1Z0DNynzA5cJxhKClK6sGu2XIySkPpYjJqvBZyV0PekAF6-9sG4AArNpPh6jxdFuLW4oHC3wmQEWxbaGFXTmHbCxo8dyp_cmba1YE3OzPIJXJtjfd5rSdkJ-1i9Berw9O-wyUBxyNWlbtCx7lAIxO2yr1Odm_yVq9K_KZW2Li0JnTsPEqTQ/s500/cs3c00576_0012.webp" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="263" data-original-width="500" height="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0vQaSSZZ1Z0DNynzA5cJxhKClK6sGu2XIySkPpYjJqvBZyV0PekAF6-9sG4AArNpPh6jxdFuLW4oHC3wmQEWxbaGFXTmHbCxo8dyp_cmba1YE3OzPIJXJtjfd5rSdkJ-1i9Berw9O-wyUBxyNWlbtCx7lAIxO2yr1Odm_yVq9K_KZW2Li0JnTsPEqTQ/w640-h336/cs3c00576_0012.webp" 
width="640" /></a></p><p>This perspective shows how an old computational tool can be adapted to serve a new purpose. When I started in compchem, changing, say, a few F atoms to H atoms in a molecule often made the difference between waiting a few days and a few weeks for the calculations to finish. People therefore developed pseudo H atoms that could mimic the electronic effect of larger atoms or even entire functional groups. Some of these methods were later adapted to serve as boundary atoms in QM/MM calculations, and now they have found a new use in screening for ligands in organometallic catalysts.</p><p>The use of pseudoatoms to model such ligands not only speeds up the individual calculations but also maps the chemical space onto just two dimensions, electronic and steric, which allows the space to be searched more efficiently. Once the desired combination of electronics and sterics is found, the corresponding real ligands are identified by a second, much faster screen of commercially available or synthetically accessible ligands.</p><p>The authors use this approach to identify two phosphine ligands for a chemoselective Suzuki–Miyaura cross-coupling catalyst, complete with experimental verification.</p><p>The downside is that the parameterisation of these "virtual ligands" is a bit involved and very ligand-dependent. 
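The two-step workflow — optimise in the two-dimensional descriptor space, then match to a real ligand — can be sketched as follows. This is my own toy illustration: the descriptor scales, the objective function, and the ligand "library" are all invented for the example.

```python
import numpy as np

# Toy sketch of the virtual-ligand workflow (all numbers and ligand names
# are invented): scan a grid of (electronic, steric) descriptor values,
# pick the best point according to some computed figure of merit, then
# match it to the nearest real ligand in descriptor space.
electronic = np.linspace(-1.0, 1.0, 21)   # e.g. an electron-donation scale
steric = np.linspace(0.0, 1.0, 21)        # e.g. a buried-volume-like scale

def objective(e, s):
    # stand-in for a computed quantity such as a barrier difference;
    # here the optimum is placed at (0.3, 0.6) by construction
    return -(e - 0.3) ** 2 - (s - 0.6) ** 2

E, S = np.meshgrid(electronic, steric, indexing="ij")
scores = objective(E, S)
i, j = np.unravel_index(np.argmax(scores), scores.shape)
best = np.array([electronic[i], steric[j]])   # the optimal "virtual ligand"

# hypothetical library of real ligands with precomputed descriptors
library = {"L1": (0.9, 0.2), "L2": (0.25, 0.65), "L3": (-0.5, 0.9)}
match = min(library, key=lambda n: np.linalg.norm(np.array(library[n]) - best))
print(best, match)
```

Only the cheap pseudoatom calculations run on the grid; the expensive real-ligand calculations are reserved for the final match.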
But an interesting approach nonetheless.</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-57191339127297236552023-03-29T10:29:00.000+02:002023-03-29T10:29:17.169+02:00eChem: A Notebook Exploration of Quantum Chemistry<p><a href="https://doi.org/10.1021/acs.jchemed.2c01103" target="_blank">Thomas Fransson, Mickael G. Delcey, Iulia Emilia Brumboiu, Manuel Hodecker, Xin Li, Zilvinas Rinkevicius, Andreas Dreuw, Young Min Rhee, and Patrick Norman (2023)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-ZyH5G-rE4qWmf_5tEGTOunsTts7EBsMXrY4zr9FruG-ZVSZqf9rCXmNrVJNhUTYSeqgTwGRjnfWQQ_hRB00z-VBpMGGS-LMmbngt8g4raQ6T1SN4nwKYAqQ02tMgjTb1mgjk83Tu9VVW6OCoZglWJ7VCckF-3XoC2X-PIJNhDaad9SFCDXo3BeOqfQ/s500/ed2c01103_0004.webp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="269" data-original-width="500" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-ZyH5G-rE4qWmf_5tEGTOunsTts7EBsMXrY4zr9FruG-ZVSZqf9rCXmNrVJNhUTYSeqgTwGRjnfWQQ_hRB00z-VBpMGGS-LMmbngt8g4raQ6T1SN4nwKYAqQ02tMgjTb1mgjk83Tu9VVW6OCoZglWJ7VCckF-3XoC2X-PIJNhDaad9SFCDXo3BeOqfQ/w640-h344/ed2c01103_0004.webp" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><a href="https://doi.org/10.30746/978-91-988114-0-7" target="_blank">eChem</a> is an e-book that mixes text and code to teach quantum 
chemistry. The code is based on <a href="https://veloxchem.org/docs/intro.html" target="_blank">VeloxChem</a>, which is a Python-based open source quantum chemistry software package. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">While you can use VeloxChem to perform standard quantum chemical calculations, the really cool thing is that it gives you easy access to the <a href="https://kthpanor.github.io/echem/docs/elec_struct/orbitals.html" target="_blank">basis set</a>, <a href="https://kthpanor.github.io/echem/docs/elec_struct/integrals.html" target="_blank">integrals and orbitals</a>, <a href="https://kthpanor.github.io/echem/docs/elec_struct/kernel_int.html" target="_blank">DFT grids and functionals</a>, etc. This in turn allows you to write your own <a href="https://kthpanor.github.io/echem/docs/elec_struct/hf_scf.html" target="_blank">SCF</a> or <a href="https://kthpanor.github.io/echem/docs/elec_struct/dft_scf.html" target="_blank">Kohn-Sham-SCF</a> procedure. It's sorta like <a href="https://www.amazon.com/Modern-Quantum-Chemistry-Introduction-Electronic/dp/0486691861" target="_blank">Szabo and Ostlund</a> updated and taken to the next level. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">If you truly want to understand quantum chemistry this is the way to go! One of the co-authors, Xin Li, very kindly got it <a href="https://colab.research.google.com/drive/1o1IfBPVa0e1VVs4TwqMM4qBECL3xsDFj?usp=sharing" target="_blank">working on Google Colab</a>, so it is very easy to start playing around with it yourself. 
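To give a flavour of what "writing your own SCF" looks like, here is a bare-bones restricted Hartree-Fock loop of my own. The integrals are invented numbers for a two-basis-function, two-electron toy system; in eChem they would of course come from VeloxChem.

```python
import numpy as np

# Bare-bones restricted Hartree-Fock SCF (my own sketch; toy integrals,
# not real ones - in eChem these matrices come from VeloxChem).
S = np.array([[1.0, 0.45], [0.45, 1.0]])            # overlap matrix
Hcore = np.array([[-1.12, -0.96], [-0.96, -1.12]])  # one-electron Hamiltonian

# two-electron integrals (pq|rs) with full 8-fold permutational symmetry
eri = np.zeros((2, 2, 2, 2))
for (p, q, r, s), v in {(0, 0, 0, 0): 0.77, (1, 1, 1, 1): 0.77,
                        (0, 0, 1, 1): 0.57, (0, 1, 0, 1): 0.44,
                        (0, 0, 0, 1): 0.30, (0, 1, 1, 1): 0.30}.items():
    for idx in {(p, q, r, s), (q, p, r, s), (p, q, s, r), (q, p, s, r),
                (r, s, p, q), (s, r, p, q), (r, s, q, p), (s, r, q, p)}:
        eri[idx] = v

w, U = np.linalg.eigh(S)
X = U @ np.diag(w ** -0.5) @ U.T      # symmetric orthogonalisation, S^(-1/2)

D, E = np.zeros((2, 2)), 0.0
for _ in range(50):
    J = np.einsum("pqrs,rs->pq", eri, D)   # Coulomb matrix
    K = np.einsum("prqs,rs->pq", eri, D)   # exchange matrix
    F = Hcore + J - 0.5 * K                # Fock matrix
    eps, Cp = np.linalg.eigh(X @ F @ X)    # solve FC = SCe in the orthogonal basis
    C = X @ Cp
    D_new = 2.0 * C[:, :1] @ C[:, :1].T    # one doubly occupied MO
    E = 0.5 * np.sum(D_new * (Hcore + F))  # electronic energy
    if np.linalg.norm(D_new - D) < 1e-8:   # converged?
        break
    D = D_new
print(E)
```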
</div><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-79945661845290535582023-02-27T15:34:00.001+01:002023-02-27T15:34:47.324+01:00Prediction of High-Yielding Single-Step or Cascade Pericyclic Reactions for the Synthesis of Complex Synthetic Targets<p><a href="https://doi.org/10.1021/jacs.2c09830" target="_blank">Tsuyoshi Mita, Hideaki Takano, Hiroki Hayashi, Wataru Kanna, Yu Harabuchi, K. N. Houk, and Satoshi Maeda (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><p><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw0hSwZVUWiFXJaZzn_ShtWvkGDNJXUpygBzmLajYma3SFj94L57ZUDk0LbCRqVfDKtBNowV5o4yqLdPUYFK4h4uzaPf2uBG4uipqR440lx6bJiEeOQMnsfCDo46IOnh09OxKTtFMyx8IwWcj2pirNUSuoJQ1KUugwV--9gX7aJRLtYCWAYJ2tXb11uA/s500/images_medium_ja2c09830_0015.gif" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="217" data-original-width="500" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw0hSwZVUWiFXJaZzn_ShtWvkGDNJXUpygBzmLajYma3SFj94L57ZUDk0LbCRqVfDKtBNowV5o4yqLdPUYFK4h4uzaPf2uBG4uipqR440lx6bJiEeOQMnsfCDo46IOnh09OxKTtFMyx8IwWcj2pirNUSuoJQ1KUugwV--9gX7aJRLtYCWAYJ2tXb11uA/w640-h278/images_medium_ja2c09830_0015.gif" width="640" /></a></p><p>This paper has been on my to-do list for a while, but <a href="https://www.science.org/content/blog-post/pericyclic-reactions-predicted" target="_blank">Derek Lowe beat me to it</a> (<a href="http://www.compchemhighlights.org/2023/01/machine-learning-guided-discovery-of.html" target="_blank">again</a>). 
DFT-based reaction prediction has yet to make an impact on synthesis planning due to the many complexities we still cannot handle efficiently, such as solvent effects in ionic mechanisms (very hard to predict accurately), catalysts and additives, chirality, and, well, just the sheer size of the reaction space. </p><p>While these things will be dealt with in good time, it makes sense to see if there are any low-hanging fruits that can be picked under the current limitations that still have "real life" applications. And this study did just that, by choosing pericyclic reactions. These are very popular reactions in organic synthesis that require neither catalysts nor additives and have minimal solvent effects. Furthermore, some uses of these reactions in natural product synthesis can be very hard to spot, even for seasoned synthetic chemists, and the authors show that their algorithm can predict them <i>a priori</i>. So this could potentially be a useful tool for specific types of synthesis planning.<br /><br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-20395501652381923812023-01-30T11:06:00.001+01:002023-01-30T11:06:42.991+01:00Machine-Learning-Guided Discovery of Electrochemical Reactions<p>Andrew F. Zahrt, Yiming Mo, Kakasaheb Y. Nandiwale, Ron Shprints, Esther Heid, and Klavs F. 
Jensen (2022)<br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfvNkdahV2tEYdkzwgh1yCALoReswx_WpkUTZMc52hbd7TRTl6Q_MAjlEz9g0BYgKpwzlqtZVo4VLqiZC449FA2aWEgFnZJNvZ7lsYV4DL6vselEbyIsBg4LF829KldPUL8LOy1ZjznuFX2yaSTfXP03U5svcSMnhG7ock5tsOrbaq34nM0zDgIYMQWg/s500/ja2c08997_0011.webp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="281" data-original-width="500" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfvNkdahV2tEYdkzwgh1yCALoReswx_WpkUTZMc52hbd7TRTl6Q_MAjlEz9g0BYgKpwzlqtZVo4VLqiZC449FA2aWEgFnZJNvZ7lsYV4DL6vselEbyIsBg4LF829KldPUL8LOy1ZjznuFX2yaSTfXP03U5svcSMnhG7ock5tsOrbaq34nM0zDgIYMQWg/w640-h360/ja2c08997_0011.webp" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;"><a href="https://www.science.org/content/blog-post/searching-wilderness-new-chemistry" target="_blank">Derek Lowe has highlighted the chemical aspects of this work already</a>, so here I focus on the machine learning, which is pretty interesting. The authors want to predict whether a molecule will react with 4-dicyanobenzene anion after it is oxidized at a cathode. They have 141 data points of which 42% show a reaction.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">They tested several classification models using Morgan fingerprints as the molecular representation, but got an accuracy of only 60%. They then reasoned that the accuracy could be improved by using DFT features. However, rather than using molecular features they decided to use atomic features from an NBO analysis on the radical cation, the neutral molecule, and the radical anion. 
The feature vector was then tested on several data sets and shown to perform well.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The question is then how to combine the atomic feature vectors to a molecular representation for the reaction classification. The usual way is graph convolution but that'll require more than 141 data points to optimise. So instead they use <a href="https://github.com/benedekrozemberczki/graph2vec" target="_blank">graph2vec</a>, which is an unsupervised learning method so it is easy to create arbitrarily large training sets. Graph2vec is analogous to word2vec (or, more accurately, doc2vec) which creates vector representations of words by predicting context in text (i.e. words that often appear close to the word of interest). For graph2vec the context is subgraphs of the input graph. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">The graph2vec embedder was then trained on 38k molecules (note that this requires 38k DFT calculations). Using this representation, the accuracy for the reaction classifier increased to 74%, which is a significant improvement compared to Morgan fingerprints. The classifier was then applied to the 38k molecules and 824 were predicted to be reactive. Twenty of these were selected for experimental validation and 16 (80%) were shown to be reactive. 
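As an aside, the "words" that graph2vec builds its vocabulary from are rooted subgraphs obtained by iterative Weisfeiler-Lehman relabelling. A minimal sketch of that extraction step (my own illustration; the real implementation hashes the labels and feeds the resulting "documents" to a doc2vec model):

```python
# My own minimal sketch of graph2vec's "vocabulary": each graph becomes a
# document whose words are Weisfeiler-Lehman subtree labels (real graph2vec
# hashes these strings before training the doc2vec-style embedder).
def wl_words(adj, labels, iterations=2):
    """adj: node -> list of neighbours; labels: node -> initial label."""
    words = list(labels.values())      # depth-0 words: the node labels
    current = dict(labels)
    for _ in range(iterations):
        # relabel each node with its label plus the sorted neighbour labels
        current = {v: current[v] + "(" + ",".join(sorted(current[u] for u in adj[v])) + ")"
                   for v in adj}
        words.extend(current.values()) # deeper rooted-subtree words
    return words

# methane-like toy graph: a carbon bonded to four hydrogens
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
labels = {0: "C", 1: "H", 2: "H", 3: "H", 4: "H"}
print(wl_words(adj, labels, iterations=1))
```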
That's not a bad hit rate!</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">I was not aware of graph2vec before reading this paper and it seems like a very promising alternative to graph convolution, especially in the low data regime.</div><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-23058544272362176052022-12-30T12:30:00.002+01:002023-02-01T07:38:25.255+01:00On the potentially transformative role of auxiliary-field quantum Monte Carlo in quantum chemistry: A highly accurate method for transition metals and beyond<p><a href="https://doi.org/10.26434/chemrxiv-2022-cw19h" target="_blank">James Shee, John L. Weber, David R. Reichman, Richard A. 
Friesner, and Shiwei Zhang (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCiiJdnKbsfSzFWDz2nIu5AB3M5wq4HprsIYwe8DNsrLYURS9JgrLwmLjeRS391EISaErHYSeaC8ECEG09XDk9AxQRh-zHeWjBNOVhvkagmoMyIYpkfmBpeGKUpuqYgpaGO1B9x5znBvg_k7fiBrU5iy6jY81TtrTX3Xb_autFaGLfVVNqYT-AU2YJrA/s2124/Screenshot%202022-12-30%20at%2011.21.04.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="956" data-original-width="2124" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCiiJdnKbsfSzFWDz2nIu5AB3M5wq4HprsIYwe8DNsrLYURS9JgrLwmLjeRS391EISaErHYSeaC8ECEG09XDk9AxQRh-zHeWjBNOVhvkagmoMyIYpkfmBpeGKUpuqYgpaGO1B9x5znBvg_k7fiBrU5iy6jY81TtrTX3Xb_autFaGLfVVNqYT-AU2YJrA/w640-h288/Screenshot%202022-12-30%20at%2011.21.04.png" width="640" /></a></div><div style="text-align: center;">Figure 1 from <a href="https://arxiv.org/abs/1711.02242" target="_blank">this paper</a>. (c) the authors</div><p></p><div>This paper highlights a big problem in the field of quantum chemistry and posits that a solution may be right around the corner. The problem is that we still can't routinely predict the thermochemistry of TM-containing compounds with the same degree of accuracy as we can for organic molecules. The main reason is that the former systems often have a high-degree of non-dynamic correlation which means that our CCSD(T) often does not give reliable results. We can model the non-dynamic correlation with CASSCF, but there is no good way to compute the dynamic correlation based on a CASSCF wavefunction. 
So when different DFT functionals give wildly different predictions for your TM compound, there is no way to tell which method, if any, is the best.</div><div><br /></div><div>This paper argues that phaseless auxiliary-field quantum Monte Carlo (ph-AFQMC) may be the solution to this problem. ph-AFQMC represents the ground state as a stochastic linear combination of Slater determinants mapped as open-ended random walks starting from a trial wavefunction. The method accounts for both non-dynamic and dynamic correlation and the paper argues that chemical accuracy can be achieved with a few hundred random walks, which can be run in parallel and on GPUs.</div><div><br /></div><div>So what's missing? According to the authors some of the improvements needed include: more efficient ways of reaching the CBS limit, more efficient random walks, and a general, automatable protocol to generate optimal trial wave functions. Let's hope these improvements will be made soon, so we can explore a much larger portion of chemical space with confidence.</div><div><br /></div><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-42987183946678136472022-11-30T14:39:00.001+01:002022-11-30T14:39:43.956+01:00Quantum Chemical Data Generation as Fill-In for Reliability Enhancement of Machine-Learning Reaction and Retrosynthesis Planning<p><a href="https://doi.org/10.26434/chemrxiv-2022-gd0q9" target="_blank">Alessandra Toniato, Jan P. Unsleber, Alain C. 
Vaucher, Thomas Weymuth, Daniel Probst, Teodoro Laino, and Markus Reiher (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWSukLbGuOs0xqHQSjGmPI4tGHJIHST05noK5M88zJyDRx1ju2ArbR3rLKNq45gUEB3HftoOEhUDsekIWbJ5TgHzW5fm5D-y9iG6FNNa-d7PJG9W6zayVX5BEFVkdqfoAubt7ChyMYdmiRYtLX24rIG4Fh-Ql0R3TgwyRXK1p1tA8FrrNTAAOKyU7n3A/s1920/Screenshot%202022-11-30%20at%2014.09.13.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="902" data-original-width="1920" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWSukLbGuOs0xqHQSjGmPI4tGHJIHST05noK5M88zJyDRx1ju2ArbR3rLKNq45gUEB3HftoOEhUDsekIWbJ5TgHzW5fm5D-y9iG6FNNa-d7PJG9W6zayVX5BEFVkdqfoAubt7ChyMYdmiRYtLX24rIG4Fh-Ql0R3TgwyRXK1p1tA8FrrNTAAOKyU7n3A/w640-h300/Screenshot%202022-11-30%20at%2014.09.13.png" width="640" /></a></div><div style="text-align: center;">Part of Figure 7 from the paper. (c) The authors 2022. Reproduced under the CC BY NC ND 4.0 license</div><p></p><p>This is the first paper I have seen on combining automated QM-reaction prediction with ML-based retrosynthesis prediction. The idea itself is simple: for ML-predictions with low confidence (i.e. few examples in the training data) can automated QM-reaction prediction be used to check whether the proposed reaction is feasible, i.e. whether it is the reaction path with the lowest barrier? If so, it could also be used to augment the training data.</p><p>The paper considers two examples using the <a href="https://arxiv.org/abs/2202.13011" target="_blank">Chemoton 2.0 method</a>: one where the reaction is an elementary reaction and one where there are two steps (the Friedel-Crafts reaction shown above). 
It works pretty well for the former, but runs into problems for the latter.</p><p>One problem for non-elementary reactions is that one can't predict which atoms are chemically active from the overall reaction. Chemoton must therefore consider reactions involving all atom pairs, and possibly several pairs of atoms simultaneously. The number of required calculations quickly gets out of hand, and the authors conclude that "For such multistep reactions, new methods to identify the individual elementary steps will have to be developed to maintain the exploration within tight bounds, and hence, within reasonable computing time." </p><p>However, even when they specify the two elementary steps for the Friedel-Crafts reaction, their method fails to find the second elementary step. The reason for this failure is not clear but could be due to the semiempirical xTB method used for efficiency.</p><p>So the paper presents an interesting and important challenge to the computational chemistry community. I wish more papers did this.</p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-27185266137508626412022-10-31T15:13:00.001+01:002022-10-31T15:13:31.350+01:00Semiempirical Hamiltonians learned from data can have accuracy comparable to Density Functional Theory<p><a href="https://arxiv.org/abs/2210.11682" target="_blank">Frank Hu, Francis He, David J. 
Yaron (2022)</a> <br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1Ts6CLH06mzF3UvSzxtwAhfHW9Dc_HfbTqC90UkfyN_y3zV52fUSYWebLl2p6l_CRCQkPy6AK1UWY7nTnRU9_KoRQjPtYJCddxz8G92pVlgKWNraJNVOkKcTB6iYiXhQMn5F5PvIaLfKCjG7wRTxtXYV2qn8de-tenFK6Yo3IYoxGdOB5k2nQUQ5PLw/s1282/Screenshot%202022-10-31%20at%2014.37.29.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="450" data-original-width="1282" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1Ts6CLH06mzF3UvSzxtwAhfHW9Dc_HfbTqC90UkfyN_y3zV52fUSYWebLl2p6l_CRCQkPy6AK1UWY7nTnRU9_KoRQjPtYJCddxz8G92pVlgKWNraJNVOkKcTB6iYiXhQMn5F5PvIaLfKCjG7wRTxtXYV2qn8de-tenFK6Yo3IYoxGdOB5k2nQUQ5PLw/w640-h224/Screenshot%202022-10-31%20at%2014.37.29.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;">Figure 7 from the paper. (c) The authors 2022. Reproduced under the BY-NC-ND licence</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">This paper uses ML techniques and algorithms (specifically PyTorch) to fit DFTB parameters, which results in a semiempirical quantum method (SQM) that has an accuracy similar to DFT. The advantage of such a physics-based method over a pure ML-based is that it is likely to be more transferable and requires much less training data. 
This should make it much easier to extend to other elements and new molecular properties, such as barriers.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Parameterising SQMs is notoriously difficult as the molecular properties depend exponentially on many of the parameters. As a result, most SQMs used today have been parameterised by hand. The paper presents several methodological tricks to automate the fitting.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">One is the use of high-order polynomial spline functions to describe how the Hamiltonian elements depend on the fitting parameters. The functions allow the computation not only of the first derivative needed for back propagation, but also of high-order derivatives, which are used for regularisation to avoid overfitting and to keep the parameters physically reasonable. Finally, the SCF and training loops are inverted so that the charge fluctuations needed for the Fock operator are updated based on the current model parameters every 10 epochs. This enables computationally efficient back propagation during training, which is important because the training set is on the order of 100k.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">Another neat feature is that the final model is simply a parameter file (SKF file), which can be read by most DFTB programs. So there is nothing new for the user to implement. 
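The derivative-based regularisation trick mentioned above is easy to sketch. This is my own toy illustration, using a plain polynomial instead of the paper's splines and made-up reference data:

```python
import numpy as np

# Toy sketch of the regularisation idea (my own illustration, not the
# paper's code): model a DFTB Hamiltonian element H(r) as a polynomial,
# fit it to reference data, and penalise large high-order derivatives to
# keep the curve smooth and physically reasonable.
rng = np.random.default_rng(0)
r = np.linspace(1.0, 4.0, 40)                         # interatomic distances
target = np.exp(-r) + 0.02 * rng.normal(size=r.size)  # noisy reference values

def loss(coeffs, lam=1e-3):
    poly = np.poly1d(coeffs)
    mse = np.mean((poly(r) - target) ** 2)  # fit to the reference data
    d3 = np.polyder(poly, 3)                # third derivative, in closed form
    return mse + lam * np.mean(d3(r) ** 2)  # high-order derivative penalty

coeffs = np.polyfit(r, target, 5)           # unregularised starting point
print(loss(coeffs, lam=0.0), loss(coeffs, lam=1e-3))
```

In the actual work the analogous penalty enters a PyTorch loss so the parameters themselves are optimised by back propagation; here the point is just that a polynomial's high-order derivatives are available in closed form.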
However, currently the implementation is only for CNHO.</div><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<p></p>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-33298346160935488832022-09-30T10:49:00.003+02:002022-09-30T10:49:54.732+02:00Active Learning for Small Molecule pKa Regression; a Long Way To Go<p><a href="https://doi.org/10.26434/chemrxiv-2022-8w1q0" target="_blank">Paul G. Francoeur, Daniel Peñaherrera, and David R. Koes (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYlrycYY61sJi1PeDM9M14YQx3fBE6_UvzqBVVgrRpR5FqatvBC12LQMAFu5MKH1NkdwCNK1npAnOc8suX_IMzwfzu007kc4NDStOVq6O0ohvZaptW5UdIyK4K35aVv6zFg7CCN1tKp3Ga0IlWJhk0yvCycVE6c3pULxkMHcHVIV5PjAX4M7UR02e2VA/s801/Screenshot%202022-09-29%20at%2014.27.58.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="477" data-original-width="801" height="382" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYlrycYY61sJi1PeDM9M14YQx3fBE6_UvzqBVVgrRpR5FqatvBC12LQMAFu5MKH1NkdwCNK1npAnOc8suX_IMzwfzu007kc4NDStOVq6O0ohvZaptW5UdIyK4K35aVv6zFg7CCN1tKp3Ga0IlWJhk0yvCycVE6c3pULxkMHcHVIV5PjAX4M7UR02e2VA/w640-h382/Screenshot%202022-09-29%20at%2014.27.58.png" width="640" /></a></div><p style="text-align: center;">Parts of Figures 5 and 6. (c) The authors 2022. Reproduced under the CC-BY licence</p><p>One approach to active learning is to grow the training set with molecules for which the current model has the highest uncertainties. 
However, according to this study, this approach does not seem to work for small-molecule pKa prediction, where active learning and random selection give the same results (within the relatively high standard deviations) for three different uncertainty estimates. </p><p>The authors show that there are molecules in the pool that can increase the initial accuracy drastically, but that the uncertainties don't seem to help identify these molecules. The green curve above is obtained by exhaustively training a new model for every molecule in the pool during each step of the active learning loop and selecting the molecule that gives the largest increase in accuracy for the test set. Note that the accuracy decreases towards the end, meaning that including some molecules in the training set diminishes the performance.</p><p>The authors offer the following explanation for their observations: "We propose that the reason active learning failed in this pKa prediction task is that all of the molecules are informative."</p><p>That's certainly not hard to imagine given the small size of the initial training set (50). It would have been very instructive to see the distribution of uncertainties for the initial models. Does every molecule have roughly the same (high) uncertainty? If so, the uncertainties would indeed not be informative. </p><p>Also, uncertainties only correlate with (random) errors on average. The authors did try adding molecules in batches, but the batch size was only 10. </p><p>It would have been interesting to see the performance if one used the actual error, rather than the uncertainties, to select molecules. 
That would test the case where uncertainties correlate perfectly with the errors.</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-34378537612586468882022-08-30T14:29:00.003+02:002022-09-12T09:49:33.188+02:00Is there evidence for exponential quantum advantage in quantum chemistry?<p><a href="https://arxiv.org/abs/2208.02199" target="_blank">Seunghoon Lee, Joonho Lee, Huanchen Zhai, Yu Tong, Alexander M. Dalzell, Ashutosh Kumar, Phillip Helms, Johnnie Gray, Zhi-Hao Cui, Wenyuan Liu, Michael Kastoryano, Ryan Babbush, John Preskill, David R. Reichman, Earl T. Campbell, Edward F. Valeev, Lin Lin, Garnet Kin-Lic Chan (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /></p><div style="text-align: left;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh28nPfndd20bRDQjQ3jFrywFux_jeuZ4DqCysq4so9Ed6rqEWerRRIQhVn56NeXEqShwG5PGGFDkMJjgSpQ0B86wN151qn4oET-E1BvWZAPeGQ-crRzPwxwjxkNb3vQJlfnxYlpYmm8eDPfDcdMlF1Sle3J0pp4Dqu0_rnSmSJPaSVvwNjRucUe2HBYA/s652/Screenshot%202022-08-30%20at%2013.36.02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="349" data-original-width="652" height="342" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh28nPfndd20bRDQjQ3jFrywFux_jeuZ4DqCysq4so9Ed6rqEWerRRIQhVn56NeXEqShwG5PGGFDkMJjgSpQ0B86wN151qn4oET-E1BvWZAPeGQ-crRzPwxwjxkNb3vQJlfnxYlpYmm8eDPfDcdMlF1Sle3J0pp4Dqu0_rnSmSJPaSVvwNjRucUe2HBYA/w640-h342/Screenshot%202022-08-30%20at%2013.36.02.png" width="640" /></a></div><p></p><p style="text-align: center;">Figure 1 from the 
paper. (c) 2022 the authors. Reproduced under the CC-BY licence.</p><p><a href="https://cen.acs.org/articles/95/i43/Chemistry-quantum-computings-killer-app.html" target="_blank">Quantum chemical calculations are widely seen as one of quantum computing's killer apps.</a> This paper examines the available evidence for this assertion and doesn't find any. </p><p>The potential of quantum computing rests on two assumptions: that the cost of quantum computer calculations on chemical systems scales polynomially with system size, while the corresponding calculations on classical computers scale exponentially. </p><p>The former assumption is true for the actual quantum "computation" and the latter assumption is true for the Full CI solution. However, this paper suggests that preparing the state for the quantum "computation" may scale exponentially with system size, that we don't need Full CI accuracy, and that chemically accurate coupled-cluster-based methods scale polynomially with system size for a given desired accuracy.</p><p>The argument for the potential exponential scaling for system preparation is as follows: If you want the energy of the ground state you have to provide a guess at the ground state wavefunction that resembles the exact wavefunction as much as possible. More precisely, the probability of obtaining the ground state energy scales as $S^{-2}$, where S is the overlap between the trial and exact wavefunction. The authors show that $S$ scales exponentially with system size for a series of Fe-S clusters, which suggests an overall exponential dependence for the quantum computations.</p><p>The argument for polynomial scaling of chemically accurate quantum chemistry calculations has two parts: "normal" organic molecules and strongly correlated systems. 
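As a small aside, the $S^{-2}$ sampling cost is easy to illustrate numerically. This is my own toy sketch, not from the paper:

```python
import numpy as np

# Toy illustration (my own, not from the paper) of the S^-2 cost: measuring
# a trial state in the eigenbasis of H yields the ground-state energy with
# probability |<trial|ground>|^2 = S^2, so ~1/S^2 repetitions are needed.
rng = np.random.default_rng(1)
H = rng.normal(size=(8, 8))
H = (H + H.T) / 2                          # a random Hermitian "Hamiltonian"
evals, evecs = np.linalg.eigh(H)
ground = evecs[:, 0]                       # exact ground state

trial = ground + 0.5 * rng.normal(size=8)  # imperfect trial wavefunction
trial /= np.linalg.norm(trial)

S_overlap = abs(trial @ ground)            # overlap with the exact ground state
p_ground = (evecs.T @ trial)[0] ** 2       # probability of measuring E_0
print(p_ground, S_overlap ** 2, 1 / S_overlap ** 2)
```

If the overlap decays exponentially as the system grows, the expected number of repetitions grows exponentially, which is the paper's worry about state preparation.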
</p><p>The former is pretty straightforward: no one knowledgeable is really arguing that CCSD(T)-level accuracy is insufficient for ligand-protein binding energies, and CCSD(T) scales polynomially with system size. So the simple notion of accelerating drug discovery by computing this with quantum computers does not hold water.</p><p>However, CCSD(T) does not work for strongly correlated systems and we don't have any real practical alternative for which we can test the scaling. Instead the authors look at simpler models of strongly correlated systems and demonstrate polynomial scaling with system size. </p><p>As the authors are careful to point out, none of this represents a rigorous proof of anything. But it is far from obvious that quantum chemistry is the killer app for quantum computing that most people seem to think it is. </p><p><a href="https://www.youtube.com/watch?v=O-uSrQuxV68&t=992s" target="_blank">In addition to the paper you can find a very clear lecture on the topic here.</a><br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-42646023106962791272022-07-31T13:31:00.001+02:002022-07-31T13:31:24.554+02:00Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization<p><a href="https://arxiv.org/abs/2206.12411" target="_blank">Wenhao Gao, Tianfan Fu, Jimeng Sun, Connor W. 
Coley (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgys0RbjqHvj-iuPuINfUuuLNp6a79gfGQwG7A05vI6WB_2pAwFwNom5jaFbBY4ZQ6e-nxaVZk0XCXUVdD7NxL0ZPsMq9UxDM66opGtppUJf5Ir5pHktJ24vn9BuRzZR88TnEpZri1QInOODMCTh5LdQd5nY8AfvGCk4-QC4bReGxXg71GhPsntz87D_Q/s2126/Screenshot%202022-07-31%20at%2013.04.08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="656" data-original-width="2126" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgys0RbjqHvj-iuPuINfUuuLNp6a79gfGQwG7A05vI6WB_2pAwFwNom5jaFbBY4ZQ6e-nxaVZk0XCXUVdD7NxL0ZPsMq9UxDM66opGtppUJf5Ir5pHktJ24vn9BuRzZR88TnEpZri1QInOODMCTh5LdQd5nY8AfvGCk4-QC4bReGxXg71GhPsntz87D_Q/w640-h198/Screenshot%202022-07-31%20at%2013.04.08.png" width="640" /></a></p><p></p><div style="text-align: center;">Figure 1 from the paper. (c) The authors 2022. Reproduced under the CC-BY license.</div><br />The development of generative models that can find molecules with certain properties has become very popular, but there are very few studies that compare them, so it's hard to know what works best. This study compares the performance of 25 different generative models in 23 different optimisation tasks and draws some very interesting conclusions.<p></p><p>None of these methods find the optimum value given a "budget" of 10,000 oracle evaluations, and for some tasks the best performance is not exactly impressive. This doesn't bode well for some real-life applications where even a few hundred property evaluations are challenging. </p><p>Some methods are slower to converge than others, so you might draw completely different conclusions regarding efficiency if you allowed 100,000 oracle evaluations. Similarly, some methods have high variability in performance, so you might draw very different conclusions from 1 run compared to 10 runs.
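The run-to-run variability point can be made concrete with a toy simulation (all numbers invented, not taken from the benchmark): method A is better on average but noisy, method B is slightly worse but reproducible, and in a single head-to-head run B still "wins" a substantial fraction of the time.

```python
import random

def single_run_win_fraction(trials: int = 100_000) -> float:
    """Fraction of single runs in which the low-variance method B beats
    the higher-mean, high-variance method A (toy score distributions)."""
    random.seed(0)
    wins_b = 0
    for _ in range(trials):
        a = random.gauss(0.80, 0.15)  # method A: better mean, large spread
        b = random.gauss(0.75, 0.02)  # method B: slightly worse, reproducible
        if b > a:
            wins_b += 1
    return wins_b / trials

print(single_run_win_fraction())
```

With these made-up distributions B wins roughly a third of single runs, even though A is clearly better in expectation.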
This is especially important for problems where you can only afford one run. It might be better to choose a method that performs slightly worse on average but is less variable, rather than risk a bad run from a highly variable method that performs better on average.</p><p><a href="http://doi.org/10.1186/s13321-017-0235-x">The method that performed best overall</a> is one of the oldest methods, published in 2017! </p><p>Food for thought.</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-44515234192015828892022-06-29T12:50:00.000+02:002022-06-29T12:50:10.680+02:00Deep Learning Metal Complex Properties with Natural Quantum Graphs<p><a href="https://doi.org/10.26434/chemrxiv-2022-fd43k" target="_blank">Hannes Kneiding, Ruslan Lukin, David Balcells (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcX2tUCPWymuvdtuOJLbNJCMh9gWqz3vgpGhlT7jBLO4ILsOSBGWZmZOkdx5pXZPq9cwFquO8CfOo-MMOqrl-ZTs1qaNzeKIy8wpO9LLU4u0vMY_HQnMLuDEbeXgvndBguKElmYCjjN8kMgSEjxiiIGeNTaBybC5BzJ868BPR3TZXg_kSU30-RRWdExQ/s1634/Screenshot%202022-06-29%20at%2010.33.21.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1082" data-original-width="1634" height="424"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcX2tUCPWymuvdtuOJLbNJCMh9gWqz3vgpGhlT7jBLO4ILsOSBGWZmZOkdx5pXZPq9cwFquO8CfOo-MMOqrl-ZTs1qaNzeKIy8wpO9LLU4u0vMY_HQnMLuDEbeXgvndBguKElmYCjjN8kMgSEjxiiIGeNTaBybC5BzJ868BPR3TZXg_kSU30-RRWdExQ/w640-h424/Screenshot%202022-06-29%20at%2010.33.21.png" width="640" /></a><br /></p><div style="text-align: center;">Figure 2 from the paper (c) The authors. Reproduced under the CC-BY-NC-ND 4.0 license</div><div style="text-align: center;"><br /></div><div style="text-align: left;">While there's been a huge amount of ML work on organic molecules, there has been comparatively little on transition metal complexes (TMCs). One of the reasons is that many of the cheminformatics tools we take for granted are harder to apply to TMCs due to their more complex bonding situations. This makes bond perception and the computation of node features like formal atomic charges, and hence graph representations, quite tricky. This, in turn, makes standard ML tools like binary fingerprints or graph-convolutional NNs hard to apply to TMCs.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">This paper suggests using data from DFT/NBO calculations to create so-called "quantum graphs", where the edges are determined using both bonding orbitals and bond orders, while node and edge features are derived from other NBO properties.</div><p></p><p>This representation is combined with two graph-NN methods (MPNN and MXMNet) and trained against DFT properties such as the HOMO-LUMO gap. The results are quite good and generally better than radius graph methods such as SchNet. However, one should keep in mind that both the descriptors and properties are computed with DFT.</p><p>Given that the computational cost of the descriptors is basically the same as that of the property of interest, this is a proof-of-concept paper that shows the utility of the general idea. However, it remains to be seen whether cheaper descriptors (e.g.
based on semi-empirical calculations) result in similar performance. Still, given the current scarcity of ML tools for TMCs, this is a very welcome advance.</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><p><br /></p>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-16745874335981732452022-05-30T12:57:00.000+02:002022-05-30T12:57:18.146+02:00Computer-designed repurposing of chemical wastes into drugs<p><a href="https://doi.org/10.1038/s41586-022-04503-9" target="_blank">Agnieszka Wołos, Dominik Koszelewski, Rafał Roszak, Sara Szymkuć, Martyna Moskal, Ryszard Ostaszewski, Brenden T. Herrera, Josef M. Maier, Gordon Brezicki, Jonathon Samuel, Justin A. M. Lummiss, D. Tyler McQuade, Luke Rogers & Bartosz A.
Grzybowski (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /></p><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPQekWuWpc4UMPjReix_UtfeAuQ8ZjrRJLI6-lbIwHhhSbOtOEGP7IKFXINZ-N_oJ89yxoBlpADG7GPClX7QuKwlaZC7AfdsDTteArXUBAHc0W1Ntc_8WWvJULeC2M3AanFwnPG3VStIzMzLs7XbaCBA152fjUYU_8uOq7oAGnxvfr27WhkUtMpDL9mg/s517/Screenshot%202022-05-30%20at%2010.59.59.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="386" data-original-width="517" height="299" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPQekWuWpc4UMPjReix_UtfeAuQ8ZjrRJLI6-lbIwHhhSbOtOEGP7IKFXINZ-N_oJ89yxoBlpADG7GPClX7QuKwlaZC7AfdsDTteArXUBAHc0W1Ntc_8WWvJULeC2M3AanFwnPG3VStIzMzLs7XbaCBA152fjUYU_8uOq7oAGnxvfr27WhkUtMpDL9mg/w400-h299/Screenshot%202022-05-30%20at%2010.59.59.png" width="400" /></a></div><div style="text-align: center;">Figure 2a from the paper. (c) 2022 the authors</div><div style="text-align: left;"><br /></div><div>When I talk to people about retrosynthesis prediction, they often mention that synthetic chemists don't tend to use such programs. There are many reasons for that, including various shortcomings of the suggested routes, but also the fact that, from a time-saving perspective, retrosynthesis planning makes up a small part of the synthesis process. One common answer to this is "OK, but wait till the robots arrive", but there are several important applications already. </div><div><br /></div><div>For example, in my own research on de novo molecule discovery I'm often left with hundreds of promising molecules where the only remaining selection criterion is ease of synthesis. Here I routinely use retrosynthesis programs to rank the molecules in terms of the number of synthesis steps to make the shortlist of 10-20 molecules that can be presented to experimental collaborators.
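The triage step described above amounts to a simple sort. A minimal sketch (molecule names and step counts are invented; `predicted_steps` stands in for the output of whatever retrosynthesis tool is used):

```python
def make_shortlist(predicted_steps: dict, n: int = 3) -> list:
    """Return the n candidates with the fewest predicted synthesis steps."""
    return sorted(predicted_steps, key=predicted_steps.get)[:n]

# hypothetical per-molecule step counts from a retrosynthesis program
predicted_steps = {"mol_A": 7, "mol_B": 2, "mol_C": 4, "mol_D": 3, "mol_E": 9}
print(make_shortlist(predicted_steps))  # → ['mol_B', 'mol_D', 'mol_C']
```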
</div><div><br /></div><div>This paper presents another example of science that would be impossible without these computational tools. The authors search for reaction networks that connect 189 small-molecule waste by-products from the chemical industry to 4113 high-value molecules (approved drugs and agrochemicals). They use a reaction prediction algorithm called Allchemy to iteratively generate increasingly complicated molecules and, at each step, bias the search towards the target. Among the 300 million molecules that result from this process they were able to identify 167 target molecules, with an average of 216 synthetic paths per target. The synthetic paths are further ranked using a complicated scoring function that accounts for all sorts of practical considerations, since the aim is to produce large quantities of each target, and a few of the paths are experimentally verified on the kg scale.</div><div><br /></div><div>One interesting part of the approach is the prediction of reaction conditions, which is done in terms of categories: e.g. protic/aprotic and polar/nonpolar solvents, and very low, low, room temperature, high, and very high temperatures.
This makes a lot more sense to me than trying to predict the exact solvent or temperature.</div><br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<p></p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-27585571704250714252022-04-27T15:00:00.000+02:002022-04-27T15:00:14.106+02:00Pairwise Difference Regression: A Machine Learning Meta-algorithm for Improved Prediction and Uncertainty Quantification in Chemical Search<p><a href="https://doi.org/10.1021/acs.jcim.1c00670" target="_blank">Michael Tynes, Wenhao Gao, Daniel J. Burrill, Enrique R. Batista, Danny Perez, Ping Yang, and Nicholas Lubbers (2021)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyF4KydSb1MN8OZqIit-P-STOkAi048V86KDlwx6Twv9LaUxyNQi5QrwOc5Wm-UPzaI-FlUls8IrvO6S7smNf7jw46HY7jv9jjbrnTXGG6Pqcg0tM52wZT57znJsBr8WrUyOGFFeXxTyOFCon4UIhfWee6RYNM2Af3Iv2KHXoHKLZ8tRxEwLmfb1YR8g/s500/images_medium_ci1c00670_0010.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="277" data-original-width="500" height="354" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyF4KydSb1MN8OZqIit-P-STOkAi048V86KDlwx6Twv9LaUxyNQi5QrwOc5Wm-UPzaI-FlUls8IrvO6S7smNf7jw46HY7jv9jjbrnTXGG6Pqcg0tM52wZT57znJsBr8WrUyOGFFeXxTyOFCon4UIhfWee6RYNM2Af3Iv2KHXoHKLZ8tRxEwLmfb1YR8g/w640-h354/images_medium_ci1c00670_0010.gif" width="640" /></a></div><div style="text-align: center;">TOC picture from the paper (c) 2021 ACS</div><p></p>This paper tries to solve two problems at once: data augmentation
for small data sets and a method-independent uncertainty quantification (UQ). <div><br /></div><div>Data augmentation is quite common in areas like image classification, where images can be perturbed (e.g. rotated by a few degrees) and still be recognisable. However, this is difficult in chemistry, where small perturbations in structure can have a non-negligible effect on properties. For text-based molecular representations one can use non-canonical SMILES for augmentation, but there is no generally applicable method.</div><div><br /></div><div>Similarly, most UQ methods are specific to the machine learning model type, with the exception of ensemble methods, which require training and deploying several models and can therefore be expensive.</div><div><br /></div><div>The paper offers a simple solution to both. The method is trained to reproduce the ground truth <i>difference</i> for all $n^2$ molecule pairs, thereby increasing the training set size significantly. When making a prediction for a new molecule, the model predicts the differences relative to all training set molecules, with the standard deviation serving as a measure of prediction uncertainty. Pretty neat idea and easy to implement! The main change is to construct molecular representations for the molecule pairs, but the authors outline one easy-to-implement approach.</div><div><br /></div><div>Depending on the task and training set size, the data augmentation decreases the MAE by 3-40%.
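The scheme is simple enough to sketch in a few lines. What follows is my own minimal reconstruction of the idea, not the authors' code: linear least squares stands in for the random forest used in the paper, and the "molecular" features are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))  # toy "molecular" features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

# Build all n^2 pairs: features are [x_i, x_j], target is y_i - y_j
n = len(X)
pairs = np.array([np.concatenate([X[i], X[j]]) for i in range(n) for j in range(n)])
diffs = np.array([y[i] - y[j] for i in range(n) for j in range(n)])

# Fit a regressor on the pairwise differences (least squares as a stand-in)
w, *_ = np.linalg.lstsq(pairs, diffs, rcond=None)

def predict_with_uncertainty(x_new):
    """Predict the difference to every training molecule, add back the known
    value, and use the spread of the n estimates as the uncertainty."""
    feats = np.array([np.concatenate([x_new, X[j]]) for j in range(n)])
    estimates = feats @ w + y  # predicted difference + anchor value
    return estimates.mean(), estimates.std()

mean, std = predict_with_uncertainty(np.array([0.5, -0.5, 1.0]))
print(mean, std)
```

The training set grows from n to n² examples, and the prediction comes with a built-in uncertainty, regardless of which regressor is plugged in.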
UQ quality is notoriously difficult to quantify, but the method appears to give uncertainties similar to those obtained by a random forest method.<br /><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-58586881228176661182022-03-29T15:00:00.000+02:002022-03-29T15:00:17.449+02:00Machine Learning May Sometimes Simply Capture Literature Popularity Trends: A Case Study of Heterocyclic Suzuki−Miyaura Coupling<p><a href="https://doi.org/10.1021/jacs.1c12005" target="_blank">Wiktor Beker, Rafał Roszak, Agnieszka Wołos, Nicholas H. Angello, Vandana Rathore, Martin D. Burke, and Bartosz A. Grzybowski (2022)</a><br />Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><br /><br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkbz3pSY3muRTd4FcJC7A8kv78mSZvn7ObN0r8TbnOuy2gqIeUKzWe_vT7_GXLCsdwYlg1xJqv0LczQU2TjOuhtalEf9Z_cTwssItp3nd85SJxoyJa3piH9Q9f78wjhRlQGkzIrjKuX2xcDnkn_AaOuBLK53IbOnyiIXi8VdFiR2URMlUcYaTmIMQOwA/s500/ja1c12005_0003.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="174" data-original-width="500" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkbz3pSY3muRTd4FcJC7A8kv78mSZvn7ObN0r8TbnOuy2gqIeUKzWe_vT7_GXLCsdwYlg1xJqv0LczQU2TjOuhtalEf9Z_cTwssItp3nd85SJxoyJa3piH9Q9f78wjhRlQGkzIrjKuX2xcDnkn_AaOuBLK53IbOnyiIXi8VdFiR2URMlUcYaTmIMQOwA/w640-h222/ja1c12005_0003.gif" width="640" /></a><br /><br /></p><p>What do you infer from this quote from the paper (emphasis added)?</p><p></p><blockquote>Another important problem, tackled herein, deals with the prediction of optimal conditions for a particular reaction in which there are
generally multiple viable choices of solvents or reagents. Several works[21−24] have attempted to use ML for the prediction of reaction conditions, and the overall message they seem to convey is that ML can, in fact, offer accurate predictions provided adequate numbers of literature examples on which to build the models (but see also critical ref 6). However, here, we demonstrate with a case study that this may have been an overoptimistic interpretation, and that even with large quantities of carefully curated literature data, ML approaches may not perform <i>considerably better </i>than estimates based on the popularity of reaction conditions reported in the literature. In other words, these ML models do not provide <i>significantly more</i> insights than just suggesting the most popular conditions which could be obtained by simple statistics over literature examples[25,26] and no “machine intelligence.”</blockquote>I can tell you what I inferred. References 21-24 used ML models to predict optimal reaction conditions, but failed to check whether they "provide significantly more insights than just suggesting the most popular conditions". I also inferred that the results from this study suggest that, had the authors checked, they would have found that not to be the case. <p></p><p>However, the four references refer to two papers (<a href="http://doi.org/10.1126/science.aar5169" target="_blank">21</a> and <a href="https://doi.org/10.1021/acs.accounts.0c00770" target="_blank">23</a>) by Doyle and co-workers on the prediction of reaction yields (<i>not conditions</i>) and two papers, one by Coley and co-workers and one by Reisman and co-workers (<a href="http://doi.org/10.1021/acscentsci.8b00357" target="_blank">22</a> and <a href="https://dx.doi.org/10.1021/acs.jcim.0c01234" target="_blank">24</a>, respectively), on the prediction of reaction conditions <i>with</i> <i>comparison to popularity baselines</i>.
</p><p>The paper looks at the prediction of solvent and base (and not catalysts and temperature, as implied by the TOC graphic above) for ca. 10,000 Suzuki coupling reactions from Reaxys. The best top-1 accuracies for base and solvent with ML are 80.6% and 51.7%, compared to popularity-baseline values of 76.8% and 29.8%. The authors use the term "significantly" (and related terms) without ever quantifying what they deem significant, but to me the ML solvent predictions seem significantly better than the popularity baseline. </p><p>Furthermore, as Coley and co-workers point out, the true metric is the accuracy of the combined prediction, e.g. correct solvent <i>and</i> base. For example, in the case of correct catalyst <i>and</i> solvent <i>and</i> reagent, Coley and co-workers found an accuracy of 57.3% compared to a popularity baseline of only 5.7%. However, I am not even certain whether Grzybowski and co-workers would deem that a significant improvement.</p><p>On a more constructive note, the topic of the paper does relate to an interesting fundamental question in ML: how to deal with imbalanced data, i.e. where there is a very popular single choice. One would perhaps naively suspect that this would be easier for a machine to learn, i.e. you just have to learn a few exceptions. But how do you typically learn exceptions? By memorising them, and we tend to employ many ML techniques to avoid just this.
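For reference, a popularity baseline is trivial to compute, which is part of the authors' point: any ML model has to clear this bar before it can claim to have learned anything. A toy example with an invented label distribution (loosely mimicking the base-prediction numbers above):

```python
from collections import Counter

def popularity_baseline_accuracy(labels):
    """Top-1 accuracy of always predicting the most common label."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

# invented distribution of base labels for illustration only
labels = ["K2CO3"] * 77 + ["Cs2CO3"] * 13 + ["K3PO4"] * 10
print(popularity_baseline_accuracy(labels))  # → 0.77
```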
</p><p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p><div class="separator" style="clear: both; text-align: center;"><br /></div><br />Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0tag:blogger.com,1999:blog-329801097007474097.post-62425504245561242942022-02-28T13:42:00.000+01:002022-02-28T13:42:09.187+01:00Finding hits among billions of molecules<p><a href="https://doi.org/10.1038/s41586-021-04220-9" target="_blank">Assaf Alon, Jiankun Lyu, Joao M. Braz, Tia A. Tummino, Veronica Craik, Matthew J. O’Meara, Chase M. Webb, Dmytro S. Radchenko, Yurii S. Moroz, Xi-Ping Huang, Yongfeng Liu, Bryan L. Roth, John J. Irwin, Allan I. Basbaum, Brian K. Shoichet & Andrew C. Kruse. Structures of the σ2 receptor enable docking for bioactive ligand discovery (2021)</a></p><p><a href="https://doi.org/10.1038/s41586-021-04220-9" target="_blank">Arman A. Sadybekov, Anastasiia V. Sadybekov, Yongfeng Liu, Christos Iliopoulos-Tsoutsouvas, Xi-Ping Huang, Julie Pickett, Blake Houser, Nilkanth Patel, Ngan K. Tran, Fei Tong, Nikolai Zvonok, Manish K. Jain, Olena Savych, Dmytro S. Radchenko, Spyros P. Nikas, Nicos A. Petasis, Yurii S. Moroz, Bryan L.
Roth, Alexandros Makriyannis & Vsevolod Katritch. Synthon-based ligand discovery in virtual libraries of over 11 billion compounds (2021)</a></p>Highlighted by <a href="https://twitter.com/janhjensen">Jan Jensen</a><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjIiJoUsdoh8D_OIjmWzZnD-LCA0LREaAObjFaouS5m1rmpvz4a_Vy0TltlAuCCjEy6EUPpAOd-OKmQkRmpZLvYbUdpZ5bdYJErWSx76ohhgNmyE6CCO1i-SBIjQ9ML_siPd1CO-MHiog6me-RgRhcqUemGiFaC2m9xWI884RLkFYmr53GZzeR337tFfQ=s1054" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="425" data-original-width="1054" height="258" src="https://blogger.googleusercontent.com/img/a/AVvXsEjIiJoUsdoh8D_OIjmWzZnD-LCA0LREaAObjFaouS5m1rmpvz4a_Vy0TltlAuCCjEy6EUPpAOd-OKmQkRmpZLvYbUdpZ5bdYJErWSx76ohhgNmyE6CCO1i-SBIjQ9ML_siPd1CO-MHiog6me-RgRhcqUemGiFaC2m9xWI884RLkFYmr53GZzeR337tFfQ=w640-h258" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;">Figure 2a and b from Alon <i>et al</i>. (c) 2021 Nature</div><p>The recent developments in make-on-demand molecular libraries present an interesting methodological challenge to virtual screening. Not too long ago, such a library would contain hundreds of millions or even a billion molecules, and there was still a chance to <a href="http://www.compchemhighlights.org/2019/02/ultra-large-library-docking-for.html" target="_blank">dock a significant portion of these libraries</a>. However, the sizes of the libraries have grown to well beyond 20 billion and show no sign of stopping. There is no way wholesale docking can keep up with this growth, so new approaches are needed. </p><p>One computational approach that has kept up with the growth of make-on-demand libraries is similarity searching. It is still possible to search these enormous libraries for similar molecules in just a few minutes. </p><p>Alon et al.
uses this general idea to select and dock 490 million molecules with properties that are similar to known binders to the target. Based on the docking scores they prioritised 577 molecules, of which 484 were successfully made and 127 showed good activity against the target. 20,000 analogues of the four best candidates are then extracted from among 28 billion molecules in the Enamine REAL Space make-on-demand library, and docked. The 105 best candidates were made and tested, leading to further improvement in the measured affinities.</p><p>Sadybekov et al. essentially docks the individual building blocks used in the make-on-demand library and then combines the best-scoring fragments into about 1 million molecules for a second round of docking. Using this approach they identified 80 promising candidates, of which 60 could be synthesised. Of these 60 molecules, 21 proved active. 920 analogues of the three best candidates are then extracted from among 11 billion molecules in the Enamine REAL Space make-on-demand library, and docked. The 121 best candidates were made and tested, leading to further improvement in the measured affinities.</p><p>There are several take-home messages here. </p><p>The percentage of active compounds against a particular target in a library is very small, so you don't get a lot of useful hits until you work with these enormous libraries.</p><p>Docking <i>does</i> help in identifying active compounds. Docking has a bad rep in certain circles, and I have seen several people refer to docking programs as "random number generators", but studies like these show that this is not the case. Sure, if one expects an excellent, or even respectable, correlation coefficient between docking scores and binding affinities, one will be sorely disappointed. However, as these studies show, molecules with good docking scores have a much higher chance of being active than molecules with bad docking scores. </p><p>The success rate seems to be about 30-50% depending on the target.
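To put such hit rates in perspective, a quick back-of-the-envelope calculation (assuming independent outcomes and the lower, 30% per-molecule hit rate): with only a handful of candidates there is a real chance of seeing no actives at all, while with dozens that chance essentially vanishes.

```python
def prob_no_actives(hit_rate: float, n_tested: int) -> float:
    """Probability that none of n_tested independent candidates is active."""
    return (1 - hit_rate) ** n_tested

print(prob_no_actives(0.30, 5))   # ~17% chance of zero actives in 5 tries
print(prob_no_actives(0.30, 30))  # essentially zero for 30 tries
```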
So if you are at the lower end and only able to make and test a handful of candidates (which is often the case for academic studies), there's a reasonable chance you won't find any actives and will conclude that docking is useless. It's only when you are able to make and test dozens of molecules that you see that docking is working for you. The make-on-demand libraries now make such numbers feasible for academics.</p><p>Finally, several of the co-authors on the two papers I highlight are Ukrainian and are, along with their families and friends, likely in grave danger right now as their country is being attacked by Putin and his ilk. </p><br /><img src="http://i.creativecommons.org/l/by/4.0/88x31.png" /><br />This work is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</div>Jan Jensenhttp://www.blogger.com/profile/08595894308946022740noreply@blogger.com0