Federico Zipoli, Carlo Baldassari, Matteo Manica, Jannis Born, and Teodoro Laino (2023)
Highlighted by Jan Jensen


This work is licensed under a Creative Commons Attribution 4.0 International License.
Daniel Vella and Jean-Paul Ebejer (2023)
Highlighted by Jan Jensen
This paper is an update and expansion of this seminal paper by Pande and co-workers (you should definitely read both). It compares the ability of few-shot methods to distinguish active and inactive compounds with that of more conventional approaches for very small datasets. It concludes that the former outperform the latter for some data sets but not for others, which is surprising given that few-shot methods are designed with very small data sets in mind.
Few-shot methods learn a graph-based embedding that minimizes the distance between samples and their respective class prototypes while maximizing the distance between samples and other class prototypes (where a prototype is often the geometric center of a group of molecules). The training set, which is composed of a "query set" that you are trying to match to a "support set", is typically small and changes for each epoch (now called an episode) to avoid overfitting.
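To make the prototype idea concrete, here is a minimal sketch (my own, not the paper's code) that assigns each query molecule to the class of the nearest prototype. The random embeddings are placeholders standing in for the learned graph-based embedding.

```python
import numpy as np

def predict_from_prototypes(query_emb, support_emb, support_labels):
    """Assign each query to the class whose prototype (the mean support
    embedding for that class) is closest in Euclidean distance."""
    classes = np.unique(support_labels)
    prototypes = np.stack([support_emb[support_labels == c].mean(axis=0)
                           for c in classes])              # one center per class
    dists = np.linalg.norm(query_emb[:, None, :] - prototypes[None, :, :],
                           axis=-1)                        # (n_query, n_class)
    return classes[dists.argmin(axis=1)]

# Toy 2-way episode: 10 actives + 10 inactives in a made-up 4D embedding
rng = np.random.default_rng(0)
support, labels = rng.normal(size=(20, 4)), np.array([1] * 10 + [0] * 10)
queries = rng.normal(size=(5, 4))
print(predict_from_prototypes(queries, support, labels))
```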
In this paper, the largest support set was composed of 20 molecules (10 actives and 10 inactives) sampled (together with the query set) from a set of 128 molecules with a 50/50 split of actives and inactives. The performance was then compared to RF and GNN models trained on 20 molecules.
My main takeaway from the paper was actually how well the conventional models performed, especially given that they effectively had a smaller training set: the few-shot methods saw all 128 molecules over the course of training, whereas the conventional methods only saw a subset.
Ryo Tamura, Kei Terayama, Masato Sumita, and Koji Tsuda (2023)
Highlighted by Jan Jensen
Figure 1 from the paper. (c) APS 2023. Reproduced under the CC-BY license.
One of the main challenges in multi-objective optimisation is how to weigh the different objectives to get the desired results. Pareto optimisation can in principle solve this problem, but if you get too many solutions you have to select a subset for testing, which basically involves (manually) weighing the importance of each objective.
This paper proposes a new way to select the potentially most interesting candidates. The idea is basically to identify the most "novel" candidates to maximise the chances of finding "interesting" properties. They do this by identifying points on the Pareto front with the lowest "density of states" for each objective, i.e. points with few examples in property space.
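As a rough illustration of that selection idea (my own sketch, not the authors' implementation), one could estimate a per-objective density from all evaluated candidates and keep the Pareto points sitting in the least populated regions; combining the per-objective densities by multiplication is a simplification I have made for brevity.

```python
import numpy as np
from scipy.stats import gaussian_kde

def select_low_density(pareto_points, all_points, n_select=5):
    """pareto_points, all_points: arrays of shape (n, n_objectives).
    Rank Pareto-optimal points by the estimated density of all evaluated
    candidates around them and return the n_select sparsest ones."""
    density = np.ones(len(pareto_points))
    for j in range(all_points.shape[1]):
        kde = gaussian_kde(all_points[:, j])    # 1D "density of states" per objective
        density *= kde(pareto_points[:, j])     # combine objectives (simplification)
    return pareto_points[np.argsort(density)[:n_select]]
```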
The method is presented as a post hoc selection method, but could also be used as a search criterion to help focus the search on these areas of property space.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Julian A. Hueffel, Theresa Sperger, Ignacio Funes-Ardoiz, Jas S. Ward, Kari Rissanen, Franziska Schoenebeck (2021)
Highlighted by Jan Jensen
Pd catalysts can exist in either a dimer or monomer form depending on the ligands, and there are no heuristic rules for predicting which form will be favoured by a particular ligand. Even DFT-computed dimerization energies fail to give consistent predictions.
The authors started with a database of 348 ligands, each characterised by 28 different descriptors, which were divided into eight groups by k-means clustering of the descriptors. The four ligands known to favour dimer formation were found in two clusters, with a combined size of 89 ligands. The prediction is thus that these 89 ligands are more likely to favour dimer formation, compared to the other 256.
The authors decided to focus on the 66 ligands in the 89-ligand subset that contain P-C bonds and computed 42 new DFT-based descriptors that explicitly address dimer formation, such as the dimerization energy. Based on these and the old descriptors, the authors grouped the 66 ligands into six clusters, where two of the clusters, with a combined size of 25, contained the four known dimer-forming ligands. The prediction is thus that the other 21 ligands should also form dimers.
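A minimal sketch of this kind of clustering workflow (with placeholder descriptor data and hypothetical ligand indices, not the paper's actual descriptors) might look like this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.random((348, 28))              # placeholder for the 28 descriptors
known_dimer_idx = [12, 57, 130, 301]             # hypothetical indices of known dimer-formers

X = StandardScaler().fit_transform(descriptors)  # put descriptors on a common scale
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

# keep every ligand that shares a cluster with a known dimer-former
dimer_clusters = np.unique(labels[known_dimer_idx])
candidates = np.where(np.isin(labels, dimer_clusters))[0]
print(f"{len(candidates)} ligands share a cluster with the known dimer-formers")
```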
It's a little unclear, but from what I can tell the authors then experimentally tested nine of the 21 ligands, of which seven formed dimers. That's a very good hit rate starting from five data points!
This work is licensed under a Creative Commons Attribution 4.0 International License.
Prudencio Tossou, Cas Wognum, Michael Craig, Hadrien Mary, Emmanuel Noutahi (2023)
Highlighted by Jan Jensen
Why do ML models perform much worse on different test sets? There can be many reasons for such a shift in performance, but the main culprit is often a covariate shift, meaning that the training and test sets are quite different. This study seeks to quantify this effect for different molecular representations, ML algorithms, and datasets (both regression and classification).
The authors find that the difference between the test and train error (from a random split) is mostly governed by the representation (as opposed to the ML algorithm). Furthermore, representations that result in shorter distances between molecules (specifically 5-NN distances) on average are the ones that give a smaller difference in error between training and test set. However, those representations do not necessarily result in lower test set errors.
So while you can't use representation distances to pick the representation, you can use them to pick the best splitting method for obtaining your training set. The best test set is the one with the shortest overall representation distance to the deployment set (i.e. the set you want to use your ML model on). The authors find that the best splitting method depends on the representation but is often scaffold splitting.
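A rough sketch of how such a 5-NN distance diagnostic might be computed (my reading of the idea, assuming binary fingerprints and the Jaccard/Tanimoto distance, not the authors' code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mean_5nn_distance(X_reference, X_query, metric="jaccard"):
    """Average distance from each query molecule to its 5 nearest
    neighbours in the reference set, for a given representation
    (here binary fingerprints with the Jaccard/Tanimoto distance)."""
    nn = NearestNeighbors(n_neighbors=5, metric=metric).fit(X_reference)
    dists, _ = nn.kneighbors(X_query)            # (n_query, 5)
    return dists.mean()

# e.g. compare candidate splits via mean_5nn_distance(test_fps, deployment_fps)
```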
Thanks to Cas Wognum for a very helpful discussion.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Hannes Kneiding, Ainara Nova, David Balcells (2023)
Highlighted by Jan Jensen
The authors show that an NBO analysis can be used to identify the charges (as well as the coordination modes) of individual ligands in TM complexes. This is a key property needed to properly characterise the ligands and, thus, the complex as a whole. They have manually checked the approach for 500 compounds and find that it gives reasonable results in 95% of the cases. That number drops to 92% if the coordination mode is also considered. They provide these, and many other, properties of 30K ligands extracted from the CSD.
The NBO analysis is based on PBE/TZV//PBE/DZV calculations, which are a bit costly, but it will be interesting to see whether lower levels of theory (e.g. DZV//xTB) give similar results.
Based on this knowledge the authors build a data set of 1.37B square-planar Pd compounds and compute their polarizability and HOMO-LUMO gap. They then search this space for molecules with both large polarizabilities and HOMO-LUMO gaps using a genetic algorithm that optimises the Pareto front, and show that optimal solutions can be found by considering only 1% of the entire space. The GA code is not available yet, but should be released soon.
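Since the GA code isn't released yet, here is only a minimal sketch of the Pareto bookkeeping such a search needs (illustrative, assuming both objectives are to be maximised):

```python
import numpy as np

def pareto_front(scores):
    """scores: (n, 2) array of objective values (here polarizability and
    HOMO-LUMO gap, both to be maximised). Returns the indices of the
    non-dominated candidates."""
    keep = np.ones(len(scores), dtype=bool)
    for i, s in enumerate(scores):
        # i is dominated if another point is at least as good in both
        # objectives and strictly better in at least one
        keep[i] = not np.any(np.all(scores >= s, axis=1) &
                             np.any(scores > s, axis=1))
    return np.where(keep)[0]

# a GA generation would then preferentially select parents from this front
```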
The paper introduces brain-inspired modular training (BIMT), which leads to relatively simple NNs that are easier to interpret. "Brain-inspired" comes from the fact that the brain is not fully connected like most NNs: it is a 3D entity with physical connections (axons), and longer axons mean slower communication between neurons. The idea is to enforce this modularity during training by assigning positions to individual neurons and introducing a length-dependent penalty in the loss function (in addition to conventional L1 regularisation). This is combined with a swap operation that can swap neurons to decrease the loss.
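A toy sketch of what such a length-dependent penalty might look like (my own illustration in PyTorch, not the authors' implementation):

```python
import torch

def length_penalty(weight, pos_in, pos_out, lam=1e-3):
    """weight: (n_out, n_in) weight matrix; pos_in/pos_out: (n, 2) neuron
    positions. Each weight's L1 norm is scaled by the distance between
    the neurons it connects, so long connections cost more."""
    dist = torch.cdist(pos_out, pos_in)          # (n_out, n_in) pairwise distances
    return lam * (weight.abs() * dist).sum()

# during training this term is added to the task loss (together with
# ordinary L1 regularisation); a separate swap step reorders neurons
# when doing so lowers the total loss
```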
The result is much simpler networks that, at least for relatively simple objectives, are intuitive and easier to interpret as you can see from the figure above.
The code is available here (Google Colab version). It would be very interesting to apply this to chemical problems!
This work is licensed under a Creative Commons Attribution 4.0 International License.