Sunday, May 31, 2020

Learning Molecular Representations for Medicinal Chemistry

Kangway V. Chuang, Laura M. Gunsalus, and Michael J. Keiser (2020)
Highlighted by Jan Jensen


Figure 3 from the paper. (c) ACS 2020.

I found this miniperspective a very enjoyable read. It covers much more than the title suggests (at least to me): a mini history of deep learning in MedChem, when to use deep learning and when to use other ML techniques such as regression or random forests (see the figure above), and some of the fundamental challenges of using ML and generative models in MedChem, just to name a few topics.

I found the last topic particularly interesting and include two of my favourite quotes from the paper below, but I really recommend that you read the entire paper.

Critically, small-molecule drug discovery breaks standard assumptions in many technological applications of machine learning. Most machine learning algorithms operate on the assumption that training and testing data are independently and identically distributed (the i.i.d. assumption). For example, we would expect a standard image classifier trained to exclusively distinguish cats from dogs to generalize to new images of cats and dogs. This model will likely produce nonsensical classifications if asked to evaluate pictures of humans. In stark contrast, real-world drug discovery breaks this standard i.i.d. assumption. The optimization and design of small molecules necessarily explore structural variations drawn from intentionally novel regions of chemical space. Large structural changes to small-molecule hits are typically required to become a lead. For a model to be useful to the practicing medicinal chemist, it must generalize to out-of-distribution examples.
Critically, if generative models are to guide drug design, they cannot merely produce trivial extensions of the training data set. It remains unclear whether the latent spaces of generative models, which effectively interpolate across the chemical space of the training data, are capable of usefully extrapolating into new regions of chemical structure space. Furthermore, current generative models are torn between novelty and accessibility.



This work is licensed under a Creative Commons Attribution 4.0 International License.

Monday, May 18, 2020

Open Graph Benchmark: Datasets for Machine Learning on Graphs

A diverse collection of datasets for ML applications on graphs has been collected by Hu et al. The benchmark is intuitively structured and includes evaluation protocols and metrics. Furthermore, the authors have reported the performance of a few popular approaches on each dataset, using the task-appropriate metric (e.g., ROC-AUC, PRC-AUC, Hits@K, or accuracy). There are several datasets in each of the three classes of tasks: node property prediction (ogbn-), link property prediction (ogbl-), and graph property prediction (ogbg-).

Of particular interest to those of us who work in biochemistry, broadly defined, are the molecular graphs (derived from SMILES) adapted from MoleculeNet [2], such as ogbg-molhiv (HIV) and ogbg-pcba (PubChem BioAssay); ogbl-ppa (protein-protein association) and ogbn-proteins (a protein-protein association network) are also of interest. Note that MoleculeNet is not included in its entirety (far from it), so that resource is definitely also worth a close look if you have not already explored it.
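The datasets ship with ready-made splits and evaluators through the ogb Python package, so going from a dataset name to a scored prediction takes only a few lines. The sketch below is a minimal example for ogbg-molhiv using the PyTorch Geometric loader; it assumes the package's interface as documented around the time of writing (PygGraphPropPredDataset, Evaluator), and the random predictions are just a placeholder for a real model's output.

# Minimal sketch (assumed ogb package interface): load ogbg-molhiv with the
# PyTorch Geometric loader and score placeholder predictions with the
# dataset's own evaluator.
import numpy as np
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator

dataset = PygGraphPropPredDataset(name="ogbg-molhiv")  # downloads the data on first use
split_idx = dataset.get_idx_split()                    # scaffold-based train/valid/test split
test_set = dataset[split_idx["test"]]

# Random "predictions" stand in for a real model's probabilities.
y_true = np.vstack([data.y.numpy() for data in test_set])
y_pred = np.random.rand(*y_true.shape)

evaluator = Evaluator(name="ogbg-molhiv")              # knows this task's metric (ROC-AUC)
print(evaluator.eval({"y_true": y_true, "y_pred": y_pred}))

The same pattern (a dataset class plus a matching evaluator) applies to the ogbn- and ogbl- collections, so swapping tasks mostly amounts to changing the dataset name and loader module.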

If you are the competitive type, your efforts can be submitted to the leaderboards at the hosting website: https://ogb.stanford.edu