
Monday, May 18, 2020

Open Graph Benchmark: Datasets for Machine Learning on Graphs

Hu et al. have assembled a diverse collection of datasets for machine learning on graphs. The benchmark is intuitively structured and includes evaluation protocols and metrics, and the authors report the measured performance of a few popular approaches for each application (e.g., ROC-AUC, PRC-AUC, hits, or accuracy). There are several datasets in each of the three classes of task: node property prediction (ogbn-), link property prediction (ogbl-), and graph property prediction (ogbg-).

Of particular interest to those of us who work in biochemistry, broadly defined, are the SMILES molecular graphs adapted from MoleculeNet [2], such as ogbg-molhiv (HIV) and ogbg-pcba (PubChem BioAssay); the protein datasets ogbl-ppa and ogbn-proteins (both built on protein-protein association networks) are also worth a look. Note that MoleculeNet is not included in its entirety - far from it - so that resource is definitely also worth exploring closely if you have not already done so.
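To make this concrete, here is a minimal sketch of how one of these datasets might be loaded and scored with the accompanying ogb Python package (together with PyTorch Geometric). The class and function names follow the published OGB examples, but check the current documentation, as the API may have changed since this post.

import numpy as np
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator

# ogbg-molhiv: SMILES-derived molecular graphs with a binary HIV-activity label
dataset = PygGraphPropPredDataset(name="ogbg-molhiv")
split_idx = dataset.get_idx_split()          # standardized train/valid/test split
print(dataset[0])                            # one molecule as a graph object

# The evaluator encodes the prescribed metric for this dataset (ROC-AUC)
evaluator = Evaluator(name="ogbg-molhiv")
print(evaluator.expected_input_format)

# Random "predictions" on the test split, just to show the evaluation protocol
test_idx = split_idx["test"].tolist()
y_true = np.vstack([dataset[i].y.numpy() for i in test_idx])
y_pred = np.random.rand(*y_true.shape)
print(evaluator.eval({"y_true": y_true, "y_pred": y_pred}))   # {"rocauc": ...}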

If you are the competitive type, your efforts can be submitted to the leaderboards at the hosting website: https://ogb.stanford.edu

Monday, February 10, 2020

On the Completeness of Atomic Structure Representations


Here, I highlight an interesting recent preprint that tries to formalize and quantify something I have previously posted about here at Computational Chemistry Highlights (see the post on Atomistic Fingerprints here), namely how best to describe atomic environments in all their many-body glory. A widely held perception among practitioners of the "art" of molecular simulation is that, while we usually restrict ourselves to 2-body effects for efficiency, a 3-body description uniquely specifies the atomic environment (up to a rotation and permutation of like atoms). Not the case(!), and the authors effectively debunk this belief with several concrete counter-examples.


FIG. 1: "(a) Two structures with the same histogram of triangles (angles 45, 45, 90, 135, 135, 180 degrees). (b) A manifold of degenerate pairs of environments: in addition to three points A, B, B′, a fourth point C+ or C− is added, leading to two degenerate environments A+ and A−. (c) Degeneracies induce a transformation of feature space so that structures that should be far apart are brought close together."

Perhaps the most important implication of the work is that it helps us understand, in part, why modern machine-learning (ML) force fields appear to be so successful. At first sight the conclusion we face is daunting: for arbitrarily high accuracy, no n-point correlation cutoff may suffice to reconstruct the environment faithfully. Why, then, can recent ML force fields be used to calculate extensive properties, such as the molecular energy, so accurately? According to the results of Pozdnyakov, Willatt et al.'s work, low-correlation-order representations often suffice in practice because, as they state, "the presence of many neighbors or of different species (that provide distinct “labels” to associate groups of distances and angles to specific atoms), and the possibility of using representations centred on nearby atoms to lift the degeneracy of environments reduces the detrimental effects of the lack of uniqueness of the power spectrum [the power spectrum is equivalent to the 3-body correlation, Madsen], when learning extensive properties such as the energy." However, the authors do suggest that introducing higher-order invariants that lift the detrimental degeneracies might be a better approach in general. In any case, the preprint raises many technical and highly relevant issues, and it is well worth going over if you don't mind getting into the weeds with the maths.
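For readers who want a feel for what these low-order descriptors look like in practice, below is an illustrative sketch (my own, not the authors' code) of 2- and 3-body histogram fingerprints for a single atomic environment; the function name and binning choices are arbitrary and only for illustration.

import numpy as np
from itertools import combinations

def environment_fingerprint(center, neighbors, r_bins=10, theta_bins=18):
    """Pair-distance (2-body) and angle (3-body) histograms around one center atom."""
    vecs = np.asarray(neighbors, dtype=float) - np.asarray(center, dtype=float)
    dists = np.linalg.norm(vecs, axis=1)
    g2, _ = np.histogram(dists, bins=r_bins)                  # 2-body correlation
    angles = []
    for i, j in combinations(range(len(vecs)), 2):
        cos_t = vecs[i] @ vecs[j] / (dists[i] * dists[j])
        angles.append(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))
    g3, _ = np.histogram(angles, bins=theta_bins, range=(0.0, 180.0))  # 3-body correlation
    # The preprint's point: two genuinely different environments can share both of
    # these histograms, so descriptors of this order are not guaranteed to be unique.
    return g2, g3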

Wednesday, September 25, 2019

Deflate to Understand Complex Molecular Kinetics

Contributed by Jesper Madsen


Dimensionality reduction is at the core of understanding and making intuitive sense of complex dynamic phenomena in chemistry. It is usually assumed that the slowest mode is the one of primary interest; however, it is critical to realize that this is not always so! A conceptual example of this is a protein folding simulation (Lindorff-Larsen et al. Science 334, 517-520, 2011) where the slowest dynamical mode is not the folding itself (see Figure). What, then, is the influence of “non-slowest” modes in this process, and how can it most appropriately be elucidated?

FIG: Figure 2 from the preprint: "(A) Sampled villin structures from the MD trajectory analyzed. Helical secondary structure is colored and coils are white. Each image represents five structures sampled from similar locations in TIC space as determined by a 250-center k-means model built upon the first three original TICs. The purple structure represents the folded state, and the blue structure represents the denatured state. The green structure is a rare helical misfolded state that we assert is an artifact. (B) Two-dimensional histograms for TICA transformations constructed from villin contact distances. Dashed lines indicate the regions corresponding to the sampled structures of the same color. The first TIC tracks the conversion to and from the rare artifact only. The second TIC tracks the majority of the folding process and correlates well with RMSD to the folded structure."
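As a rough illustration of the TICs referred to in the figure, here is a minimal numpy/scipy sketch of a symmetrized TICA estimate from a featurized trajectory; the variable names and lag time are placeholders, and in practice one would use a dedicated package such as PyEMMA or deeptime.

import numpy as np
from scipy.linalg import eigh

def tica(X, lag=100, n_components=2):
    """Time-lagged independent components of a (n_frames, n_features) trajectory,
    e.g. the villin contact distances used in the preprint."""
    X = X - X.mean(axis=0)
    X0, Xt = X[:-lag], X[lag:]
    C0 = X0.T @ X0 / (len(X0) - 1)        # instantaneous covariance
    Ct = X0.T @ Xt / (len(X0) - 1)        # time-lagged covariance
    Ct = 0.5 * (Ct + Ct.T)                # symmetrize so eigenvalues are real
    # Generalized eigenproblem Ct v = lambda C0 v; the largest lambda is the slowest mode.
    # (A small ridge on C0 may be needed if the features are redundant.)
    evals, evecs = eigh(Ct, C0)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], X @ evecs[:, order]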



This work by Husic and Noé shows how deflation can provide an answer to these questions. Technically speaking, deflation refers to a collection of methods for modifying a matrix, once its largest eigenvalue is known, so that the remaining eigenvalues can be found. In their example of the folding simulation, the dominant Time-lagged Independent Component (TIC) encapsulates the "artifact" variation that we are not really interested in. Thus, a kinetic (Markov-state) model constructed on top of it will be contaminated in several undesirable ways, as the authors discuss in great detail.
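To make the deflation idea concrete, here is a small numpy toy (an illustrative sketch, not the authors' implementation) of Hotelling deflation combined with power iteration for a symmetric matrix: once the dominant eigenpair is found, it is subtracted out so that the next eigenvalue becomes dominant.

import numpy as np

def deflated_eigenpairs(A, k=3, n_iter=2000, seed=0):
    """Top-k eigenpairs of a symmetric matrix A via power iteration plus Hotelling deflation."""
    A = np.array(A, dtype=float)
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(k):
        v = rng.normal(size=A.shape[0])
        for _ in range(n_iter):              # power iteration: converges to the
            v = A @ v                        # eigenvector of largest |eigenvalue|
            v /= np.linalg.norm(v)
        lam = v @ A @ v
        pairs.append((lam, v))
        A = A - lam * np.outer(v, v)         # deflation: remove the found mode
    return pairs

Loosely speaking, this is the role deflation plays in the paper's setting: the unwanted dominant mode is removed so that the remaining, chemically interesting components can be analyzed on their own.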

In principle, this should be a very common problem, since chemical systems have complex Hamiltonians. Perhaps the reason we don’t see it discussed more is that ultra-rare events – real or artifactual – are not usually sampled in conventional simulations. So, with the increasing computational power available to us, and simulations approaching ever-longer timescales, this is likely something we will need to be able to handle. The preprint describes well how one can think about attacking these potential difficulties.

Tuesday, December 23, 2014

Computational Chemistry: 2014 in numbers

In 2014 we learnt that two of the seminal DFT papers (by Becke, and by Lee, Yang and Parr) are amongst the ten most cited scientific papers of all time, and that, of chemistry papers published in the last ten years, the fourth and fifth most cited also relate to computational research (by Truhlar and Hess, respectively). In this vein, I thought it would be interesting to perform a (pseudo)scientific analysis of the usage of computation in chemistry research in, and in the years leading up to, 2014, as judged by bibliometric data.

Searching all 2014 chemistry papers in the Web of Science for mention of "computation" or "computational" in either the article title, abstract or keywords suggests that approximately 2.7% of chemistry research involved computation of some variety this year (9,101 of a staggering 331,699 papers). This is most likely an underestimate since searching for more specific phrases such as "DFT" will turn up more hits. The same analysis over previous years reveals a steady increase in the proportion of chemistry research using computation from 0.6% in 1994, to 1.1% in 2004 and 2.2% in 2010.

Around 20% of all the computational chemistry papers published in 2014 emanate from the USA, more than double the closest competitor, China. The top ten nations in terms of publications are USA 19.5%, China 9.3%, Germany 6.1%, India 4.3%, France 4.0%, Italy 3.8%, Spain 3.7%, England 3.6%, Japan 2.7% and Canada 2.4%, together making up nearly 60% of the total output. A decade ago, in 2004, the ten most prolific countries accounted for around 87% of total output, which indicates that recent years have witnessed greater global involvement in computational chemistry. Noticeable trends are seen in individual nations' shares of the computational chemistry pie: the USA and some European nations effectively halved their fraction of papers between 2004 and 2014, China's output nearly doubled from 5.4% in 2004 to 9.3% in 2014, and India emerged from outside the top ten into fourth place in 2014. It should be borne in mind that the globalization of science will inevitably lead to some over-counting of papers here, due to multiple addresses appearing on the same paper.