
Monday, May 18, 2020

Open Graph Benchmark: Datasets for Machine Learning on Graphs

Hu et al. have assembled a diverse collection of datasets for machine-learning applications on graphs. The benchmark is intuitively structured and includes evaluation protocols and metrics, and the authors report the measured performance of a few popular approaches for each application (e.g., ROC-AUC, PRC-AUC, hits, or accuracy). There are several datasets in each of three classes of tasks: node property prediction (ogbn-), link property prediction (ogbl-), and graph property prediction (ogbg-).

Of particular interest to those of us who work in biochemistry, broadly defined, are the SMILES molecular graphs adapted from MoleculeNet [2], such as ogbg-molhiv (HIV) and ogbg-pcba (PubChem BioAssay); ogbl-ppa and ogbn-proteins (both built on protein-protein association networks) are also of interest. Note that MoleculeNet is not included in its entirety (far from it), so that resource is definitely also worth a close look if you have not already explored it.

If you are the competitive type, you can submit your results to the leaderboards at the hosting website: https://ogb.stanford.edu
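For readers who want to try it out, a minimal sketch of the intended workflow might look as follows (this assumes the `ogb` Python package with a PyTorch Geometric backend; the random "predictions" are placeholders for a real model's outputs):

```python
# Minimal sketch: load ogbg-molhiv with its standardized split and score
# predictions using the benchmark's own evaluator (ROC-AUC for this task).
import numpy as np
from ogb.graphproppred import PygGraphPropPredDataset, Evaluator

dataset = PygGraphPropPredDataset(name="ogbg-molhiv")  # downloads on first use
split_idx = dataset.get_idx_split()                    # train/valid/test indices
test_set = dataset[split_idx["test"]]

# Placeholder predictions (random scores); a real model's graph-level
# outputs would go here instead.
y_true = np.vstack([data.y.numpy() for data in test_set])
y_pred = np.random.rand(*y_true.shape)

evaluator = Evaluator(name="ogbg-molhiv")              # metric fixed by the task
print(evaluator.eval({"y_true": y_true, "y_pred": y_pred}))  # {'rocauc': ...}
```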

Monday, February 10, 2020

On the Completeness of Atomic Structure Representations


Here, I highlight an interesting recent preprint that tries to formalize and quantify something I have previously posted about here at Computational Chemistry Highlights (see the post on Atomistic Fingerprints), namely how best to describe atomic environments in all their many-body glory. A widely held perception among practitioners of the "art" of molecular simulation is that, while we usually restrict ourselves to 2-body effects for efficiency, 3-body descriptions uniquely specify an atomic environment (up to rotation and permutation of like atoms). This is not the case (!), and the authors effectively debunk the belief with several concrete counter-examples.


FIG. 1: "(a) Two structures with the same histogram of triangles (angles 45, 45, 90, 135, 135, 180 degrees). (b) A manifold of degenerate pairs of environments: in addition to three points A, B, B′, a fourth point C+ or C− is added, leading to two degenerate environments, + and −. (c) Degeneracies induce a transformation of feature space so that structures that should be far apart are brought close together."

Perhaps the most important implication of the work is that it helps us understand, in part, why modern machine-learning (ML) force fields appear to be so successful. At first sight the conclusion we face is daunting: for arbitrarily high accuracy, no finite n-point correlation order may suffice to reconstruct an environment faithfully. Why, then, can recent ML force fields be used to calculate extensive properties such as the molecular energy so accurately? According to the results of Pozdnyakov, Willatt et al.'s work, low-correlation-order representations often suffice in practice because, as they state, "the presence of many neighbors or of different species (that provide distinct “labels” to associate groups of distances and angles to specific atoms), and the possibility of using representations centred on nearby atoms to lift the degeneracy of environments reduces the detrimental effects of the lack of uniqueness of the power spectrum [the power spectrum is equivalent to the 3-body correlation, Madsen], when learning extensive properties such as the energy." However, the authors do suggest that introducing higher-order invariants that lift the detrimental degeneracies might be a better approach in general. In any case, the preprint raises many technical and highly relevant issues, and it is well worth going over if you don't mind getting into the weeds with the maths.
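To get a feel for this kind of degeneracy, here is a classic toy illustration of my own choosing (not one of the paper's counter-examples): a pair of homometric 1D point sets, i.e., two genuinely different arrangements that share exactly the same histogram of pairwise distances, so no purely 2-body descriptor can tell them apart:

```python
# Two distinct 1D point sets that are homometric: they share the same
# multiset of pairwise distances, so any 2-body (pair-distance) descriptor
# is blind to the difference between them.
from itertools import combinations

A = [0, 1, 4, 10, 12, 17]
B = [0, 1, 8, 11, 13, 17]

def pair_distances(points):
    return sorted(abs(i - j) for i, j in combinations(points, 2))

print(pair_distances(A) == pair_distances(B))  # True
print(pair_distances(A))  # yet A and B are not related by any rigid motion
```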

Wednesday, September 25, 2019

Deflate to Understand Complex Molecular Kinetics

Contributed by Jesper Madsen


Dimensionality reduction is at the core of understanding and making intuitive sense of complex dynamic phenomena in chemistry. It is usually assumed that the slowest mode is the one of primary interest; however, it is critical to realize that this is not always so! A conceptual example is a protein-folding simulation (Lindorff-Larsen et al., Science 334, 517-520, 2011) in which the slowest dynamical mode is not the folding itself (see Figure). What, then, is the influence of “non-slowest” modes in this process, and how can it most appropriately be elucidated?

FIG: Figure 2 from the preprint: "(A) Sampled villin structures from the MD trajectory analyzed. Helical secondary structure is colored and coils are white. Each image represents five structures sampled from similar locations in TIC space as determined by a 250-center k-means model built upon the first three original TICs. The purple structure represents the folded state, and the blue structure represents the denatured state. The green structure is a rare helical misfolded state that we assert is an artifact. (B) Two-dimensional histograms for TICA transformations constructed from villin contact distances. Dashed lines indicate the regions corresponding to the sampled structures of the same color. The first TIC tracks the conversion to and from the rare artifact only. The second TIC tracks the majority of the folding process and correlates well with RMSD to the folded structure."



This work by Husic and Noé shows how deflation can provide an answer to these questions. Technically speaking, deflation is a family of methods for modifying a matrix, once its largest eigenvalue is known, so that the remaining ones can be found. In their example of the folding simulation, the dominant time-lagged independent component (TIC) encapsulates the "artifact" variation that we are not really interested in. A kinetic (Markov-state) model constructed on top of it will thus be contaminated in several undesirable ways, which the authors discuss in great detail.
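As a bare-bones illustration of the idea (a sketch for a plain symmetric matrix; the TICA setting in the preprint involves a generalized eigenproblem and needs correspondingly more care), one can find the leading eigenpair by power iteration and then deflate it away to expose the next one:

```python
# Hotelling deflation for a symmetric matrix: find the eigenpair of largest
# magnitude by power iteration, subtract its contribution, and repeat.
import numpy as np

def leading_eigenpair(A, n_iter=2000, seed=0):
    v = np.random.default_rng(seed).normal(size=A.shape[0])
    for _ in range(n_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v, v  # Rayleigh quotient gives the eigenvalue

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M + M.T  # symmetric test matrix

lam1, v1 = leading_eigenpair(A)
A_deflated = A - lam1 * np.outer(v1, v1)  # remove the known mode
lam2, v2 = leading_eigenpair(A_deflated)  # the next eigenvalue now dominates
print(lam1, lam2)
```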

In principle, this should be a very common problem, since chemical systems have complex Hamiltonians. Perhaps the reason we don’t see it discussed more is that ultra-rare events – real or artifact – are not usually sampled during conventional simulations. With the increasing computational power available to us, and simulations approaching ever-longer timescales, this is likely something we will need to be able to handle. This preprint describes well how one can think about attacking these potential difficulties.

Tuesday, March 19, 2019

Artificial Intelligence Assists Discovery of Reaction Coordinates and Mechanisms from Molecular Dynamics Simulations

Contributed by Jesper Madsen

Here, I highlight a recent preprint describing an application of Artificial Intelligence/Machine Learning (AI/ML) methods to problems in computational chemistry and physics. The group previously published the intrinsic map dynamics (iMapD) method, which I also highlighted here on Computational Chemistry Highlights. The basic idea in the previous study was to use an automated trajectory-based approach (as opposed to a collective-variable-based approach) to explore the free-energy surface of a computationally expensive Hamiltonian that describes a complex biochemical system.

Fig 1: Schematic flow chart of the AI-assisted MD simulation algorithm.


The innovation in their current approach is the combination of the sampling scheme, statistical inference, and deep learning into a framework where sampling and mechanistic interpretation happen simultaneously – an important milestone towards completely “autonomous production and interpretation of MD simulations of rare events,” as the authors themselves remark.
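The preprint has the details, but the overall shape of such a loop can be caricatured as follows. Everything in this sketch is a toy stand-in of my own (a random walk for "MD", PCA as the learned collective variable, extreme frames as new seeds); it is emphatically not the authors' algorithm, only the explore-learn-reseed pattern it instantiates:

```python
# Toy caricature of an adaptive "explore, learn, reseed" sampling loop.
import numpy as np

def run_md(x0, n_steps=200, rng=None):
    """Stand-in 'MD': a 2D random walk starting from x0."""
    rng = rng or np.random.default_rng()
    return x0 + np.cumsum(rng.normal(scale=0.05, size=(n_steps, 2)), axis=0)

def learn_cv(frames):
    """Stand-in CV learner: leading principal component of visited frames."""
    X = frames - frames.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]

def seed_on_boundary(frames, cv, n_seeds=4):
    """Restart from frames at the frontier of explored CV space."""
    order = np.argsort(frames @ cv)
    return frames[np.concatenate([order[: n_seeds // 2], order[-(n_seeds // 2):]])]

rng = np.random.default_rng(0)
seeds = np.zeros((4, 2))
for _ in range(5):
    frames = np.vstack([run_md(s, rng=rng) for s in seeds])
    cv = learn_cv(frames)           # interpretation happens alongside sampling
    seeds = seed_on_boundary(frames, cv)
print("learned CV direction:", cv)
```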

It is reassuring to see that the method correctly identifies known results for benchmark cases (the alanine dipeptide and LiCl dissociation) and out-competes traditional approaches such as transition path sampling in terms of efficiency. In these simple model cases, however, complexity is relatively low and sampling is cheap. I look forward to seeing the method applied to a much more complex problem in the future, e.g., one where ergodicity is a major issue or where other challenges, such as hysteresis, play a significant role.

Another much-appreciated aspect of general interest is the paper's practical approach to interpreting the constructed neural networks. All in all, there are many useful comments and observations in this preprint, and I recommend reading it thoroughly if you seek to apply modern AI-based methods to molecular simulations.

Wednesday, March 14, 2018

DeePCG: A Deep Neural Network Molecular Force Field


DeePCG: constructing coarse-grained models via deep neural networks. L Zhang, J Han, H Wang, R Car, Weinan E. arXiv:1802.08549v2 [physics.chem-ph]
Contributed by Jesper Madsen

The idea of “learning” a molecular force field (FF) using neural networks can be traced back to Blank et al. in 1995.[1] Modern variations (reviewed recently by Behler[2]), such as the DeePCG scheme[3] that I highlight here, seem to have two key innovations that set them apart from earlier work: network depth and atomic-environment descriptors. The latter was the topic of my recent highlight, and Zhang et al.[3] take advantage of similar ideas.
Figure 1: “Schematic plot of the neural network input for the environment of CG particle i, using water as an example. Red and white balls represent the oxygen and the hydrogen atoms of the microscopic system, respectively. Purple balls denote CG particles, which, in our example, are centered at the positions of the oxygens.” from ref. [3]
Zhang et al. simulate liquid water using ab initio molecular dynamics (AIMD) at the DFT/PBE0 level of theory in order to train a coarse-grained (CG) molecular water model. The training follows a standard CG protocol in which mean forces are fitted by minimizing a loss function (the natural choice is the residual sum of squares) over the sampled configurations. CGing liquid water is difficult because many-body contributions to the interactions are essential, especially so upon integrating out degrees of freedom. One would therefore expect a FF capable of capturing such many-body effects to perform well, just as DeePCG does, and I think this is a very nice example of exactly how much can be gained by using faithful representations of atomic neighborhoods instead of radially symmetric pair potentials. Recall that traditional force matching, while provably exact in the limit of the complete many-body expansion,[4] still shows non-negligible deviations from the target distributions for most simple liquids when standard approximations are used.
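To make the fitting protocol concrete, here is a minimal force-matching sketch (assuming PyTorch; note that the toy network acts on raw coordinates, whereas DeePCG of course feeds the network its symmetry-preserving environment descriptors):

```python
# Minimal force-matching sketch: a network maps a CG configuration to an
# energy, predicted forces are -dE/dx, and the loss is the residual sum of
# squares against the reference mean forces.
import torch

class CGEnergy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(3, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, 1),
        )

    def forward(self, coords):          # coords: (n_particles, 3)
        return self.net(coords).sum()   # total energy of the configuration

model = CGEnergy()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def force_matching_step(coords, ref_forces):
    coords = coords.clone().requires_grad_(True)
    energy = model(coords)
    forces = -torch.autograd.grad(energy, coords, create_graph=True)[0]
    loss = ((forces - ref_forces) ** 2).sum()  # residual sum of squares
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One step on fake data: 64 CG particles with random reference forces.
print(force_matching_step(torch.randn(64, 3), torch.randn(64, 3)))
```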

FF transferability, however, is likely where the current grand challenge is to be found. Zhang et al. remark that it would be convenient to have an accurate yet cheap (e.g., CG) model for describing phase transitions in water. They do not attempt this in the current preprint, and I suspect that it is not *that* easy to make a decent CG model that correctly captures subtle long-range correlations at various densities, let alone across different phases of water and ice, coexistences, interfaces, impurities (non-water moieties), etc. Machine-learnt potentials consistently demonstrate excellent accuracy over the parameterization space of states or configurations, but as for transferability and extrapolation, we are still waiting to see how far they can get.

References

[1] Neural network models of potential energy surfaces. TB Blank, SD Brown, AW Calhoun, DJ Doren. J Chem Phys 103, 4129 (1995)
[2] Perspective: Machine learning potentials for atomistic simulations. J Behler. J Chem Phys 145, 170901 (2016)
[3] DeePCG: constructing coarse-grained models via deep neural networks. L Zhang, J Han, H Wang, R Car, Weinan E. arXiv:1802.08549v2 [physics.chem-ph]
[4] The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models. WG Noid, J-W Chu, GS Ayton, V Krishna, S Izvekov, GA Voth, A Das, HC Andersen. J Chem Phys 128, 244114 (2008)