Monday, February 10, 2020

On the Completeness of Atomic Structure Representations

Here, I highlight an interesting recent preprint that tries to formalize and quantify something that I previously have posted here at Computational Chemistry Highlights (see the post on Atomistic Fingerprints here), namely how to best describe atomic environments in all their many-body glory. A widely held perception among practitioners of the "art" of molecular simulation is that while we usually restrict ourselves to 2-body effects for efficiency purposes, 3-body descriptions uniquely specify the atomic environment (up to a rotation and permutation of like atoms). Not the case (!) and the authors effectively debunk this belief with several concrete counter-examples. 

FIG. 1: "(a) Two structures with the same histogram of triangles; (angles 45, 45, 90, 135, 135, 180 degrees) (b) A manifold of degenerate pairs of environments: In addition to three points A,B,B′ a fourth point Cor C− is added leading to two degenerate environments, and − . (c) Degeneracies induce a transformation of feature space so that structures that should be far apart are brought close together."

Perhaps the most important implication of the work is that it in part helps us understand why modern machine-learning (ML) force fields appears to be so successful. At first sight the conclusion we face is daunting: for arbitrarily high accuracy, no n-point correlation cutoff may suffice to reconstruct the environment faithfully. Why, then, can recent ML force fields so accurately be used to calculate extensive properties such as the molecular energy? According to the results of Pozdnyakov, Willatt et al.'s work, low-correlation order representations often suffice in practice because, as they state, "the presence of many neighbors or of different species (that provide distinct “labels” to associate groups of distances and angles to specific atoms), and the possibility of using representations centred on nearby atoms to lift the degeneracy of environments reduces the detrimental effects of the lack of uniqueness of the power spectrum [the power spectrum is equivalent to the 3-body correlation, Madsen], when learning extensive properties such as the energy." However, the authors do suggest that introducing higher order invariants that lift the detrimental degeneracies might be a better approach in general. In any case, the preprint raises many technical and highly relevant issues; and it would be well worth going over if you don't mind getting in the weeds with Maths.