Monday, June 28, 2021

Bayesian optimization of nanoporous materials

Aryan Deshwal, Cory M. Simon, and Janardhan Rao Doppa (2021)
Highlighted by Jan Jensen

Figure 5 from the paper. (c) the authors. Reproduced under the CC-BY license.

This is another example of searching chemical space for systems with extreme property values by continuously updating a surrogate ML model of the property. I wrote about another such example, by Graff et al., here, but the main new thing here (IMO) is the low number of property evaluations needed to train the surrogate model.

The property of interest is the methane deliverable capacity (y) of covalent organic frameworks (COFs) which has been predicted by expensive MD calculations for ca 70,000 COFs. Ten randomly selected datapoints are used to train a Gaussian Process (GP) surrogate model. Bayesian optimisation (BO) is then used to identify the COF that is most likely to improve the surrogate model (based on the predicted y-value and the uncertainty of of the prediction), which is re-evaluated using MD. The MD value then added to the training set and the process is repeated for up to 500 steps. 

Already after 100 steps (110 MD evaluations including the initial training set), the best COF is identified as are 25% of the top-100 COFs, which is quite impressive. For comparison, the smallest training set in the previous study by Graff et al. is 100 and they need a training set of 300 to get to 25%. On the other hand, Graff et al. get up to ca 70% of the top 100 with a training set of 500, compared to ca 50% in this study (but the chemical space of Graff et al. is only 10,000 so it's a bit hard to compare).

The main lesson (IMO) is that's it's worth trying to start with very small training sets for these approaches.

This work is licensed under a Creative Commons Attribution 4.0 International License.


  1. Hi Jensen,

    Thanks for the nice summary.

    A small correction in "Bayesian optimisation (BO) is then used to identify the COF with the largest y-value, which is re-evaluated using MD."

    This is true only for the "pure exploitation" setting. BO selects the the COF with the highest utility function score by trading-off exploitation and exploration. We use expected improvement (EI) as our utility function in the experiments.

  2. Thank for your comment. I have updated the post accordingly.