Saturday, June 27, 2026

Developing Pharmaceutically Relevant Pd-Catalyzed C−N Coupling Reactivity Models Leveraging High-Throughput Experimentation

Seung Kyun Ha, Dipannita Kalyani, Michael S. West, Jessica Xu, Yu-hong Lam, Thomas Struble, Spencer Dreher, Shane W. Krska, Stephen L. Buchwald, and Klavs F. Jensen (2025)
Highlighted by Jan Jensen

Yield prediction is one of the most difficult and important challenges for machine learning applied to chemistry. This paper is a useful contribution because it provides a relatively large and systematic high-throughput dataset of ca. 4000 Pd-catalyzed C−N coupling reactions, spanning a wide variety of secondary amines and aryl bromides relevant to medicinal chemistry.

One important caveat is that the study does not address full reaction-condition optimization. All scope reactions are run using a single set of reaction conditions: one catalyst, one base, and one solvent system. The task is therefore better described as substrate-scope prediction under fixed conditions, rather than general prediction of reaction yield across arbitrary reaction conditions.

Significantly, the authors provide a useful reality check on the quality of yield data. For 32 repeated reactions, the measured product Liquid Chromatography Area Percent (LCAP) values correlate poorly, with R² = 0.35. This experimental variability motivates their decision to treat the problem as binary classification rather than regression. A threshold of 20% product LCAP is chosen to define a “successful” reaction, and under this classification scheme the repeated reactions are consistent in 27 of the 32 cases. This supports a broader cautionary point: if yields from carefully controlled HTE experiments are already noisy at the level of absolute values, then predicting precise yield values from heterogeneous literature or web-scraped data is likely to be extremely difficult, and perhaps unrealistic in many settings.
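To make the regression-versus-classification point concrete, here is a minimal sketch of how replicate agreement can be computed once LCAP values are binarized at 20%. The replicate pairs are invented for illustration and are not data from the paper; the function names are hypothetical, the use of a strict ">" at the threshold is an assumption, and R² is computed here as a squared Pearson correlation, which may differ from the definition the authors used.

```python
# Hypothetical replicate pairs of product LCAP (%). These are NOT from the paper.
pairs = [(5, 12), (3, 18), (45, 25), (60, 30), (22, 15), (8, 2)]

THRESHOLD = 20.0  # LCAP cutoff for a "successful" reaction, as stated in the text


def is_success(lcap, threshold=THRESHOLD):
    """Binarize a product LCAP value (assumption: strictly above the cutoff)."""
    return lcap > threshold


def replicate_agreement(pairs, threshold=THRESHOLD):
    """Fraction of replicate pairs that receive the same binary label."""
    same = sum(is_success(a, threshold) == is_success(b, threshold) for a, b in pairs)
    return same / len(pairs)


def r_squared(pairs):
    """Squared Pearson correlation between the two replicate measurements."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


toy_agreement = replicate_agreement(pairs)  # agreement on the invented pairs
toy_r2 = r_squared(pairs)                   # correlation on the invented pairs

# Arithmetic on the numbers quoted in the text: 27 consistent cases out of 32.
reported_agreement = 27 / 32
```

With the figures quoted above, 27 of 32 corresponds to roughly 84% label agreement, even though the continuous values correlate poorly.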

The authors construct four different test sets to assess whether ML models can extrapolate to unseen amines (Amine OOS), unseen aryl bromides (ArX OOS), or both (Both OOS), in addition to standard interpolation (DRS(n), where n is the percentage of the dataset used for training).
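The difference between these splits can be sketched in a few lines. The code below is only an illustration of the general idea of holding out whole substrates, not the authors' procedure: the function names, the dictionary keys, the 25% hold-out fraction, and the choice to discard "mixed" reactions in the Both case (one held-out and one training substrate) are all assumptions.

```python
import random


def held_out(values, fraction, rng):
    """Pick a random subset of the unique substrates to hold out."""
    unique = sorted(set(values))
    rng.shuffle(unique)
    n = max(1, round(fraction * len(unique)))
    return set(unique[:n])


def oos_split(reactions, mode, fraction=0.25, seed=0):
    """Split reactions so that test substrates never appear in training.

    mode: "amine", "arx", or "both" (hypothetical labels for the three OOS splits).
    """
    if mode not in ("amine", "arx", "both"):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    test_amines = held_out([r["amine"] for r in reactions], fraction, rng) \
        if mode in ("amine", "both") else set()
    test_arx = held_out([r["arx"] for r in reactions], fraction, rng) \
        if mode in ("arx", "both") else set()

    train, test = [], []
    for r in reactions:
        a_out = r["amine"] in test_amines
        x_out = r["arx"] in test_arx
        if mode == "amine":
            (test if a_out else train).append(r)
        elif mode == "arx":
            (test if x_out else train).append(r)
        elif a_out and x_out:
            test.append(r)       # both components unseen
        elif not a_out and not x_out:
            train.append(r)      # both components available for training
        # Mixed reactions are dropped here (an assumption of this sketch).
    return train, test


# Toy 4 x 4 grid of placeholder substrate names.
reactions = [{"amine": f"amine_{i}", "arx": f"arx_{j}"}
             for i in range(4) for j in range(4)]
train_amine, test_amine = oos_split(reactions, "amine")
train_both, test_both = oos_split(reactions, "both")
```

A random split of the same grid would instead place reactions of every substrate in both sets, which is why it tests interpolation rather than extrapolation.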

The authors compare several model classes and molecular representations, including random forests, decision trees, AdaBoost, fully connected neural networks, and MPNNs using Chemprop. Input features include one-hot encodings, Morgan fingerprints, quantum-mechanical fingerprints, molecular graphs, and combinations of these. Overall, the best models are usually either random forests with fingerprint-based descriptors, sometimes augmented with QM descriptors for the reacting components, or MPNNs. However, the optimal model and representation depend on the data split, which is itself an important result: there is no single universally best model for all generalization tasks.

The best model for each split is then used to design a corresponding prospective validation library of 96 reactions. The models are first retrained on the full experimental dataset using the best architecture, input features, and hyperparameters identified from the retrospective modeling. For the DRS validation library, the DRS25 settings are used. Each validation library is constructed so that approximately half of the reactions are predicted to give >20% LCAP and half are predicted to give <20% LCAP. The confidence threshold is >0.9 for the Amine OOS, ArX OOS, and DRS libraries, and >0.8 for the Both OOS library. For OOS amines or aryl halides, the selected substrates must also have a maximum Tanimoto similarity <0.7 to the corresponding substrates used in the model-building dataset. Thus, the validation libraries are not random samples of chemical space; they are enriched for reactions where the model is sufficiently confident.
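The two selection filters described above (prediction confidence and maximum Tanimoto similarity to the training substrates) can be sketched as follows. This is a simplified illustration under stated assumptions: fingerprints are represented as plain Python sets of on-bit indices rather than computed with a cheminformatics toolkit, "confidence" is taken to be the predicted probability of the predicted class, and all names and example values are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_similarity(fp, training_fps):
    """Highest Tanimoto similarity of one fingerprint to any training fingerprint."""
    return max((tanimoto(fp, t) for t in training_fps), default=0.0)


def select_candidates(candidates, training_fps, conf_threshold=0.9, sim_cutoff=0.7):
    """Keep confident predictions on substrates dissimilar to the training set.

    Returns (predicted successes, predicted failures).
    """
    positives, negatives = [], []
    for c in candidates:
        if max_similarity(c["fp"], training_fps) >= sim_cutoff:
            continue  # too similar to a training substrate
        p = c["p_success"]  # assumed: model probability of >20% LCAP
        if p > conf_threshold:
            positives.append(c)
        elif 1.0 - p > conf_threshold:
            negatives.append(c)
    return positives, negatives


# Invented example data, for illustration only.
training_fps = [{1, 2, 3, 4}]
candidates = [
    {"name": "A", "fp": {1, 2, 3, 4, 5}, "p_success": 0.99},  # similarity 0.8, rejected
    {"name": "B", "fp": {1, 2, 7, 8}, "p_success": 0.95},     # confident success
    {"name": "C", "fp": {9, 10}, "p_success": 0.05},          # confident failure
    {"name": "D", "fp": {1, 9}, "p_success": 0.60},           # not confident enough
]
positives, negatives = select_candidates(candidates, training_fps)
```

A balanced library would then be drawn by sampling equal numbers from the two returned lists, which is only possible when the model makes enough confident predictions of both kinds.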

The prospective validation results are impressive. For the Amine OOS library, the RF model gives 11 false positives and no false negatives. For the ArX OOS library, the MPNN gives 3 false positives and 2 false negatives. For the Both OOS library, the RF model performs less well but still gives useful enrichment, with most errors arising from false positives rather than false negatives. For the DRS25 library, the RF model performs extremely well, with essentially perfect precision and only one false negative. Overall, the models are especially good at avoiding false negatives, which is important in a medicinal chemistry setting because false negatives could cause chemists to discard reactions that would actually work.
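To put the error counts in context, the sketch below converts false-positive and false-negative counts into precision, negative predictive value, and recall. It assumes each 96-reaction library contains exactly 48 predicted successes and 48 predicted failures; the source only says "approximately half", so the resulting numbers are rough estimates and not values reported in the paper. The function name is hypothetical.

```python
def summarize(n_pred_pos, n_pred_neg, false_pos, false_neg):
    """Derive simple classification metrics from library size and error counts."""
    tp = n_pred_pos - false_pos   # predicted successes that worked
    tn = n_pred_neg - false_neg   # predicted failures that failed
    return {
        "tp": tp,
        "tn": tn,
        "precision": tp / n_pred_pos,    # reliability of a predicted success
        "npv": tn / n_pred_neg,          # reliability of a predicted failure
        "recall": tp / (tp + false_neg), # fraction of working reactions recovered
    }


# ASSUMPTION: an exact 48/48 split of each 96-reaction library.
amine_oos = summarize(48, 48, false_pos=11, false_neg=0)
arx_oos = summarize(48, 48, false_pos=3, false_neg=2)

for name, stats in [("Amine OOS", amine_oos), ("ArX OOS", arx_oos)]:
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Under this assumption, the Amine OOS library would have a precision of about 0.77 with a negative predictive value of 1.0, while the ArX OOS library would be near 0.94 and 0.96, respectively.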

Having said that, this study represents something close to a best-case scenario for reaction-outcome prediction. The dataset is large by the standards of synthetic chemistry, with around 4000 systematically generated reactions. The reactions are all run under the same conditions, reducing experimental heterogeneity. The positive rate is also relatively high: about 35% of the reactions exceed the 20% LCAP threshold. This makes the classification task easier than many realistic discovery settings where successful reactions are much rarer. Finally, because the dataset is large and the hit rate is high, the models can make a substantial number of high-confidence predictions, which enables the construction of balanced validation libraries with 50% predicted successes and 50% predicted failures. In smaller, noisier, or more imbalanced datasets, this level of prospective performance would likely be much harder to achieve.


This work is licensed under a Creative Commons Attribution 4.0 International License.