A unified machine-learning protocol for asymmetric catalysis as a proof of concept demonstration using asymmetric hydrogenation

Edited by Michael W. Deem, Rice University, Houston, TX, and accepted by Editorial Board Member Tobin J. Marks December 10, 2019 (received for review September 24, 2019)
January 8, 2020
117 (3) 1339-1345


Development of suitable machine-learning (ML) approaches based on molecular descriptors can provide significant impetus to current efforts in asymmetric catalysis, wherein one strives to make a desired stereoisomer of a given handedness (enantiomer) in a highly selective manner. The proposed approach provides a sustainable model that trains on known catalysts and helps to predict the efficacy of additional catalysts for asymmetric synthesis, thereby expediting discovery at lower cost than traditional empirical methods. Training ML algorithms on the available known examples, using them to predict additional catalysts for subsequent in-line experimental validation, and retraining on the dataset augmented with the resulting measurements can provide a superior train–predict–train cycle suitable for accelerated reaction discovery.


Design of asymmetric catalysts generally involves time- and resource-intensive heuristic endeavors. In view of the steady increase in interest toward efficient catalytic asymmetric reactions and the rapid growth in the field of machine learning (ML) in recent years, we envisaged dovetailing these two important domains. We selected a set of quantum chemically derived molecular descriptors from five different asymmetric binaphthyl-derived catalyst families with the propensity to impact the enantioselectivity of asymmetric hydrogenation of alkenes and imines. The predictive power of the random forest (RF) built using the molecular parameters of a set of 368 substrate–catalyst combinations is found to be impressive, with a root-mean-square error (rmse) in the predicted enantiomeric excess (%ee) of about 8.4 ± 1.8 compared to the experimentally known values. The accuracy of RF is found to be superior to other ML methods such as convolutional neural network, decision tree, and eXtreme gradient boosting as well as stepwise linear regression. The proposed method is expected to provide a leap forward in the design of catalysts for asymmetric transformations.
The ever-growing requirement for high-purity chiral compounds in therapeutic applications serves as an impetus for continued development of asymmetric catalysts (1, 2).
Creation of new catalysts often involves tedious trial and error cycles driven by chemical intuition. Molecular insights on stereocontrol in transition states, obtained using modern electronic structure computations, have also contributed to this goal, although such methods are generally resource intensive (3, 4). Therefore, development of faster reliable catalyst design techniques that can guide future experiments is of high contemporary interest. Among the leading efforts, the application of mathematical modeling using multivariate stepwise linear regression (MLR) has recently emerged as a promising tool (5, 6). In particular, a fairly large number of molecular parameters are considered to identify the best regression equation that correlates important steric and electronic parameters to the stereochemical outcome. The tuning of the equations to obtain a compatible model in MLR can be challenging, as can incorporation of higher-order nonlinear terms. We believe that the burgeoning advances in the domain of machine learning could be exploited as a complementary optimization tool.
Diverse domains of modern science have already embraced the benefits of machine learning (ML) with an impressive degree of success (7–11). There have also been some important applications of ML in chemistry at large and catalytic reactions in particular (12–27), such as for predicting selectivities (28–36). We intend to design a practically useful ML protocol to be deployed for asymmetric hydrogenation of substrates bearing C=C and C=N bonds using axially chiral binaphthyl-derived catalyst families. Our ML approach makes use of molecular parameters of the catalysts and substrates that could have higher propensity to impact the enantioselectivity of the reaction. The chosen parameters are then used in random forest (RF) as well as other ML algorithms, which are well known for their predictive capabilities (37, 38). This study is a comprehensive application of a range of ML methods in asymmetric hydrogenation catalysis using enantiomeric excess (%ee) as the target value.

Choice of Reaction Type and Asymmetric Catalyst Families

A total of 368 known asymmetric hydrogenation reactions catalyzed by five different axially chiral binaphthyl catalyst families (encompassing BINOL-phosphite [L1], BINOL-phosphoramidite [L2], BINAP [L3], BINAP-O [L4], and BINOL-phosphoric acid [L5]) and a series of alkenes and imines were considered to examine the applicability of ML (Fig. 1A) (39–42) (see SI Appendix, Tables S1–S4 for more details of catalyst and substrate families). The chemical diversity of catalysts and substrates considered in this study can be gleaned from Fig. 1. Both the reaction type and the catalyst families have been widely applied (43, 44) in enantioselective synthesis of important pharmaceutical and agrochemical compounds (45–47). Although the number of samples appears rather modest from an ML perspective, the molecular parameters derived from each catalyst and reactant of this library proved sufficient to train the ML algorithms (see below).
Fig. 1.
Catalysts and substrates. (A) A generalized representation of catalyst families and substrates used in catalytic asymmetric hydrogenation. The number of reactions in each catalyst family is shown in parentheses. The number of catalysts and substrates in each case is given in square brackets. The atom-numbering scheme is shown for a representative member of the L1 family. A representative set of global molecular descriptors (e.g., sterimol parameters [L1, B1, B5] and rotational constants [RCx, RCy, RCz]) is shown. (B–D) Various substituents in (B) each catalyst family, (C) alkenes, and (D) imines are shown.
Recent density functional theory studies have explored the role of noncovalent interactions in the stereocontrolling transition states of several reactions (48), including organo-catalyzed and transition metal-catalyzed reactions of the binaphthyl family (4). Such molecular insights have highlighted the influence of geometric and electronic features of a catalyst in dictating the formation of the preferred enantiomer in asymmetric transformations. However, deconvolution of the subtle interdependencies of these features remains a major challenge.

Selection of Parameters for Machine Learning

We have chosen a set of molecular parameters for each catalyst and substrate from the respective minimum energy geometries obtained at the M06-2X/6-31G** level of theory in the condensed phase (49) (full details in SI Appendix, section V). These parameters include bond lengths (BL), bond angles (BA), dihedral angles (DA), distance between nonbonded atoms (NBL), and sterimol parameters that capture molecular dimension. The vibrational frequencies (VF) and the corresponding vibrational intensities (VI) of certain normal modes of vibration (5), chemical shifts (NMR), and charges (q) of various atoms obtained using the natural population analysis (NPA) were chosen as the electronic parameters. In addition to site-specific molecular properties, we have also included 22 global descriptors that represent certain overall molecular properties (e.g., HOMO and LUMO energies, dipole moment, polar surface area, etc.; SI Appendix, Table S5). The nature of the chosen parameters is critical to the quality of the model and its ability to predict the reaction outcome (14). Since we aim to develop an ML model suitable across five different axially chiral catalyst families, most parameters were chosen in such a way that they share equivalent/common core regions. The differences, such as those arising due to substituents, are treated using local parameters like sterimol (SI Appendix, section IV). A step-by-step tutorial for using the Python codes for various ML algorithms employed in this study is provided in SI Appendix, section XVII.
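As an illustration of how catalyst and substrate parameters might be combined into one sample, the following minimal Python sketch merges two descriptor dictionaries into a single feature row. The function names, the "-S" suffix convention for substrate descriptors, and all numerical values are assumptions made for this example; they are not taken from the study's codes.

```python
from typing import Dict, List, Tuple


def build_feature_row(catalyst: Dict[str, float],
                      substrate: Dict[str, float]) -> Dict[str, float]:
    """Merge catalyst and substrate descriptors into one sample row.

    Substrate names receive a "-S" suffix (an assumed convention) so that
    they cannot collide with catalyst descriptor names.
    """
    row = dict(catalyst)
    for name, value in substrate.items():
        row[f"{name}-S"] = value
    return row


def to_matrix(rows: List[Dict[str, float]]) -> Tuple[List[str], List[List[float]]]:
    """Turn sample rows into (column names, feature matrix) with a fixed column order."""
    columns = sorted(rows[0])
    for row in rows:
        if sorted(row) != columns:
            raise ValueError("every sample must carry the same descriptors")
    return columns, [[row[c] for c in columns] for row in rows]


# Placeholder values for illustration only, not computed descriptors.
catalyst = {"DA6-5-12-11": -55.0, "NMR11": 40.2}
substrate = {"VI12-16": 15.3}
columns, matrix = to_matrix([build_feature_row(catalyst, substrate)])
print(columns)
print(matrix)
```

Keeping a fixed column order matters because every substrate–catalyst combination must present the same descriptors in the same positions to the learning algorithm.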

Subset Details and Application of RF Method

The parameters mentioned above formed the necessary dataset for the construction of a random number of trees using the RF algorithm. The output of the RF is the desired quantity of interest, i.e., %ee. RF is an ensemble technique where multiple decision trees are combined to get better predictive performance. We built random forest regressors, with a split of 20% samples in the test and 80% in the training sets (shown as step 1 in Fig. 2A). To ensure the desirable technical rigor, 100 different test–train splits were constructed using random selection, and a fivefold cross-validation (see SI Appendix, section VI.2 for more details of cross-validation) for each of the 100 different training sets was carried out to identify the best hyperparameter combination (step 2 in Fig. 2A). The RF model built using this best hyperparameter combination was then used for the prediction on the out-of-bag samples from the test sets in each of the 100 runs corresponding, respectively, to the 100 test–train splits (step 3 in Fig. 2A). The average performance is measured and reported in terms of the root-mean-square error (rmse) over these 100 runs (see SI Appendix, Tables S16 and S17 for more details). The %ee of each substrate–catalyst combination is therefore predicted once or multiple times in this approach. The quality of a trained RF model was examined by comparing the predicted %ee with that of the experimental values for the samples in the test set, which were not part of the model building.
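The three steps described above can be sketched as follows. This is a minimal illustration that assumes scikit-learn and NumPy; the hyperparameter grid, the synthetic data, and the function names are placeholders chosen for the example and are not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split


def repeated_rf_rmse(X, y, n_runs=100, param_grid=None, seed=0):
    """Mean and spread of the test rmse over repeated random 80/20 splits."""
    if param_grid is None:
        # Assumed search grid; the grid used in the study is not reproduced here.
        param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
    rmses = []
    for run in range(n_runs):
        # Step 1: random 80/20 train-test split, different in every run.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed + run
        )
        # Step 2: fivefold cross-validation on the training set only.
        search = GridSearchCV(
            RandomForestRegressor(random_state=seed + run),
            param_grid,
            cv=5,
            scoring="neg_mean_squared_error",
        )
        search.fit(X_tr, y_tr)
        # Step 3: the refitted best model predicts the held-out test samples.
        pred = search.predict(X_te)
        rmses.append(float(np.sqrt(np.mean((y_te - pred) ** 2))))
    return float(np.mean(rmses)), float(np.std(rmses)), rmses


def make_demo_data(n_samples=60, n_features=5, seed=0):
    """Synthetic stand-in for the descriptor matrix and the %ee values."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    noise = rng.normal(scale=3.0, size=n_samples)
    y = np.clip(70 + 15 * X[:, 0] - 10 * X[:, 1] + noise, 0, 100)
    return X, y


X_demo, y_demo = make_demo_data()
mean_rmse, std_rmse, _ = repeated_rf_rmse(
    X_demo, y_demo, n_runs=3,
    param_grid={"n_estimators": [20], "max_depth": [3, None]},
)
print(f"test rmse over 3 runs: {mean_rmse:.1f} ± {std_rmse:.1f}")
```

Because the hyperparameter search sees only the training portion of each split, the test samples remain unseen until the final prediction, which is what makes the averaged rmse an estimate of out-of-sample performance.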
Fig. 2.
Subset details and performance of unified random forest. (A) A general representation of the common procedure for calculating the final rmse. (B) Different catalyst–substrate subsets and the corresponding rmse. (C) Absolute deviation between the experimental and predicted %ee for all of the 100 runs. Color shades of green, yellow, and red respectively depict the superior, moderate, and inferior quantitative agreements between the experimental values and that predicted by the RF. The best run, characterized by the lowest rmse, is shown within a white border. Absolute deviations between the experimental and RF predicted %ee in the range of 0 to 80 output values for all of the 100 runs are separately provided in SI Appendix, section XVII. (D) A zoomed-out representation of the best run to convey how many predictions are close (darker green boxes) to the actual values. (E) rmse comparison across various ML methods such as RF, DT, eXtreme GB, and CNN.
First, independent RF models were constructed for each of the five catalyst families. For instance, in the RF model for catalyst L1, only reactions corresponding to that family were considered for the formation of the training and test sets (Fig. 2B). The results of application of the RF models were encouraging for each of the five catalyst families, in that they showed a consistently low average deviation of just 5.4 in the predicted %ee for external samples. This indicates the capability of the RF to decipher the intricate relationship between the molecular parameters of the substrate–catalyst combination and the enantioselectivity. The trained RF models were able to capture the substrate diversity within each catalyst family. The rmses in the predicted %ee for the test sets compared to the experimental values were found to be 6.3 ± 2.4 (L1), 6.5 ± 1.3 (L2), 9.2 ± 3.3 (L3), 8.2 ± 4.6 (L4), and 7.1 ± 1.2 (L5) (Fig. 2B), which engenders good confidence in the RF models for various catalyst–substrate combinations within a catalyst family (see SI Appendix, Table S18 for more details).
Thus far, we have assessed the competence of our trained RF model to predict from within a set of experimentally known catalysts. In other words, the predicted %ee for a set of randomly selected catalyst–substrate combinations showed very good agreement with the reported experimental values. It is important to note that %ee for the test set was unknown to the RF. This situation, in spirit, is comparable to prediction ahead of experimental verification. Along these lines, a key question pertaining to the utility of RF in asymmetric catalysis is worth considering. This is to examine whether a unified RF for all five catalyst families could be developed to predict the %ee for any catalyst–substrate combination.

Generalization to Diverse Catalyst–Substrate Combinations Belonging to Different Catalyst Families

To develop a unified model, we considered catalyst–substrate combinations of different catalyst families toward building an additional set of RF models. The trained RF is then used in the prediction of %ee for an external test set drawn from more than one catalyst family. On the basis of the broad structural similarity (Fig. 2B) between L1 and L2, these families were bundled together first. In quantitative terms, an rmse of 8.5 ± 3.2 (for L1L2) was observed between the predicted and experimental %ees when the test set consisted of a randomized mix of catalyst–substrate samples drawn from the L1 and L2 families. The results here are more promising compared to our first approach where a different RF was developed for different catalyst families. The vital clue that an RF could be successful in dealing with such diversity in the catalyst–substrate combinations prompted us to probe the potential of our approach further.
Next, a more diverse set of substrate–catalyst combinations consisting of L1, L2, and L5 was considered for developing an additional RF model. The rmse in %ee for external predictions is 6.8 ± 0.8. Another RF model was built, by including examples from the BINAP-O catalyst family (L4), which offers a visible diversity compared to structurally similar L1, L2, and L5. This expanded training set also provided high accuracy in predictions with an rmse of 8.6 ± 2.0. In the final RF model, all 368 reactions were considered. This unified RF comprises 58 catalysts drawn from five different families and 190 diverse collections of alkenes and imines as the substrates. Gratifyingly, predictions were good with an rmse of only 8.4 ± 1.8 (Fig. 2 C and D). In every run, prediction for 73 samples was made, resulting in several thousand predictions over 100 runs (the same samples might have naturally been predicted multiple times). Most of these predictions are in excellent agreement with the experimental %ees, leading to a dominance of the green-colored pixels in Fig. 2C. In the best run (Fig. 2D), only 6 of 73 predictions showed an error in %ee of 15 or higher. Alternatively, an rmse of 8.4 ± 1.8 implies that %ees of an overwhelming majority of 65 samples of 73 in a typical run are within 10 units of the actual %ee. It is important to note that despite small sample size, our models performed very well for individual subsets and their combinations, wherein the sample size varies from as small as 39 to as big as 368.
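The error summaries quoted above (rmse per run, mean ± spread over runs, and counts of predictions within a given tolerance) can be computed as in the following sketch. The function names and the toy numbers are illustrative, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
import math
from typing import Sequence, Tuple


def rmse(experimental: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between experimental and predicted %ee."""
    if len(experimental) != len(predicted) or not experimental:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - p) ** 2 for e, p in zip(experimental, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def count_within(experimental: Sequence[float], predicted: Sequence[float],
                 tolerance: float) -> int:
    """Number of predictions whose absolute error is at most `tolerance`."""
    return sum(1 for e, p in zip(experimental, predicted) if abs(e - p) <= tolerance)


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation, e.g. of rmse over many runs."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


# Toy values for illustration only (absolute errors: 2, 5, 0, 15).
exp_ee = [90.0, 80.0, 95.0, 60.0]
pred_ee = [88.0, 85.0, 95.0, 75.0]
print(rmse(exp_ee, pred_ee))                  # sqrt(63.5), about 7.97
print(count_within(exp_ee, pred_ee, 10.0))    # 3 of 4 within 10 units
```

Note that the rmse weights large deviations more heavily than the mean absolute error does, so a handful of outliers can dominate it even when most predictions are close.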
In asymmetric catalysis, one strives for superior %ees, so reactions with low %ee, which experimental organic chemists regard as inferior results, are seldom reported; this leads to an acute shortage of lower target values in the data. To address such an inevitable class imbalance issue, we have also retrained our ML algorithms by additional inclusion of synthetic data, generated using the synthetic minority oversampling technique (SMOTE). New minority data (corresponding to the output %ee range where many fewer samples were initially present) are created between the existing minority data points (for more details, see SI Appendix, section VI.1). Interestingly, we could obtain excellent test and train rmses while using a pure dataset or with the inclusion of synthetic data with the original data. Nearly overlapping rmses for the test and train sets, for the top four best-performing methods such as the random forest, are evidence for minimal overfitting (SI Appendix, Fig. S4 and Tables S16 and S17). Further, the difference in rmse across different methods while using real and synthetic data together did not show any large variation compared to the use of only real data. This is encouraging in that for the current study, the class imbalance has not led to any notable issue in our predictive capabilities.
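The idea of creating new minority data between existing minority points can be sketched as below. This is a simplified, SMOTE-like interpolation and not the implementation used in this study: it uses only the single nearest neighbor rather than a random choice among k neighbors, and interpolating the %ee target along with the descriptors is an assumption made for this regression setting.

```python
import math
import random
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (descriptor vector, %ee)


def nearest_neighbor(index: int, samples: List[Sample]) -> int:
    """Index of the sample closest (Euclidean distance) to samples[index]."""
    xi = samples[index][0]
    best, best_dist = -1, math.inf
    for j, (xj, _) in enumerate(samples):
        if j == index:
            continue
        dist = math.dist(xi, xj)
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def oversample_minority(minority: List[Sample], n_new: int, seed: int = 0) -> List[Sample]:
    """Create n_new synthetic samples on segments joining minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = nearest_neighbor(i, minority)
        (xi, yi), (xj, yj) = minority[i], minority[j]
        t = rng.random()  # position along the segment, 0 <= t < 1
        x_new = [a + t * (b - a) for a, b in zip(xi, xj)]
        y_new = yi + t * (yj - yi)  # assumed: target interpolated like descriptors
        synthetic.append((x_new, y_new))
    return synthetic


# Toy low-%ee samples for illustration only.
low_ee = [([0.0, 0.0], 10.0), ([1.0, 1.0], 20.0), ([5.0, 5.0], 30.0)]
for x_new, y_new in oversample_minority(low_ee, n_new=3):
    print(x_new, y_new)
```

Synthetic samples of this kind should be added to the training set only; evaluating on interpolated points would overstate the model's performance.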
Next, a comparison of our RF model with other commonly used ML techniques was carried out to assess their relative performance (SI Appendix, Tables S6 and S17 and Figs. S1–S4). Comparable rmses of 9.2 ± 1.9, 9.6 ± 1.9, and 11.6 ± 2.8 were obtained, respectively, for decision tree (DT), eXtreme gradient boosting (GB), and convolutional neural networks (CNN) (Fig. 2E). These results suggest that the complex ML algorithms are able to decipher the interdependencies between the chemical descriptors of the diverse catalyst–substrate combinations and the enantioselectivity in asymmetric hydrogenation reactions.
In the developmental phase of asymmetric catalysis, it is desirable to have higher enantioselectivities, which are typically measured in terms of the area under the curve obtained using a suitable chiral column in a high-performance liquid chromatography run. Subsequent and more involved methods (e.g., X-ray crystallography) might be required to assign the absolute configuration of the newly generated stereogenic centers. We examined whether different ML approaches described in the previous sections to predict the extent of enantioselectivity could be used for predicting the absolute sense of stereoinduction as well. Indeed, a good accuracy of 84.29 ± 3.6% with RF was achieved. In the best run, an overwhelming majority of 66 samples of 69 were predicted correctly (SI Appendix, section XVIII).
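For reference, the standard relation between the two enantiomer peak areas from a chiral HPLC run and the %ee can be written as a short sketch. The function names and the R/S labeling of the peaks are illustrative; assigning which peak corresponds to which absolute configuration requires the separate methods mentioned above.

```python
def signed_ee(area_r: float, area_s: float) -> float:
    """Signed enantiomeric excess (%) from the two enantiomer peak areas.

    Positive values mean the enantiomer labeled R is in excess,
    negative values mean the enantiomer labeled S is in excess.
    """
    total = area_r + area_s
    if area_r < 0 or area_s < 0 or total <= 0:
        raise ValueError("peak areas must be non-negative and not both zero")
    return 100.0 * (area_r - area_s) / total


def major_enantiomer(area_r: float, area_s: float) -> str:
    """Label of the enantiomer in excess, or 'racemic' if the areas are equal."""
    ee = signed_ee(area_r, area_s)
    if ee > 0:
        return "R"
    if ee < 0:
        return "S"
    return "racemic"


print(abs(signed_ee(95.0, 5.0)))      # 90.0, the %ee used as regression target
print(major_enantiomer(5.0, 95.0))    # S, the label used for classification
```

The magnitude of the signed value is the %ee used as the regression target, whereas its sign corresponds to the sense of stereoinduction treated here as a classification problem.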
In an effort to assess the importance of such chemical descriptors in asymmetric hydrogenation, we employed a two-pronged approach. First, we randomly shuffled the chemical descriptors between samples to generate another dataset wherein none of the true descriptors of a given sample is associated with the sample itself (SI Appendix, section XIII.2). In other words, all of the chemical descriptors of a given sample actually belong to some other randomized sample. In another approach, we generated arbitrary descriptors that follow a normal distribution. None of the numerical values generated through this approach has any chemical significance or resemblance to the actual chemical descriptors (SI Appendix, section XIII). In both these approaches, substantially poorer performance was obtained compared to the originally employed chemically meaningful descriptors, thus serving as evidence of the importance of chemical descriptors that we have employed in this study (for more details, see SI Appendix, Tables S25 and S26). (An interesting example on the importance of chemical descriptors in a random forest study can be found in ref. 50.) We also performed correlation analysis on the feature matrix to identify the potential interdependencies between various features. After removing the correlated features, we trained another RF model with 60 features and the test rmse was found to be only marginally higher compared to the performance obtained using 101 features (for more details, see SI Appendix, Tables S21 and S23).
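The two control datasets described above can be generated along the following lines. This is a minimal sketch with illustrative function names; the zero mean and unit standard deviation of the random descriptors are assumptions, as the text specifies only that they follow a normal distribution.

```python
import random
from typing import List


def shuffle_descriptor_rows(X: List[List[float]], seed: int = 0) -> List[List[float]]:
    """Reassign descriptor rows so that no sample keeps its own row.

    Permutations are drawn until one without fixed points (a derangement)
    is found; roughly one in three random permutations qualifies.
    """
    n = len(X)
    if n < 2:
        raise ValueError("need at least two samples to shuffle")
    rng = random.Random(seed)
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return [list(X[p]) for p in perm]


def gaussian_descriptors(n_samples: int, n_features: int, seed: int = 0) -> List[List[float]]:
    """Descriptors with no chemical meaning, drawn from a normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]


# Toy descriptor matrix for illustration only.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
print(shuffle_descriptor_rows(X))
print(gaussian_descriptors(n_samples=2, n_features=3))
```

In both controls the %ee values stay attached to their original samples, so any drop in performance relative to the true descriptors reflects the information carried by the descriptors themselves.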
We believe that a synergistic combination of experiments and ML-based analysis can be exploited toward accelerating the ongoing developments in asymmetric catalysis. It would be interesting to probe whether the trained ML models are capable of predicting for unseen samples drawn from different axially chiral catalysts. To this end, we have chosen two additional catalyst–substrate combinations that resemble the original L2 and L3 series (SI Appendix, Table S28). One of these two groups of catalysts contains a binaphthyl backbone with fused saturated B rings (51) and the other is a biphenyl phosphine (52) (denoted, respectively, as L2′ and L3′ in Fig. 3A). The RF is now trained using all 368 samples (full dataset used in the previous calculations) and the trained RF model is used for predicting on 43 unseen samples, which form the additional test set (SI Appendix, Tables S29 and S30). As shown in Fig. 3B, very good predictions with an rmse of 8.5 ± 0.0 could be obtained for these sets of catalysts. In addition, we examined our ML model for across-catalyst class applicability by predicting for L4L5 catalysts (SI Appendix, Tables S30 and S31) by using training on L1L2L3 catalysts. Although lower predictive performance is noted in across-catalyst class trials, the origin of inferior predictions could be traced to certain outliers, which in turn are due to class imbalance in the data distribution. Interestingly, outlier removal (as low as 1 of 39 predictions and as high as 13 of 167 predictions) resulted in significantly improved predictions in the across-catalyst class as well. These results further endorse the general applicability of our ML model for axially chiral asymmetric hydrogenation catalysts.
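The across-catalyst class trial differs from the random splits used earlier in that whole families are withheld from training. A minimal sketch of such a family-based split follows; the record layout with "family" and "ee" keys is hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def split_by_family(records: List[Record],
                    train_families: Iterable[str]) -> Tuple[List[Record], List[Record]]:
    """Train on the listed catalyst families and test on all remaining ones."""
    allowed = set(train_families)
    train = [r for r in records if r["family"] in allowed]
    test = [r for r in records if r["family"] not in allowed]
    if not train or not test:
        raise ValueError("both the training and the test set must be non-empty")
    return train, test


# Toy records for illustration only.
records = [
    {"family": "L1", "ee": 92.0},
    {"family": "L2", "ee": 88.0},
    {"family": "L3", "ee": 75.0},
    {"family": "L4", "ee": 60.0},
    {"family": "L5", "ee": 95.0},
]
train, test = split_by_family(records, ["L1", "L2", "L3"])
print(sorted({r["family"] for r in test}))  # ['L4', 'L5']
```

Since no member of the test families is seen during training, this split probes transfer between catalyst classes rather than interpolation within them, which is consistent with the lower predictive performance noted above.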
Fig. 3.
Predictions on 43 out-of-bag samples using the random forest algorithm trained on the full dataset of 368 samples. (A) A generalized representation of catalysts and substrates used in the test set. (B) The difference between the reported experimentally observed and predicted percentage of enantiomeric excess across all test sets. Color shades of green, yellow, and red respectively depict the superior, moderate, and inferior quantitative agreements with experimentally reported %ee and that predicted by the RF.

Identification of Chemically Relevant Patterns Using Decision Tree

In view of the inherent black-box–like aspects in ML, a DT-based analysis (53, 54) was performed on all 368 reactions to examine whether we could derive better chemical insights. The rationale deserves special attention: enantioselectivity is affected by the interactions between the catalyst and substrate, irrespective of the mechanism of the reaction. The catalyst–substrate interactions are captured primarily through the geometric and electronic descriptors used in this study. Certain combinations of geometric and electronic features of the catalyst and/or substrate might have a higher impact on the %ee. This analysis can also provide logical guidelines on how subtle variations in the molecular features can help fine-tune the %ee. Following the standard procedure, 20% of samples were kept aside as the holdout set in each run wherein a critical hyperparameter such as “max_depth” was varied to identify the best tree (SI Appendix, Table S9). The best DT obtained from 100 independent runs is presented in Fig. 4.
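A single run of the depth scan described above might look as follows. This sketch assumes scikit-learn and NumPy; selecting the depth by the holdout rmse, the range of depths, the synthetic data, and the feature names are assumptions made for the example rather than details confirmed by the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text


def best_depth_tree(X, y, depths=(2, 3, 4, 5, 6), seed=0):
    """Fit one tree per max_depth and keep the one with the lowest holdout rmse."""
    X_tr, X_ho, y_tr, y_ho = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    best = None
    for depth in depths:
        tree = DecisionTreeRegressor(max_depth=depth, random_state=seed)
        tree.fit(X_tr, y_tr)
        err = float(np.sqrt(np.mean((y_ho - tree.predict(X_ho)) ** 2)))
        if best is None or err < best[0]:
            best = (err, depth, tree)
    return best


# Synthetic stand-in data with hypothetical feature names.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = np.where(X_demo[:, 0] > 0, 90.0, 60.0) + rng.normal(scale=2.0, size=80)
err, depth, tree = best_depth_tree(X_demo, y_demo)
print(f"best max_depth = {depth}, holdout rmse = {err:.1f}")
print(export_text(tree, feature_names=["feature_a", "feature_b", "feature_c"]))
```

Printing the fitted tree as text gives the sequence of threshold tests along each path, which is the form in which a DT can be read for chemical insight.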
Fig. 4.
Decision tree analysis for 368 reactions. The discriminating attribute at the higher level in a decision tree has a more pronounced impact on the outcome while the lower attributes tend to exert differing influences depending on the preceding set of attributes in that branch. The paths shown in red convey the combination of descriptors for the most promising substrate–catalyst combinations. Refer to Fig. 1A for atom numbering. The parameter values are in their respective standard units.
Certain interesting details emerged from the DT analysis. The appearance of vibrational intensity of the substrate (VI12-16-S) as the root node conveys the critical importance of the electronic parameter on the %ee. This parameter can be varied by way of introducing suitable substituents on the alkene/imine moiety. Another aspect of this DT study relates to the path shown using the red line in Fig. 4, which is intended to highlight how %ee of an out-of-bag sample could be predicted by examining its molecular parameters. For instance, if a substrate has its C=C or C=N stretching intensity (VI12-16-S) greater than 14.1 and the 13C chemical shift (NMR11) less than 39.6, it is likely to yield high %ee (>92%). The next important decision node is the biaryl dihedral angle (DA6-5-12-11) followed by (VI5-12); both these features home in on the region of axial chirality of the binaphthyl core. More importantly, dihedral angles such as (DA6-5-12-11) can be tuned through suitable substituents to modulate the extent of enantioselectivity (48). An interesting bond angle (BA5-6-34) is identified that offers high %ee when its values are greater than 118.6. This is a very important feature considering that the present study encompasses five different catalysts. The last two nodes in the decision tree are the bond angle (BA12-11-16) and volume, both belonging to the catalyst. The emergence of the electronic parameters of the catalyst ((NMR11), (VI5-12), (BA12-11-16) and (volume)) in conjunction with that of the substrate ((VI12-16-S), (dipole-moment S)) can be regarded as an indication of the importance of catalyst–substrate interactions in the stereocontrolling transition states (55). We also performed the DT analysis with 60 features obtained after the correlation analysis. It is interesting to note that the features appearing as important in the DT are not among the correlated features removed from the original set of 101 (SI Appendix, Table S24).
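Reading a path of the DT amounts to checking a sequence of threshold conditions. The sketch below encodes only the two conditions quoted above (VI12-16-S greater than 14.1 and NMR11 less than 39.6); the function name, the data layout, and the sample values are illustrative, and the remaining nodes of the red path in Fig. 4 are not reproduced.

```python
import operator
from typing import Callable, Dict, List, Tuple

Condition = Tuple[str, Callable[[float, float], bool], float]

# The two thresholds quoted in the text for the high-%ee path.
HIGH_EE_PATH: List[Condition] = [
    ("VI12-16-S", operator.gt, 14.1),
    ("NMR11", operator.lt, 39.6),
]


def follows_path(sample: Dict[str, float], path: List[Condition]) -> bool:
    """True if the sample satisfies every threshold test along the path."""
    return all(compare(sample[name], threshold) for name, compare, threshold in path)


# Placeholder descriptor values for illustration only.
sample = {"VI12-16-S": 15.0, "NMR11": 38.0}
print(follows_path(sample, HIGH_EE_PATH))  # True
```

Expressing a path as data rather than as nested conditionals makes it straightforward to screen a library of candidate substrate–catalyst combinations against any branch of the tree.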
Other analyses to gauge the relative importance of various molecular parameters by using parameter ranking (SI Appendix, Tables S19 and S20 and Figs. S5–S13) and partial least-squares (PLS) analysis identified almost the same set of parameters as the high-ranked ones that appeared as important nodes in DT (SI Appendix, Fig. S22) (DTs for subsets are provided in SI Appendix, Figs. S14–S21). While we note that the interpretation of important features and their implications in %ee, mentioned above, is broadly useful, the identification of the most appropriate combination of features can become insurmountable for a given system. Hence, we acknowledge that the lack of explicit inclusion of a causative feature might even result in judging another feature, which is correlated to the causative feature, as important. In such situations, physical interpretation of those correlated features would be convoluted.


We have demonstrated a proof-of-concept application of machine learning in the domain of asymmetric catalysis. Harnessing this high-throughput method, the discovery of asymmetric catalysts could be accelerated by using ML tools like random forest and decision tree. The RF was able to make accurate prediction of enantioselectivities with an rmse in %ee of just 8.4 ± 1.8 compared to the experimental values, indicating its practical utility toward identifying lead candidates for catalytic asymmetric applications. The trained RF has been able to predict correctly a whole range of %ees in complex situations when both substrate and catalyst are from outside the training set, thus indicating a scope for a potential breakthrough in the discovery of asymmetric catalysts as well as in making an informed choice of substrates for a particular catalyst. The RF approach could have far-reaching implications in studying and expanding the asymmetric catalyst and substrate libraries. We believe that the leads emerging through machine learning on what combination of catalysts and substrate(s) has better propensity to be successful could be coupled with automated experimental protocols. Our approach can also be exploited in a broad range of asymmetric reactions and thus can open up promising avenues toward cost-effective and efficient design of asymmetric catalysts.

Data Availability

All data discussed in this paper are available to readers.

Data deposition: The data reported in this paper have been deposited in GitHub, https://github.com/Sunojlab/ML-for-Asymmetric-Catalysis.


Generous computing time from SpaceTime supercomputing facility at Indian Institute of Technology (IIT) Bombay is acknowledged. M.P. is grateful to Council of Scientific and Industrial Research, New Delhi and A.C. and B.B. acknowledge University Grants Commission, New Delhi for Senior Research Fellowships. We acknowledge Prof. Preethi Jyothi (Department of Computer Science and Engineering, IIT Bombay) and Soumi Tribedi (Department of Chemistry, IIT Bombay) for valuable discussions and automation of parameter extraction, respectively, during the course of this project.

Supporting Information

Appendix (PDF)
Dataset_S01 (XLSX)


P. J. Walsh, M. C. Kozlowski, Fundamentals of Asymmetric Catalysis (University Science Books, 2008).
M. S. Taylor, E. N. Jacobsen, Asymmetric catalysis in complex target synthesis. Proc. Natl. Acad. Sci. U.S.A. 101, 5368–5373 (2004).
M. S. Sigman, K. C. Harper, E. N. Bess, A. Milo, The development of multidimensional analysis tools for asymmetric catalysis and beyond. Acc. Chem. Res. 49, 1292–1301 (2016).
Y. H. Lam, M. N. Grayson, M. C. Holland, A. Simon, K. N. Houk, Theory and modeling of asymmetric catalytic reactions. Acc. Chem. Res. 49, 750–762 (2016).
J. P. Reid, M. S. Sigman, Comparing quantitative prediction methods for the discovery of small-molecule chiral catalysts. Nat. Rev. Chem. 2, 290–305 (2018).
A. Milo, A. J. Neel, F. D. Toste, M. S. Sigman, Organic chemistry. A data-intensive approach to mechanistic elucidation applied to chiral anion catalysis. Science 347, 737–743 (2015).
Z. W. Ulissi et al., Machine-learning methods enable exhaustive searches for active bimetallic facets and reveal active site motifs for CO2 reduction. ACS Catal. 7, 6600–6608 (2017).
K. Tran, Z. W. Ulissi, Active learning across intermetallics to guide discovery of electrocatalysts for CO2 reduction and H2 evolution. Nat. Catal. 1, 696–703 (2018).
B. Sanchez-Lengeling, A. Aspuru-Guzik, Inverse molecular design using machine learning: Generative models for matter engineering. Science 361, 360–365 (2018).
S. M. Moosavi et al., Capturing chemical intuition in synthesis of metal-organic frameworks. Nat. Commun. 10, 539 (2019).
P. V. Balachandran, B. Kowalski, A. Sehirlioglu, T. Lookman, Experimental search for high-temperature ferroelectric perovskites guided by two-step machine learning. Nat. Commun. 9, 1668 (2018).
D. T. Ahneman, J. G. Estrada, S. Lin, S. D. Dreher, A. G. Doyle, Predicting reaction performance in C-N cross-coupling using machine learning. Science 360, 186–190 (2018).
J. M. Granda, L. Donina, V. Dragone, D. L. Long, L. Cronin, Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature 559, 377–381 (2018).
G. Skoraczyński et al., Predicting the outcomes of organic reactions via machine learning: Are current descriptors sufficient? Sci. Rep. 7, 3582 (2017).
Y. Zhuo, A. Mansouri Tehrani, A. O. Oliynyk, A. C. Duke, J. Brgoch, Identifying an efficient, thermally robust inorganic phosphor host via machine learning. Nat. Commun. 9, 4377 (2018).
J. N. Wei, D. Duvenaud, A. Aspuru-Guzik, Neural networks for the prediction of organic chemistry reactions. ACS Cent. Sci. 2, 725–732 (2016).
Z. W. Ulissi, A. J. Medford, T. Bligaard, J. K. Nørskov, To address surface reaction network complexity using scaling relations machine learning and DFT calculations. Nat. Commun. 8, 14621 (2017).
J. R. Kitchin, Machine learning in catalysis. Nat. Catal. 1, 230–232 (2018).
M. I. Jordan, T. M. Mitchell, Machine learning: Trends, perspectives, and prospects. Science 349, 255–260 (2015).
K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, A. Walsh, Machine learning for molecular and materials science. Nature 559, 547–555 (2018).
F. Brockherde et al., Bypassing the Kohn-Sham equations with machine learning. Nat. Commun. 8, 872 (2017).
Z. Zhou, X. Li, R. N. Zare, Optimizing chemical reactions with deep reinforcement learning. ACS Cent. Sci. 3, 1337–1344 (2017).
R. Gómez-Bombarelli et al., Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
S. Szymkuć et al., Computer-assisted synthetic planning: The end of the beginning. Angew. Chem. Int. Ed. Engl. 55, 5904–5937 (2016).
B. Liu et al., Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent. Sci. 3, 1103–1113 (2017).
M. H. S. Segler, M. Preuss, M. P. Waller, Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
P. S. Gromski, A. B. Henson, J. M. Granda, L. Cronin, How to explore chemical space using algorithms and automation. Nat. Rev. Chem. 3, 119–128 (2019).
A. F. Zahrt et al., Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning. Science 363, eaau5631 (2019).
A. Tomberg, M. J. Johansson, P. O. Norrby, A predictive tool for electrophilic aromatic substitutions using machine learning. J. Org. Chem. 84, 4695–4703 (2019).
J. Aires-de-Sousa, J. Gasteiger, New description of molecular chirality and its application to the prediction of the preferred enantiomer in stereoselective reactions. J. Chem. Inf. Comput. Sci. 41, 369–375 (2001).
J. Aires-de-Sousa, J. Gasteiger, Prediction of enantiomeric excess in a combinatorial library of catalytic enantioselective reactions. J. Comb. Chem. 7, 298–301 (2005).
J. Chen, W. Jiwu, L. Mingzong, T. You, Calculation on enantiomeric excess of catalytic asymmetric reactions of diethylzinc addition to aldehydes with topological indices and artificial neural network. J. Mol. Catal. A Chem. 258, 191–197 (2006).
Q. Y. Zhang, D. D. Zhang, J. Y. Li, H. L. Long, L. Xu, Prediction of enantiomeric excess in a catalytic process: A chemoinformatics approach using chirality codes. MATCH Commun. Math. Comput. Chem. 67, 773–786 (2012).
P. J. Donoghue, P. Helquist, P. O. Norrby, O. Wiest, Prediction of enantioselectivity in rhodium catalyzed hydrogenations. J. Am. Chem. Soc. 131, 410–411 (2009).
W. Beker, E. P. Gajewska, T. Badowski, B. A. Grzybowski, Prediction of major regio-, site-, and diastereoisomers in Diels–Alder reactions by using machine-learning: The importance of physically meaningful descriptors. Angew. Chem. Int. Ed. Engl. 58, 4515–4519 (2019).
A. F. Zahrt, S. E. Denmark, Evaluating continuous chirality measure as a 3D descriptor in chemoinformatics applied to asymmetric catalysis. Tetrahedron 75, 1841–1851 (2019).
Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015).
V. Svetnik et al., Random forest: A classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 43, 1947–1958 (2003).
W. Tang, X. Zhang, New chiral phosphorus ligands for enantioselective hydrogenation. Chem. Rev. 103, 3029–3070 (2003).
J. P. Reid, L. Simón, J. M. Goodman, A practical guide for predicting the stereochemistry of bifunctional phosphoric acid catalyzed reactions of imines. Acc. Chem. Res. 49, 1029–1041 (2016).
M. T. Reetz, G. Mehler, Highly enantioselective Rh-catalyzed hydrogenation reactions based on chiral monophosphite ligands. Angew. Chem. Int. Ed. Engl. 39, 3889–3890 (2000).
A. J. Minnaard, B. L. Feringa, L. Lefort, J. G. de Vries, Asymmetric hydrogenation using monodentate phosphoramidite ligands. Acc. Chem. Res. 40, 1267–1277 (2007).
J. F. Teichert, B. L. Feringa, Phosphoramidites: Privileged ligands in asymmetric catalysis. Angew. Chem. Int. Ed. Engl. 49, 2486–2528 (2010).
J. A. F. Boogers et al., A mixed-ligand approach enables the asymmetric hydrogenation of an α-isopropylcinnamic acid en route to the renin inhibitor aliskiren. Org. Process Res. Dev. 11, 585–591 (2007).
P. Etayo, A. Vidal-Ferran, Rhodium-catalysed asymmetric hydrogenation as a valuable synthetic tool for the preparation of chiral drugs. Chem. Soc. Rev. 42, 728–754 (2013).
D. J. Ager, A. H. M. de Vries, J. G. de Vries, Asymmetric homogeneous hydrogenations at scale. Chem. Soc. Rev. 41, 3340–3380 (2012).
P. C. J. Kamer, P. W. N. M. van Leeuwen, Phosphorus(III) Ligands in Homogeneous Catalysis: Design and Synthesis (Wiley-VCH, 2012).
S. E. Wheeler, T. J. Seguin, Y. Guan, A. C. Doney, Noncovalent interactions in organocatalysis and the prospect of computational catalyst design. Acc. Chem. Res. 49, 1061–1069 (2016).
M. J. Frisch et al., Gaussian 09, D.01 (Gaussian, Wallingford, CT, 2009).
J. G. Estrada, D. T. Ahneman, R. P. Sheridan, S. D. Dreher, A. G. Doyle, Response to Comment on “Predicting reaction performance in C–N cross-coupling using machine learning.” Science 362, eaat8763 (2018).
D. J. Nelson, R. Li, C. Brammer, Using correlations to compare additions to alkenes: Homogeneous hydrogenation by using Wilkinson’s catalyst. J. Org. Chem. 70, 761–767 (2005).
T. Morimoto, K. Yoshikawa, M. Murata, N. Yamamoto, K. Achiwa, Preparation of axially chiral biphenyl diphosphine ligands and their application in asymmetric hydrogenation. Chem. Pharm. Bull. (Tokyo) 52, 1445–1450 (2004).
P. Raccuglia et al., Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
L. Breiman, Random forests. Mach. Learn. 45, 5–32 (2001).
T. Lu, S. E. Wheeler, Organic chemistry. Harnessing weak interactions for enantioselective catalysis. Science 347, 719–720 (2015).

Published in

Proceedings of the National Academy of Sciences
Vol. 117 | No. 3
January 21, 2020
PubMed: 31915295


Data Availability

Data deposition: The data reported in this paper have been deposited in GitHub, https://github.com/Sunojlab/ML-for-Asymmetric-Catalysis.

Submission history

Published online: January 8, 2020
Published in issue: January 21, 2020


Keywords

  1. asymmetric catalysis
  2. machine learning
  3. computational chemistry


Acknowledgments

Generous computing time from the SpaceTime supercomputing facility at the Indian Institute of Technology (IIT) Bombay is acknowledged. M.P. is grateful to the Council of Scientific and Industrial Research, New Delhi, and A.C. and B.B. acknowledge the University Grants Commission, New Delhi, for Senior Research Fellowships. We thank Prof. Preethi Jyothi (Department of Computer Science and Engineering, IIT Bombay) for valuable discussions and Soumi Tribedi (Department of Chemistry, IIT Bombay) for automating parameter extraction during the course of this project.


This article is a PNAS Direct Submission. M.W.D. is a guest editor invited by the Editorial Board.



Sukriti Singh1
Department of Chemistry, Indian Institute of Technology Bombay, Powai, 400076 Mumbai, India;
Monika Pareek1
Department of Chemistry, Indian Institute of Technology Bombay, Powai, 400076 Mumbai, India;
Avtar Changotra1
Department of Chemistry, Indian Institute of Technology Bombay, Powai, 400076 Mumbai, India;
Department of Chemistry, Indian Institute of Technology Bombay, Powai, 400076 Mumbai, India;
Bangaru Bhaskararao
Department of Chemistry, Indian Institute of Technology Bombay, Powai, 400076 Mumbai, India;
P. Balamurugan2 [email protected]
Industrial Engineering and Operations Research, Indian Institute of Technology Bombay, Powai, 400076 Mumbai, India
Department of Chemistry, Indian Institute of Technology Bombay, Powai, 400076 Mumbai, India;


To whom correspondence may be addressed. Email: [email protected] or [email protected].
Author contributions: P.B. and R.B.S. designed research; S.S., M.P., A.C., S.B., and B.B. performed research; S.S., M.P., A.C., and S.B. analyzed data; and S.S., M.P., A.C., S.B., P.B., and R.B.S. wrote the paper.
S.S., M.P., and A.C. contributed equally to this work.

Competing Interests

The authors declare no competing interest.
