Machine-learning approach to the design of OSDAs for zeolite beta

Significance Zeolite beta is one of the top six zeolites of commercial interest. It has been synthesized through the use of a number of organic structure-directing agents (OSDAs). Pure zeolite beta A has not yet been synthesized, nor has chiral zeolite beta A. We here report a machine-learning strategy to aid the computational design of chemically synthesizable OSDAs for zeolite beta A. The use of machine learning speeds up the computation by a factor of 350. Through de novo materials design runs, a total of 3,062 promising OSDAs were identified, and 469 OSDAs were computed to stabilize the structure of zeolite beta A better than known compounds.


Neural network
We used machine learning to relate the three-dimensional structures of BEA OSDAs to their stabilization energies. A neural network was trained on descriptors of OSDA molecular structure to predict stabilization energy. The structure of the neural network is shown in Figure S1A. There was one hidden layer between the input and output layers. Sigmoid activation was adopted on both the hidden layer and the output layer. Given the simple structure of the neural network, it could be trained in a very short time. The trained neural network predicts stabilization energies much faster than the MD calculation does, enabling much more efficient discovery of promising new OSDAs.
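As a minimal illustrative sketch (not the authors' code), the forward pass of such a one-hidden-layer network with sigmoid activations on both layers could look like the following; the weight shapes and names are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # One hidden layer; sigmoid activation on both the hidden and the
    # output layer, matching the architecture described for Figure S1A.
    h = sigmoid(W1 @ x + b1)   # hidden-layer activations
    y = sigmoid(W2 @ h + b2)   # scaled energy prediction in (0, 1)
    return y
```

Because the output layer is a sigmoid, the raw prediction lies in (0, 1) and must be mapped back to an energy scale, as described below for equation S2.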
The structural descriptors were obtained through the 3D-MoRSE (Molecule Representation of Structures based on Electron diffraction) code (1). This method generates descriptors that encode 3D molecular structure by sampling a calculated diffraction pattern. Each OSDA was optimized geometrically, with its lowest-energy geometry chosen using a genetic algorithm for conformation search (GACS). The MoRSE descriptors were calculated on the optimized geometry as follows:

I(s) = \sum_{i=2}^{N} \sum_{j=1}^{i-1} w_i w_j \frac{\sin(s\, r_{ij})}{s\, r_{ij}},   (S1)

where s is the scattering parameter, which is sampled on a grid. Each s value produces one intensity, I, which is one descriptor; fixing the number of s values therefore fixes the number of MoRSE descriptors. Here r_{ij} is the Euclidean distance between the ith and jth atoms, N is the total number of atoms, and w_i and w_j are weights for the ith and jth atoms. In our specific model, w_i is chosen as the van der Waals radius of the ith atom. Using equation S1, the 3D structure of an OSDA was represented by a list of easily calculable I values.
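A direct, illustrative implementation of equation S1 is sketched below (an assumption-laden sketch, not the 3D-MoRSE code itself). One detail the text leaves open is the grid start: here the grid is assumed to begin at Δs rather than s = 0 to avoid the 0/0 singularity (the limit of sin(sr)/(sr) as s → 0 is 1):

```python
import numpy as np

def morse_descriptors(coords, weights, s_max, ds):
    # coords: (N, 3) Cartesian coordinates; weights: per-atom weights
    # (e.g., van der Waals radii). Implements equation S1 on a grid of
    # scattering parameters s = ds, 2*ds, ..., s_max (start point assumed).
    s_values = np.arange(ds, s_max + 0.5 * ds, ds)
    n = len(coords)
    descriptors = []
    for s in s_values:
        total = 0.0
        for i in range(1, n):
            for j in range(i):
                r = np.linalg.norm(coords[i] - coords[j])
                total += weights[i] * weights[j] * np.sin(s * r) / (s * r)
        descriptors.append(total)
    return np.array(descriptors)
```

For a two-atom molecule at unit distance with unit weights, this reduces to I(s) = sin(s)/s, which provides a quick sanity check.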
The samples of OSDAs for training, testing, and validating the neural network consist of 4781 putative BEA OSDAs that we have obtained in our search for OSDAs for pure BEA and chiral BEA zeolite over the past five years. In this search, our procedure consisted of first designing putative small, achiral 'monomer' OSDAs and then finding suitable chiral linkers to dimerize these (2). We here use these monomers for training a neural network. To obtain good-scoring monomer OSDAs for BEA we used three strategies: de novo design, virtual screening, and virtual combinatorial chemistry. A de novo design algorithm (3,4) was used to generate many putative BEA OSDAs. Analogs of the highest-scoring hits were selected from the available building-block databases in eMolecules (https://reaxys.emolecules.com/) and Chemspace (https://chem-space.com/). Finally, we extended this set by generating alkylated derivatives. In this way, we obtained 4781 putative BEA OSDAs with predicted stabilization energies between -20 and 0 kJ/(mol Si). These OSDAs consist of a set of 3875 uncharged molecules and a set of 906 molecules that contain one or several charged N atoms.
To train the neural network, we first shuffled the full set of 4781 samples and randomly selected 20% as the validation set and the remaining 80% as the training and test set (Figure S1B). We normalized the 3D-MoRSE intensities so that each descriptor (I value) had a mean of zero and a standard deviation of one across all samples in the training and test set. The normalized I values are denoted as Ĩ in Figure S1A. The subtracted means μ_1, μ_2, …, μ_n and the original standard deviations σ_1, σ_2, …, σ_n were recorded for later use. During training, the stabilization energies calculated from molecular dynamics (MD) were used as the "ground truth" values for each sample. These "ground truth" values were scaled, since sigmoid activation on the output layer predicts values of ŷ between 0 and 1, instead of real energy values. The scaling was carried out as follows:

y^{(i)} = \frac{E^{(i)} - E_{min}}{E_{max} - E_{min}},   (S2)

where E^{(i)} is the stabilization energy calculated from molecular dynamics on OSDA i, and E_{min} (E_{max}) is the minimum (maximum) of all MD-calculated stabilization energies of the training and test set. The values of E_{min} and E_{max} were recorded for later use. As with the normalization of the input I values, this scaling was done without the validation set. The idea is that the validation set will serve as the final test of the neural network's performance and is deemed "unseen" throughout the network training and hyperparameter tuning process. We also tested a variation of the neural network design in which the output layer uses a linear activation. In this case, the prediction ŷ has no value limit and is a direct prediction of the MD-calculated stabilization energy, y^{(i)} = E^{(i)}.
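The two preprocessing steps above (per-descriptor z-scoring of the inputs and min-max scaling of the energies, both fit on the training and test set only) can be sketched as follows. The function names are illustrative assumptions:

```python
import numpy as np

def fit_scalers(I_train, E_train):
    # Statistics computed WITHOUT the validation set, which stays "unseen".
    mu, sigma = I_train.mean(axis=0), I_train.std(axis=0)
    e_min, e_max = E_train.min(), E_train.max()
    return mu, sigma, e_min, e_max

def transform(I, E, mu, sigma, e_min, e_max):
    I_tilde = (I - mu) / sigma            # per-descriptor z-score
    y = (E - e_min) / (e_max - e_min)     # equation S2: energies mapped to [0, 1]
    return I_tilde, y
```

The recorded mu, sigma, e_min, and e_max are later reused unchanged to transform the validation set.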
Given a training set of m samples, the cost function is defined using a weighted root mean square error (RMS error, or RMSE):

RMSE = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} w^{(i)} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 },   (S3)

where ŷ^{(i)} is the predicted scaled stabilization energy of OSDA i, y^{(i)} is the scaled MD-calculated stabilization energy of OSDA i, and w^{(i)} is the weight for OSDA i. We experimented with two protocols for the weight w^{(i)}. The first protocol was w^{(i)} ≡ 1 for all i = 1, 2, …, m; equation S3 is then the regular root mean square error. The second protocol was w^{(i)} = 10 if E^{(i)} ≤ -15 kJ/(mol Si) and w^{(i)} = 1 otherwise. OSDAs with energy below -15 kJ/(mol Si) were regarded as good candidates, and we wanted to ensure that the predictions for these OSDAs were as accurate as possible. The factor-of-ten difference in w^{(i)} ensured that a poor prediction on OSDAs with favorable energies was heavily penalized. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm was used to minimize the root mean square error and to obtain the final weights of the neural network.
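One plausible reading of the second weighting protocol for equation S3 is sketched below (an illustrative sketch; the exact normalization of the weighted sum is an assumption):

```python
import numpy as np

def weighted_rmse(y_pred, y_true, E_md, threshold=-15.0, w_good=10.0):
    # Second protocol of equation S3: OSDAs with MD stabilization energy
    # <= -15 kJ/(mol Si) get weight 10, all others weight 1, so that poor
    # predictions on favorable (good-candidate) OSDAs are heavily penalized.
    w = np.where(E_md <= threshold, w_good, 1.0)
    return float(np.sqrt(np.mean(w * (y_pred - y_true) ** 2)))
```

With w_good = 1.0 this reduces to the plain RMSE of the first protocol.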
In the training process, we randomly selected 80% of the training and test set as the training set and the remaining 20% as the test set (Figure S1B). The neural network was trained on the training set. Tunable hyperparameters of the neural network include the maximum s value in equation S1, s_max; the increment, Δs; and the number of nodes, h, in the hidden layer. For each set of hyperparameters, this random sampling and training process was repeated 30 times, producing 30 trained neural networks (NN_1 through NN_30 in Figure S1B).
To select the best-performing hyperparameters s_max, Δs, and h, we evaluated the performance of the neural networks on the test sets. For each of the 30 trained neural networks, we calculated the test set root mean square error:

RMSE_{test,j} = \sqrt{ \frac{1}{m_{test,j}} \sum_{i=1}^{m_{test,j}} \left( \hat{E}^{(i)} - E^{(i)} \right)^2 },   (S4)

where RMSE_{test,j} is the root mean square error of the jth test set and m_{test,j} is the number of samples in this test set. Here Ê^{(i)} is the predicted stabilization energy of sample i using the jth neural network (NN_j), which was trained on the jth training set. If linear activation was used on the output layer of the neural networks, we have Ê = ŷ. Otherwise, an inverse transform of equation S2 was carried out to obtain the unscaled predicted energy: Ê = E_{min} + ŷ (E_{max} - E_{min}). The overall performance of a given set of hyperparameters is described by the average test RMSE:

\overline{RMSE}_{test} = \frac{1}{30} \sum_{j=1}^{30} RMSE_{test,j}.   (S5)

In the end, we chose the set of hyperparameters with the smallest \overline{RMSE}_{test}. We also applied a similar measure to the training sets:

\overline{RMSE}_{training} = \frac{1}{30} \sum_{j=1}^{30} RMSE_{training,j},   (S6)

where RMSE_{training,j} is the root mean square error of the jth training set, defined similarly to equation S4. Note that RMSE_{training,j} is also the cost function being minimized during training of the jth network (equation S3).
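The hyperparameter choice itself is a simple argmin over averaged test errors. A minimal sketch, assuming a hypothetical `results` mapping from hyperparameter tuples (s_max, Δs, h) to the 30 per-split test RMSEs:

```python
def select_hyperparameters(results):
    # results: {(s_max, ds, h): [RMSE_test_1, ..., RMSE_test_30]}
    # Average over the 30 random splits (equation S5), then pick the
    # hyperparameter set with the smallest average test RMSE.
    means = {hp: sum(v) / len(v) for hp, v in results.items()}
    return min(means, key=means.get)
```

The same pattern with the training-set errors would give the \overline{RMSE}_{training} diagnostic of equation S6.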
Finally, we validated the trained neural networks. For each sample in the validation set, we first normalized the I values: Ĩ_k = (I_k - μ_k)/σ_k, k = 1, 2, …, n. The μ_k and σ_k are the previously recorded means and standard deviations from the training and test set. The normalized MoRSE descriptors were then input to each of the 30 neural networks with the optimized hyperparameters, producing 30 ŷ values for each OSDA in the validation set. Again, if linear activation was used on the output layer of the neural networks, we have Ê = ŷ. Otherwise, an inverse transform of equation S2 was carried out to obtain the unscaled predicted energy: Ê = E_{min} + ŷ (E_{max} - E_{min}). Like μ_k and σ_k, E_{min} and E_{max} are the previously recorded minimal and maximal energies from the training and test set. For the ith OSDA, the average of the 30 Ê values is taken as the final prediction of this OSDA's unscaled stabilization energy:

\hat{E}^{(i)} = \frac{1}{30} \sum_{j=1}^{30} \hat{E}_j^{(i)},   (S7)

where Ê_j^{(i)} is the prediction of OSDA i by the jth neural network. The root mean square error on the validation set was calculated to evaluate the neural network's performance:

RMSE_{validation} = \sqrt{ \frac{1}{m_{validation}} \sum_{i=1}^{m_{validation}} \left( \hat{E}^{(i)} - E^{(i)} \right)^2 }.   (S8)

To compare the neural networks' performance on the validation set and the training and test set, we also calculated the overall root mean square error for the training and test set combined:

RMSE_{training+test} = \sqrt{ \frac{1}{m_{training+test}} \sum_{i=1}^{m_{training+test}} \left( \hat{E}^{(i)} - E^{(i)} \right)^2 }.   (S9)
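The validation-time ensemble prediction (normalize with the recorded statistics, invert equation S2, average over the 30 networks as in equation S7) can be sketched as follows; here each network is assumed, for illustration, to be a callable returning scaled predictions:

```python
import numpy as np

def ensemble_predict(I_val, networks, mu, sigma, e_min, e_max):
    # networks: list of 30 callables mapping normalized descriptors to
    # scaled predictions y_hat in [0, 1] (sigmoid-output case).
    I_tilde = (I_val - mu) / sigma                 # recorded mu, sigma reused
    preds = [e_min + net(I_tilde) * (e_max - e_min)  # invert equation S2
             for net in networks]
    return np.mean(preds, axis=0)                  # equation S7: ensemble mean
```

For the linear-activation variant, the inverse transform would be skipped and the raw outputs averaged directly.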

In silico materials design
The validated neural networks were then incorporated into our in silico materials design approach to accelerate the stabilization energy calculation. Our materials design approach is a de novo design program that searches for and generates synthesizable molecules with desirable properties. Through a genetic algorithm, this method searches the chemical space defined by a list of predefined, well-documented organic chemistry reactions and a user-supplied database of commercially available reagents. The output is a set of molecules that scored well on the scoring function, together with their synthesis routes.
The scoring function used for the design of BEA OSDAs is summarized in Table S1. First, it was verified that the molecule to be scored was amenable to molecular mechanics minimization with the force field used. Then the total number of rotatable bonds, the largest number of consecutive sp3-sp3 rotatable bonds, the presence of atoms other than C, N, or H, the presence of triply bonded C, and the ratio of C atoms to charged N atoms were calculated. These properties can all be deduced from the molecular topology and are computationally trivial to obtain. If all of these fall within their respective thresholds, a locally optimal conformation of the molecule was calculated and the molecular volume was obtained. If this fell within its threshold, a conformational search was performed using GACS to obtain the global minimum-energy conformation of the molecule. This conformation was used either as a starting point for the molecular dynamics procedure to obtain the stabilization energy in the zeolite structure, or to calculate the 3D-MoRSE descriptors to be input into the neural network. Here we chose the latter.
The set of reactions used to synthesize virtual molecules presently consists of 100 organic chemistry reactions. The database of reagents we used contains 39 500 commercially available chemicals. To start the run, we randomly selected reactions, reagents, and tree depths to generate the initial population of molecules. Here, tree depth is defined as the number of reactions that take place to form one solution molecule; this depth is usually constrained between 3 and 5. The population size is fixed at 100, and every generated molecule was scored. It is possible that some molecules do not pass the scoring filters (see Table S1) and therefore do not have the molecular volume or stabilization energy calculated.
This population was sorted using Pareto optimization (5). First, solution molecules that are not dominated by any other solution molecule were placed in the first Pareto front. With the molecules belonging to the first Pareto front removed, the second Pareto front was identified from the remaining population using the same criterion. The third Pareto front was identified with the first and second fronts removed. This process continued until all molecules had been ranked. Each Pareto front satisfies the definition that no molecule in the front is dominated by any other molecule in the front. The definition of dominance is as follows: molecule i dominates molecule j if the scoring of molecule i is no worse than the scoring of molecule j in all objectives, AND the scoring of molecule i is strictly better than the scoring of molecule j in at least one objective.
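The iterative front-peeling procedure above can be sketched as follows. This minimal illustration assumes every objective is of the "minimize" type; the actual scoring function mixes score types, as discussed next:

```python
def dominates(a, b):
    # a dominates b: no worse in every objective AND strictly better in one
    # (minimize-type objectives assumed for this sketch).
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_fronts(scores):
    # Peel off successive Pareto fronts until every molecule is ranked.
    remaining = list(range(len(scores)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts
```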
For the binary-type score "force field compatibility," compatible is better than incompatible. For a threshold-type score, e.g., molecular volume, a passed score is better than a failed score, and a failed score closer to the threshold is better than one farther from it. Note that there is no superiority between two passed scores with different values. For a minimize-type score, a smaller value is better than a larger one.
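These per-objective comparison rules can be sketched as a single function (an illustrative sketch; the convention that a score "passes" when it is at or below its threshold is an assumption):

```python
def compare(a, b, kind, threshold=None):
    """Return +1 if a is better, -1 if b is better, 0 if neither is superior,
    for one objective of the given kind: 'binary', 'threshold', or 'minimize'."""
    if kind == 'binary':
        return (a > b) - (a < b)          # True (compatible) beats False
    if kind == 'threshold':
        a_pass, b_pass = a <= threshold, b <= threshold   # pass rule assumed
        if a_pass and b_pass:
            return 0                      # no superiority between two passes
        if a_pass != b_pass:
            return 1 if a_pass else -1    # a passed score beats a failed one
        da, db = abs(a - threshold), abs(b - threshold)
        return (da < db) - (da > db)      # failed score closer to threshold wins
    return (a < b) - (a > b)              # 'minimize': smaller is better
```

Dominance over the full score vector then reduces to: no objective returns -1, and at least one returns +1.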
Multiple molecules can exist within each Pareto front. Secondary sorting within each front was carried out based on the order in which molecules were generated and entered the population. As younger molecules have more potential to explore new areas in the chemical search space, they are placed ahead in this sorting.
After the population had been completely sorted, one of the six operators was applied to form a child molecule: (1) 'add': choose a parent molecule from the current population. Add a randomly chosen reaction step to the reaction path leading to the parent. Additional reagents needed for the new reaction are randomly selected from the reagent database.
(2) 'cut': choose a parent molecule from the current population. Remove the last reaction step from the reaction path leading to the parent.
(3) 'replace random': choose a parent molecule from the current population. Replace one reagent in the reaction path leading to the parent. The new reagent is randomly selected, but must have the required functionality.
(4) 'replace like': same as (3), except the new reagent must have a minimal preset chemical similarity to the reagent being replaced.
(5) 'combine': choose two parent molecules from the current population. The parents are searched for chemical functionalities that allow them to participate in a reaction. If no such functionality is detected, two other parents are selected. If more than one functionality is detected, one of the allowed reactions is selected at random.
To select parents, we used tournament selection: a set of n_tour molecules was randomly selected, and the best-ranking individual was chosen as the parent. If more than one parent was needed, the random selection of n_tour molecules was carried out multiple times. Here, we used n_tour = 2.
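Tournament selection is a small, self-contained step; a minimal sketch (illustrative names, not the authors' code) follows:

```python
import random

def tournament_select(population, rank, tour=2, rng=random):
    # rank[i]: position of individual i after Pareto sorting plus the
    # age-based secondary sorting (lower is better). Pick `tour` random
    # contenders and return the best-ranking one.
    contenders = rng.sample(range(len(population)), tour)
    best = min(contenders, key=lambda i: rank[i])
    return population[best]
```

With tour = 2, as used here, selection pressure is mild: the better of two random individuals becomes a parent.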
The child molecule was scored and then compared to the worst-ranking molecule in the population. If the child molecule dominated the worst-ranking molecule, the child replaced it; otherwise the child was discarded. If the replacement took place, Pareto optimization was run again to re-rank every molecule in the evolved population. This 'select parent - produce child - update population' step was repeated 1 000 000 times. Putative OSDAs with energies below 0 kJ/(mol Si) were retained for further analysis.

Principal Component and Principal Coordinate Analysis
To visualize sets of molecules in the reduced space of 3D-MoRSE intensities, principal component analysis (PCA) was used. The data matrices of intensities were centered and standardized, and principal components were extracted using the NIPALS algorithm (6).
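The NIPALS algorithm extracts principal components one at a time by alternating score and loading updates, deflating the data matrix after each component. A minimal sketch (not the implementation used here; tolerance and iteration limits are assumptions):

```python
import numpy as np

def nipals_pca(X, n_components=2, tol=1e-9, max_iter=1000):
    # X: centered (and standardized) data matrix, samples x descriptors.
    X = X.copy().astype(float)
    scores, loadings = [], []
    for _ in range(n_components):
        t = X[:, 0].copy()                 # initial score vector guess
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)          # loading estimate
            p /= np.linalg.norm(p)         # normalize loading
            t_new = X @ p                  # updated score vector
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        X -= np.outer(t, p)                # deflate: remove this component
        scores.append(t)
        loadings.append(p)
    return np.array(scores).T, np.array(loadings).T
```

Successive loading vectors come out (numerically) orthogonal, and the first component captures the largest share of variance, which makes NIPALS convenient when only the first few components are needed for a scatter plot.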
To visualize sets of molecules generated by the in silico materials design with ML models of differing meta parameters, principal coordinate analysis (PCOA) (7) was applied to distance matrices obtained from the 2-D similarity indices of sets of molecules. Molecules were represented with a binary fingerprint of size 2115 based upon the patterns of up to 4 atoms bound to a central atom, using the MMFF force field atom types as atom descriptors. From these fingerprints, a Tanimoto similarity (8) coefficient T was obtained, and 1 - T was used as a distance metric for input into the PCOA.

Discussion of Overfitting
As shown in Figure S2, \overline{RMSE}_{training} continues to decrease with the number of hidden nodes in the network, whereas there is an optimal number of hidden nodes for \overline{RMSE}_{test}, due to slight overfitting past that point. The optimal number of hidden nodes for the other models was determined in the same way. In addition, it is possible that neural networks overfit to the training and test set and perform poorly on the independent validation step. By examining the best models from Table 1 (see main text), we find that RMSE_{training+test} and RMSE_{validation} are very similar. This result indicates that the neural networks are well trained and not overfit to the training and test set.

Figure S1. (A) Structure of the neural network. Here Ĩ denotes the normalized MoRSE intensities, and ŷ denotes the predicted scaled stabilization energy. (B) Illustration of the random sampling process. A total of 4781 OSDA samples were shuffled. First, 20% of the samples were randomly selected and set aside as the validation set, and the remaining 80% as the training and test set. Then, 20% of the training and test set was randomly selected as the test set and the remaining 80% as the training set. The process was repeated 30 times, eventually producing 30 trained neural networks.

[Figure S6 panels: Model 2a, Model 2b, Model 3a, Model 3b, Model 4a, Model 4b]
Figure S6. PCOA scatter plots of the molecules with a predicted ML stabilization energy in BEA lower than -15 kJ/(mol Si) generated in the eight in silico runs. The green and red dots are the compounds generated in the model indicated at the lower left of each subplot. The red dots have a predicted ML stabilization energy in BEA lower than -17 kJ/(mol Si). The blue dots correspond to molecules generated in the other seven models.