Transformational machine learning: Learning how to learn from many related scientific problems

Significance
Machine learning (ML) is the branch of artificial intelligence (AI) that develops computational systems that learn from experience. In supervised ML, the ML system generalizes from labelled examples to learn a model that can predict the labels of unseen examples. Examples are generally represented using features that directly describe the examples. For instance, in drug design, ML uses features that describe molecular shape and so on. In cases where there are multiple related ML problems, it is possible to use a different type of feature: predictions made about the examples by ML models learned on other problems. We call this transformational ML. We show that this results in better predictions and improved understanding when applied to scientific problems.

The pXC50 provides a continuous scale from 1 to 12, where a compound with a value of 1 is the least potent inhibitor, requiring a large concentration of the drug to achieve 50% inhibition, and a compound with a value of 12 is the most potent, requiring a very low concentration to achieve 50% inhibition. In the small proportion of cases where multiple activities had been reported for a particular compound-target pair, a consensus value was selected as the median of those activities falling in the modal log unit; the unit of activity we refer to is therefore the pseudo-pIC50. We thus cast the problem of learning QSARs as the regression task of predicting pXC50 activity for a chemical compound represented by a 1,024-bit FCFP4 fingerprint. TML dataset input columns are formed from the pXC50 activities predicted by the baseline models on the test subsets of each CV iteration. TML dataset output columns are the same pXC50 activities as used for the baseline datasets.
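The construction of TML input columns from baseline predictions can be sketched as follows. This is a minimal illustration with synthetic data, scikit-learn's RandomForestRegressor standing in for the R implementations used in the study, and the per-CV-iteration bookkeeping simplified; the fingerprint width and dataset sizes are toy values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 5 related QSAR "targets", each with its own
# compounds described by a 32-bit fingerprint (the paper uses 1,024-bit FCFP4).
n_targets, n_bits = 5, 32
datasets = [
    (rng.integers(0, 2, size=(60, n_bits)).astype(float), rng.normal(6, 1, 60))
    for _ in range(n_targets)
]

# Baseline: one model per target, trained on that target's own fingerprints.
baselines = [
    RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    for X, y in datasets
]

# TML representation for target t: each compound is re-described by the
# pXC50 predictions of the models learned on the OTHER targets.
def tml_features(t):
    X, _ = datasets[t]
    cols = [m.predict(X) for j, m in enumerate(baselines) if j != t]
    return np.column_stack(cols)

X_tml = tml_features(0)
print(X_tml.shape)  # (60, 4): one column per auxiliary target model
```

The output columns of the TML dataset remain the original pXC50 activities; only the input representation changes.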
We employed five popular machine learning algorithms with implementations easily accessible from R: random forest (RF, as implemented in the ranger R package), support vector machine (SVM, ksvm function of the kernlab R package), k-nearest neighbour (KNN, FNN R package), neural networks (NN, tensorflow.keras Python package), and extreme gradient boosted trees (XGB, xgboost R package). Hyperparameters were selected as follows. In all RF experiments, we used 500 trees, a third of the total number of variables was considered at each split, and five observations were used in each terminal node. For the SVM experiments, we used RBF kernels with a gamma value of 0.5 and a cost of 1.0. The chosen RF and SVM hyperparameter sets were the ones that produced the best overall performance on a smaller, randomly selected subset of datasets. For KNN, the number of neighbours ('k') was chosen individually for each QSAR model using an inner cycle of 3-fold cross-validation. For NNs, we tested several fully connected feedforward architectures on a small subset of the datasets, and used dropout and L2 penalisation at different rates to minimise the risk of overfitting. For the baseline experiments, we chose an architecture consisting of 1 hidden layer with 128 neurons and 1 output neuron. ReLU activation functions were used in the hidden layer, whilst the output neuron had a linear activation function, as is traditional in regression problems. For the TML experiments, the NN architecture consisted of 2 hidden layers, the first with 712 neurons and the second with 128, both with ReLUs. We used ADAM as the optimiser in both sets of experiments. For the baseline and TML experiments, XGB's hyperparameters for each dataset model were chosen by exploring the following grid: number of rounds in {1000, 1500} and learning rate in {0.001, 0.01, 0.1, 0.2, 0.3}. The hyperparameter set producing the best model performance was chosen using an inner validation split of 30%.
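The grid search with an inner 30% validation split can be sketched as follows. This is a hedged illustration on synthetic data: scikit-learn's GradientBoostingRegressor stands in for the xgboost R package actually used, while the grid values (rounds and learning rates) are the ones listed above.

```python
import numpy as np
from itertools import product
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Inner 30% validation split, used only for hyperparameter selection.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grid from the text: boosting rounds x learning rate.
grid = product([1000, 1500], [0.001, 0.01, 0.1, 0.2, 0.3])
best = min(
    (mean_squared_error(
        y_val,
        GradientBoostingRegressor(
            n_estimators=n, learning_rate=lr, random_state=0
        ).fit(X_tr, y_tr).predict(X_val)),
     (n, lr))
    for n, lr in grid
)
print("best (validation MSE, (rounds, lr)):", best)
```

The winning configuration would then be refit on the full training data before the outer cross-validation assessment.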
The stacking experiments were performed using convex linear regression and ridge regression. The linear regression model was set up such that the weights of the coefficients are non-negative and sum to 1. The ridge regularization parameter was tuned using internal cross-validation. We assessed the performance of all the models using 10-fold cross-validation. The code is available at https://github.com/iaolier/TML-QSAR.
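The convex stacking constraint (non-negative weights summing to 1) amounts to least squares over the probability simplex. A minimal NumPy sketch, using projected gradient descent rather than whatever exact solver was used in the study:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def convex_stack(P, y, iters=2000, lr=0.01):
    """Fit non-negative weights summing to 1 over base-model predictions P
    (n_samples x n_models) by projected gradient descent on squared error."""
    w = np.full(P.shape[1], 1 / P.shape[1])
    for _ in range(iters):
        grad = P.T @ (P @ w - y) / len(y)
        w = project_simplex(w - lr * grad)
    return w

# Toy check: y is exactly 0.7 * model0 + 0.3 * model1.
rng = np.random.default_rng(2)
P = rng.normal(size=(500, 3))
y = 0.7 * P[:, 0] + 0.3 * P[:, 1]
w = convex_stack(P, y)
print(np.round(w, 2))  # close to [0.7, 0.3, 0.0]
```

Ridge stacking replaces the simplex constraint with an L2 penalty on the weights, tuned here (as in the text) by internal cross-validation.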

Gene Expression Learning
We utilized the Library of Integrated Network-based Cellular Signatures (LINCS) data. This data describes the effect of drugs in cancer cell lines on the expression levels of 978 landmark human genes. We used LINCS Phase II data (accession code GSE70138), which consists of 118,050 experimental conditions along with the corresponding expression levels for the 978 landmark genes. We generated attributes for each perturbation condition using the accompanying metadata. Each experimental condition is associated with a perturbagen (drug), cell type and site, perturbagen dosage, and perturbagen time frame. In total, there are 30 cell types (ct), 14 cell sites (cs), 83 dosages (d), and 3 time points (tp). Of the 2,170 drugs in the dataset, 1,795 have valid chemical structures (canonical SMILES codes) according to the metadata. We converted the canonical SMILES to 1,024-bit FCFP4 fingerprints (fp) using RDKit (Landrum, 2016). This generated a 107,152 by 1,155 experimental condition matrix, row and column identifiers included, which can be used as input for building models to predict the expression levels of the 978 genes using traditional machine learning techniques. For each gene, we generated a train and a test set with 7,000 and 3,000 samples respectively. We did this by first randomly splitting the original perturbation condition data, with 107,152 samples and their corresponding gene expression levels, into train and test sets of 70% and 30% respectively. From these main train and test sets, we then randomly sampled train and test individuals for each gene. The gene expression levels for the 978 genes were normalised such that their values lie between 0.0 and 1.0. We used five learning algorithms: random forests (RF), gradient boosting machines (XGBoost), support vector machines (SVM), k-nearest neighbors (KNN), and neural networks (NN).
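The two-stage sampling (a main 70/30 split of the conditions, then per-gene subsampling from each pool) can be sketched as follows, with toy sizes in place of the 107,152 conditions and the 7,000/3,000 per-gene samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n_conditions = 1000  # stand-in for the 107,152 experimental conditions

# Main 70/30 split of the perturbation-condition rows, made once and shared
# by every gene, so a gene's test conditions never appear in any training pool.
idx = rng.permutation(n_conditions)
main_train, main_test = idx[:700], idx[700:]

# Per-gene sets are subsampled from the main pools (7,000/3,000 in the paper;
# 70/30 here to match the toy sizes).
def gene_split(rng, n_train=70, n_test=30):
    return (rng.choice(main_train, size=n_train, replace=False),
            rng.choice(main_test, size=n_test, replace=False))

tr, te = gene_split(rng)
print(len(tr), len(te), np.intersect1d(tr, te).size)  # 70 30 0
```

Because the per-gene samples are drawn from disjoint pools, the train/test separation holds by construction for every gene.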
For RF, 500 trees were grown, a third of the total number of variables was considered at each split, and five observations were used in each terminal node. For XGBoost, most hyperparameters were left at their default settings, while the following were tuned using a grid: number of rounds = (500, 1000, 1500, 2000), max depth = (2, 4, 6, 8), and learning rate = (0.001, 0.01, 0.1, 0.2, 0.3). The SVMs were built using an epsilon of 0.01, a cost of 0.25, and a gamma of 0.5, having established through data exploration that these values perform reasonably well. Five neighbors were used for KNN. We used an NN with two hidden layers: the first layer contained a third of the total number of input nodes, and the second layer a third of the number of nodes in the first. Since this is a regression problem, there was only one output node. All experiments were performed in R, and the code and dataset are available at https://github.com/oghenejokpeme/TML-gene-expression and http://dx.doi.org/10.17632/2djzy3p9p9.1 respectively.
Due to computational expense, we performed the transformative experiments using a sequential, monotonic increase in the number of input features. That is, having performed the baseline case experiments for a given learner, we did not then perform the transformative case for a gene using all 977 gene features. Instead, we performed the transformative case using 50 features, then 100, and so on, such that all the gene features in the previous set are also present in the current set. The transformative model results reported in the main manuscript were built using 500 features for RF and SVM, 300 for KNN, and 50 for NN. The stacking experiments were performed using convex linear regression and ridge regression. The linear regression model was set up such that the weights of the coefficients are non-negative and sum to 1. The ridge regularization parameter was tuned using internal cross-validation.
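The monotonic feature-subset scheme can be implemented by fixing a single ordering of the gene features and taking growing prefixes, so that each set contains the previous one. A small sketch; the random ordering here is purely illustrative, not the selection order used in the study:

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes = 977  # extrinsic gene features available to a TML model

# A single fixed ordering guarantees monotonicity: the 50-feature set is a
# subset of the 100-feature set, which is a subset of the 150-feature set, etc.
order = rng.permutation(n_genes)
subset_sizes = range(50, n_genes + 1, 50)
subsets = {k: order[:k] for k in subset_sizes}

assert set(subsets[50]) <= set(subsets[100]) <= set(subsets[150])
print(len(list(subset_sizes)))  # 19 nested subsets: 50, 100, ..., 950
```

Each subset size then defines one transformative experiment per learner, avoiding a full 977-feature run for every gene.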

Meta-Learning for Machine Learning
The third problem domain is meta-learning for machine learning. The specific problem consisted of predicting the performance of a machine learning method (in an exact configuration) on a new task, given the characteristics of the training data (e.g., statistics of the training data distribution). Problems in this domain are assumed to be related by having similar data distributions or data defects (e.g., missing values), or by containing data generated by similar processes. The properties used to describe the datasets themselves are typically called meta-features.
From OpenML we retrieved data from an earlier meta-learning study (details can be found at https://www.openml.org/s/7). Although we had to exclude a few tasks and algorithms because they lacked sufficient evaluations in OpenML, this yielded a set of 10,840 evaluations of 53 machine learning methods (called flows on OpenML) from mlr (Bischl et al., 2016) on 351 tasks (datasets). From each task, 21 dataset descriptors were extracted, such as the number of examples, the number of missing values, and the percentage of numeric features. We formed one meta-dataset for each machine learning method. An observation within a meta-dataset represents an original OpenML task, and each feature a dataset descriptor. The original aim of the study was to predict the area under the ROC curve (AUC). In total, we therefore produced 53 meta-datasets with diverse numbers of OpenML tasks, ranging from just above 100 to about 250. We applied transformative learning to transform the original representation of the datasets into extrinsic descriptors of the OpenML tasks. As in the other two problem domains, five ML algorithms were selected to perform the transformation: RF (500 trees), SVM (RBF kernel, gamma = 0.5, C = 1.0), KNN ('k' chosen by an internal 3-fold cross-validation cycle), NN (1 hidden layer of 10 neurons with ReLU activation functions, and 1 output neuron with a linear activation function), and XGB (as in the other two problems). The transformed descriptors were generated by predicting AUC using all available models, excluding the one for the method to which the meta-dataset belonged. In this way, 52 extrinsic descriptors were generated for each OpenML task. Model performances were assessed using 10-fold cross-validation. The baseline models were manually tuned. For the RF models, the main hyperparameters are the number of trees and the number of random features considered at each split (mtry). After several rounds of experimentation, we used 500 trees for every model and the default heuristic (the square root of the number of features) to set mtry for each dataset. Further tuning did not significantly improve performance.
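The leave-one-out generation of extrinsic descriptors can be sketched as follows, with synthetic meta-data, a reduced number of methods, and scikit-learn random forests standing in for the models used in the study:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n_tasks, n_metafeatures, n_methods = 120, 21, 6  # 53 methods in the paper

# One meta-dataset per ML method: meta-features of each OpenML task -> AUC.
X = rng.normal(size=(n_tasks, n_metafeatures))
auc = rng.uniform(0.5, 1.0, size=(n_tasks, n_methods))

models = [
    RandomForestRegressor(n_estimators=50, random_state=0).fit(X, auc[:, m])
    for m in range(n_methods)
]

# Extrinsic descriptors for the meta-dataset of method m: AUC predictions of
# every model EXCEPT the one for m itself (leave-one-out over methods).
def extrinsic(m):
    return np.column_stack(
        [models[j].predict(X) for j in range(n_methods) if j != m]
    )

print(extrinsic(0).shape)  # (120, 5)
```

With the paper's 53 methods, the same construction yields the 52 extrinsic descriptors per task described above.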

Clustering chemical compounds and protein targets
To learn whether the TML descriptions of the chemical compounds express valuable relationships between them, we clustered the compounds hierarchically with the HDBSCAN algorithm [McInnes2017]. To make the clustering pharmacologically relevant, we focussed on the highest activities: we kept only the 10% highest activities and set the remainder to 0, resulting in a sparse representation. Columns that become entirely 0 are removed before clustering; these correspond to the protein targets for which no high activities are predicted for any drug. Because of the high dimensionality of this data, we used a normalized Euclidean distance to obtain the hierarchical clusterings. To select flat clusters from the cluster tree hierarchy, we used the Excess of Mass selection method and a minimal cluster size of 3.
To cluster the protein targets, we followed exactly the same procedure, but on the transpose of the sparse matrix. Since this matrix has drug compounds in the rows and protein targets in the columns, the transpose yields a representation of the protein targets in terms of the predicted activities for all known drugs. Again, we kept only the 10% highest activities and removed the columns that become entirely zero (the chemical compounds that do not have a high activity on any protein target). The hierarchical clustering was again done with the HDBSCAN algorithm and normalized Euclidean distances. To estimate distances between the chemical compounds and produce Figures 3b and 3c, we performed dimensionality reduction using the tSNE algorithm [vanderMaaten2008]. In Figures 3a and 3b, we embedded the data in two dimensions and color-coded the chemical compounds and protein targets according to their assigned clusters (learned with HDBSCAN as described above), using black for the 'singleton' elements that do not belong to any cluster. The code to produce the clustering can be found at: https://github.com/joaquinvanschoren/transformational-learning
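The sparsification step (keeping only the top 10% of predicted activities and dropping all-zero columns) can be sketched in NumPy; applying the same code to the transpose produces the representation used for clustering the protein targets. Matrix sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(6, 1.5, size=(40, 25))  # predicted activities: drugs x targets

# Keep only the globally highest 10% of activities; zero out the rest.
cutoff = np.quantile(A, 0.9)
S = np.where(A >= cutoff, A, 0.0)

# Drop targets (columns) with no high predicted activity for any drug;
# clustering the targets instead starts from S.T and repeats these steps.
S = S[:, S.any(axis=0)]
print(S.shape)
```

The resulting sparse matrix is what HDBSCAN (Excess of Mass selection, minimum cluster size 3) is run on, using normalized Euclidean distances.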

FAIR Sharing
To enable reproducibility, all of the thousands of datasets (QSAR, LINCS, and meta-learning), the links to the code (TML, RF, XGB, SVM, KNN, NN), and the ~50,000 random forest ML models (counting all decision trees) are available under a Creative Commons license at the Open Science Framework: https://osf.io/vbn5u/. This amounts to ~100 GB of compressed data; few ML projects have put so much reusable data online.
To maximize its added value, we follow the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for publishing digital objects (Wilkinson et al., 2016). The aim of the FAIR movement is to produce machine-actionable digital resources, which facilitate discovery, evaluation, and data and knowledge integration and reuse by the community.
· BIOPAX (Biological Pathway Exchange) ontology (www.biopax.org) to capture information about genes in the datasets and models.
The whole set of metadata used for our transformative learning study is encoded as the tmlmetadata ontology. It is available at www.purl.org/TML and https://bioportal.bioontology.org/ontologies/TML in the following formats: OWL, CSV, RDF, XML.
6. IAO: Model (a generalization of a set of training data able to predict values for unseen instances; it is an output from an execution of a data mining algorithm implementation)
7. MLS: Model evaluation (a setting of a value of the performance measure specified by the evaluation specification; it connects a measure specification with its value)
8. IAO: Programming language (a language in which source code is written, intended to be executed/run by a software interpreter; programming languages are ways to write instructions that specify what to do and, sometimes, how to do it)
9. MLS: Software (implemented computer programs, procedures, scripts, or rules with associated documentation, possibly constituting an organized environment, stored in read/write memory for the purpose of being executed within a computer system)
10. MLS: Task (a formal description of a process that needs to be completed, e.g., based on inputs and outputs; a Task is any piece of work that needs to be addressed in the data mining process; in ML Schema, it is defined based on data)
11. IAO: Version (an information content entity that is a sequence of characters borne by part of each of a class of manufactured products or its packaging and indicates its order within a set of other products having the same name)
For example, one of the thousands of models produced in this study is the AARS-Random Forest predictive model. Its metadata capture the following information about the model and the process of its generation: the input data is the LINCS dataset, the employed algorithm is Random Forest, the model was evaluated using RMSE, the model covers the gene AARS, and it is available under the CC BY-SA 2.0 license. Table 1 provides a summary of how the recording and publishing of TML study outputs comply with each of the FAIR principles.