Previous Article |
Table of Contents
| Next Article
PHYSICAL SCIENCES / BIOLOGICAL SCIENCES / APPLIED MATHEMATICS / EVOLUTION
Defining functional distance using manifold embeddings of gene ontology annotations

,
*Department of Mathematics, University of Minnesota, Minneapolis, MN 55455; and
Program in Bioinformatics, Boston University, Boston, MA 02215
Communicated by Ronald R. Coifman, Yale University, New Haven, CT, April 9, 2007 (received for review June 7, 2006)
| Abstract |
|---|
|
|
|---|
kernel methods | diffusion geometry | domain evolution | functional annotation | homology modeling
At first glance, the notion of functional distance is qualitative and subjective. The development of annotation systems that depict function in a machine readable format was the first step in treating functional annotation rigorously. For example, the Gene Ontology (6) (GO) has become the gold standard for describing molecular functions of genes and proteins. However, the GO is not naturally amenable to measuring distance. One complication is an intrinsic bias in annotation where large numbers of unrelated genes share the same annotation (ATPase), making those categories uninformative. Previous attempts at identifying functional relationships between genes focused mostly on calculating statistical over-representation of functional categories (7). These methods are well suited for quantifying coherence of function in sets of genes, but not useful for exploring structure–function or sequence–function relationships.
Recently, researchers have recognized the importance of measuring distance between annotations (8) and proposed a simple measure of distance using the shortest path algorithm (9). However, these kinds of distances lack resolution and are complicated by somewhat arbitrary characteristics of the ontology, e.g., when annotations on the same level differ in their degree of generality. Accordingly, we show that functional metrics based on shortest path algorithms perform significantly worse than methods based on diffusion-type manifold embedding (10) proposed in this work.
Defining distances between functional categories is integrally important due to potential insights into the coevolution of sequence, structure and function (11). For example, function broadly defined as all activities performed by a set of sequences that fold into a domain structure, can be represented as a weighted subgraph of the GO directed acyclic graph (DAG) (12). This representation of function was used to establish the importance of considering homology relationships in a phylogenetic context. In this paper, we introduce more accurate and sensitive functional distances based on diffusion-type manifold embeddings of GO annotations to explore the structure–function relationship in detail.
Manifold embedding techniques are based on kernels (see definition in Materials and Methods), which have already been successfully applied to various problems in bioinformatics (13). In particular, computational approaches aimed at integrating various data sets have explored the effect of adding GO kernels for use in subsequent classification by SVM (14). Although our approach also employs kernels defined on GO, there are several fundamental differences. Most importantly, we apply these kernels to quantify functional distances as opposed to applications centered on classification of data into specific categories. Moreover, our approach naturally extends the notion of functional distance to protein domains by using the geometric interpretation of the manifold embedding (see Materials and Methods). Finally, we apply functional distances to exploring coevolution of sequence, structure, and function.
Functional distances defined here via diffusion-type manifold embedding techniques allow for increased sensitivity and arbitrary levels of granularity. Using our measures of functional distance, we can estimate the average divergence of function with respect to structure, sequence or phylogenetic similarity. Although clearly an area of active research, we show that functional distances are already accurate enough to discover specific relationships between protein domain functions. Finally, we show how functional distances can be used to explore divergent as well as convergent evolution.
| Results |
|---|
|
|
|---|
Thus, the basic idea behind building an appropriate kernel is that GO terms shared by few protein sequences will be assigned small local distances or equivalently high values of local similarities. Alternatively, general annotations appearing at the top of the ontology will be assigned large local distances (or small similarities). Using the intuition outlined above, we form a graph where weights represent local similarities and use several techniques of manifold and graph embedding to calculate global distances between functional annotations. Embedding strategies exploit the underlying geometry of the graph and can implicitly correct ambiguities in the ontology. Finally, we use a global measure of distance between GO terms in combination with representation of domain function as a GO subgraph (12) to compute meaningful functional distances between protein domains [see Materials and Methods and supporting information (SI) Text].
Correlating Functional Distance with Sequence, Structure, and Phylogenetic Proximity. We use the well known correlations of function with sequence, structure (12), and phylogenetic profiles (15) to evaluate the efficacy of using manifold embedding to quantify functional relationships between domains. The embedding procedure involves defining local similarity weights as described above and using them to form a kernel (the types of kernels used here and their direct relation to the notion of manifold embedding are described in Materials and Methods). The choice of kernel is arbitrary, but integrally important in the definition of distance. Thus, we compared the performance of several kernels in their ability to accurately represent functional distance between protein domains. We report results for four different choices of kernels. The first three are formed by diffusion-type kernels, whereas the fourth is similar to previously proposed shortest distance between GO annotations (9).
We use Z scores (2) from DALI (16) to quantify structural proximity, BLAST (1) for sequence similarity and mutual information (MI) between phylogenetic profiles (15) for phylogenetic similarity (see Materials and Methods). We find that functional distances between protein domains calculated using diffusion-type kernels correlate well with sequence alignment, structural proximity and phylogenetic similarity (Fig. 1 a–c). Importantly, the dynamic range of the correlations is very large and the averaging due to binning almost insignificant. On the other hand, the distance metric based on the shortest path algorithm shows no significant correlation with either homology or phylogenetic similarity (Fig. 1d). A clear benefit of developing a rigorous functional distance metric is the comparison of functional information in sequence, structure alignment, and phylogenetic profiling.
|
|
Consistent with the explanation presented above, both the structural alignment and sequence alignment show increasingly sharp transitions when applying the LLE kernel, followed by the diffusion kernel with power m = 7 and at last the inverse Laplacian kernel (Table 1). Thus, manifold embedding of GO can produce a functional distances at needed resolution by choosing a kernel appropriate to the specific application. For maximum resolution at small functional distances, the LLE kernel is most appropriate, whereas maximum resolution at long distances can be achieved by using the inverse Laplacian kernel. However, as expected, the qualitative behavior of the correlations remains the same for all choices of diffusion kernels.
Building a Functional Domain Universe Graph. Next, we wanted to explore whether our definition of functional distance that correlates on average with sequence, structural, and phylogenetic similarities (Fig. 1) is accurate enough to yield biologically meaningful insights into the structure–function relationships of specific protein domains. We begin by creating a graph where nodes are domains colored by SCOP (17) fold annotation, and edges represent functional proximity calculated using the diffusion kernel (with m = 7). The graph is transformed into an unweighted version using an empirically derived threshold of F = 0.23. The resulting graph (Fig. 2a) illustrates both the specific functional relationships between individual domains and global relationships between folds and functions.
|
Although the graph shows separation of domains by fold and function, the structure–function relationship is clearly multifaceted. Functional clusters are not entirely monochromatic, e.g., functions are usually fulfilled by domains of several different folds. Some folds are also multifunctional and appear in clusters that are far from each other, e.g., Ferredoxins. Other folds are more functionally exclusive and only participate in clusters that are in close proximity, e.g., TIM beta/alpha barrel are mostly enzymatic functions. Finally, it appears that this representation of relationships between protein domain functions captures the separation of folds into functionally related superfamilies (17).
Exploring Structure–Function Coevolution. Interestingly, there are certain domains that link proximal clusters. These domains may represent the intermediates in the evolutionary path from one function to another. For example, consider two clusters (labeled B and E on Fig. 2a) populated mostly by 3-helical bundle domains. The B cluster contains domains responsible for DNA binding. Domains in this cluster bind to DNA nonspecifically (a representative structure is 1hlv, which is a centromere binding protein; ref. 18). On the other hand, the cluster labeled E is dominated by domains with the same 3-helical bundle structure, but those that bind to specific DNA sequences. These are mostly domains that carry out transcription initiation activity (a representative structure is the engrailed transcription factor 2hdd; ref. 19). Interestingly, there is one domain that also has the structure of a 3-helical bundle that is functionally proximal to both clusters and appears as the connecting hub. This domain is coded by a family of gamma-delta resolvases (1gdt; ref. 20). This is a family of proteins that binds to imperfectly conserved sequences (21) (Fig. 2b).
Clearly, sequence binding specificity is not explicitly described by GO. However, the 3-helical bundles are a remarkable example of how GO embedding and the subsequent graph theoretical treatment can uncover relationships between structures by placing their functions in biological context. Subsequent application of evolutionary trace methods to the three families can uncover the residues responsible for the differential binding specificity of the 3-helical bundles and their mutational dynamics.
Specificity of DNA binding in 3-helical bundle domains is an example of divergent evolution where sequences are related by common ancestry (22). On the other hand, convergent evolution is often defined as two proteins with no apparent homology performing the same function (22). An additional benefit of defining functional distances is that we can easily detect instances of convergence by examining domains with close functional distance and no structural similarity. For example, using functional distances, we easily confirmed the well documented case of convergence of tRNA synthases [1pys (23) and 1a8h (24), F score = 0.001 and Z score < 2].
| Discussion |
|---|
|
|
|---|
As an example of specific insights that can be uncovered using the proposed distance metric, we explore functional relationships between 3-helical bundle domains which form two clusters in function space. These functional clusters turn out to be separable by the specificity of DNA binding. The family of sequences that are functionally similar to both clusters binds with intermediate specificity. We were also able to confirm examples of convergence where domains sharing close functional proximity appear to have evolved independently. Further exploration of this representation of the protein domain universe will undoubtedly uncover many more insights into the relationship between evolution of structure and function.
Kernel-based functional distance metrics have several important advantages over previously described methods (14), Euclidean measures (12), and shortest path algorithms (9). First, the diffusion-type manifold embedding techniques give rise to distances taking into account both the geometry of the ontology and intrinsic biases in annotations in a robust way (insensitive to small amounts of noise). In particular, distances between subgraphs of annotation (e.g., those representing protein domains) have a clear geometric interpretation. Secondly, manifold embedding learns distances between annotations, rather than using kernels for classifications or defining distances between genes. Consequently, this approach is more natural for evaluating and comparing relationships between sequence, structure, and function as opposed to previous metrics that focused on applying GO kernels as part of a heterogeneous dataset for classification of protein–protein interactions (14). As a result, these methods are significantly more general and can be applied in calculations of functional distances between arbitrary numbers of genes. Additionally, techniques presented here can be easily adapted to other ontologies. Finally, correlations with sequence, structure (2, 15) and phylogenetic proximity (Fig. 1) show that metrics based on diffusion-type manifold embedding are significantly more accurate than previously proposed measures (9).
Having the ability to estimate "distance" in function space is fundamental to computational biology in the postgenomic era. A variety of computational tasks including assessment of annotation accuracy from homology modeling and module detection from microarray data can be facilitated by an accurate measurement of functional relationship between genes.
| Materials and Methods |
|---|
|
|
|---|
lerman/supp/protein_distance. More specific details of the methods are discussed in SI Text. Annotating Each Structure as a Subgraph on the GO. To annotate structures using GO (6), we use the strategy (12) of collecting all annotations for sequences (from NRDB; ref. 26) that fold into the structure and reconstructing all paths up to the root of the GO DAG.
Local Similarities Between GO Annotations. Formally, we form a unified graph G whose nodes are all annotation of GO appearing in protein domains and whose edges are the union of all edges of subgraphs representing protein domains. The local similarity weight wij on an edge connecting annotation i and j is defined as follows: wij = 1/nij where nij is the number of domain subgraphs containing that edge.
Similarities by Diffusion and LLE Kernels.
A (positive definite) kernel K for the unified graph is a real symmetric matrix whose size is N, the number of vertices of the unified graph, and whose eigenvalues are nonnegative. Its elements Ki,j represent local similarities between corresponding graph nodes (i and j). The diffusion kernels are based on local diffusion process on the unified graph. We first normalize the local similarity weights defined above by the degree matrix D, which is defined as follows:
|
|
The normalized matrix
|
|
represents local transition probabilities between GO annotations. Its symmetric version is
|
|
Following Coifman et al. (10), the diffusion kernel of "power" (or transition step) m is the matrix
|
|
A related diffusion kernel, suggested by Ham et al. (27), is formed by taking the pseudoinverse of the graph Laplacian, that is,
|
|
A similar LLE (28) (local linear embedding) kernel is obtained by following Ham et al. (27): We denote by e the uniform column vector of size N and length 1, that is, its elements are
. We then set
|
|
Finally, we denote by
max, the largest eigenvalue of M and form the LLE kernel by the formula
|
|
Other forms of diffusion kernels (29, 30) are described in SI Text.
Distances Between Annotations and Their Relation to Manifold Embedding.
Given a kernel K, we compute the distance d(x, y) between GO annotations x and y as follows:
|
|
This formula has a straightforward interpretation. Any kernel K can be written in the form: K(x, y) =
F(x), F(y)
, where F embeds the graph vertices into a Euclidean space (usually referred to as feature space). Consequently, Eq. 1 can be written as:
|
|
The distance d(x,y) thus represents the Euclidean distance between the embedded annotations (in feature space).
Assuming that the graph approximates a low-dimensional manifold or another continuous geometric structure, we view the graph embedding, F, as an approximation to a corresponding manifold embedding. The embedding and its corresponding distance are determined by the choice of kernel, which reflects geometric properties of the underlying graph or manifold. Indeed, when applying the diffusion kernel of power m (10), the corresponding distances measure the rate of connectivity between vertices according to paths of length m. The distances obtained by the inverse Laplacian represent the expected time to travel from one vertex to another vertex and then back to the original vertex (27). The LLE distance is similar to a diffusion kernel with low powers. The corresponding LLE embedding tries to preserve local distances to nearest points along the graph (see SI Text).
In the SI Text, we discuss efficient numerical evaluation of the functional distances for different kernels and large N.
The geodesic distances were calculated using Dijkstra's algorithm on the global GO graph (with local distances nij).
Distances Between Subgraphs Representing Protein Domain Functions.
We define the distance d(x,A) between a node x and the set of vertices A as follows:
|
|
The distance between the two sets of vertices A and B is then computed using the formula:
|
|
Variants of this "distance" and their properties are discussed in refs. 31 and 32.
Phylogenetic Similarity Between Protein Domains Based on Phylogenetic Profiles (P Score).
We evaluate the phylogenetic similarity between structures by BLASTing (1) the set of nonredundant sequences found to fold into each domain against all fully sequenced genomes. The similarity between any two domains is then just the empirical mutual information, MI, between their phylogenetic profiles (15). If x and y are two phylogenetic profiles, then
|
|
where pij(x, y), i, j = 0, 1, describe the frequencies of occurrence of all four possible combinations of presence (i/j = 1) or absence (i/j = 0) in the same genome for the two domains, pi(x), i = 0, are the frequencies of occurrence (i = 1) or absence (i = 0) in profile x and pj(y), j = 0, 1, are defined similarly. MI will be maximal if p00(x, y) = p11(x, y) = 0.5. That is, half of the terms of the two phylogenetic vectors are perfectly correlated (as a measure of nonorthologous gene displacement; ref. 33), whereas the terms in the other half are perfectly anticorrelated.
Curve Fitting (Fig. 1).
All curve fitting was done using Origin 7 SR1 (www.originlab.com). Exponential decay was modeled using the equation
|
|
The values of T when correlating functional distance with structure and phylogenetic similarity are reported in Table 1. The correlation between sequence alignment and functional distance was modeled by the logistic function:
|
|
Here, the slope reported in Table 1 is simply
|
|
All fitted functions had coefficients of determination in the range 0.89 < R2 < 0.97.
| Acknowledgements |
|---|
|
|
|---|
| Footnotes |
|---|
Abbreviations: GO, Gene Ontology; DAG, directed acyclic graph.
To whom correspondence may be addressed. E-mail: lerman{at}umn.edu or borya{at}bu.edu
Author contributions: G.L. and B.E.S. performed research and wrote the paper.
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0702965104/DC1.
© 2007 by The National Academy of Sciences of the USA
| References |
|---|
|
|
|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||