New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Curvature of colinks uncovers hidden thematic layers in the World Wide Web

Communicated by Joel L. Lebowitz, Rutgers, State University of New Jersey at New Brunswick, Piscataway, NJ (received for review October 4, 2001)
Abstract
Beyond the information stored in pages of the World Wide Web, novel types of “metainformation” are created when pages connect to each other. Such metainformation is a collective effect of independent agents writing and linking pages, hidden from the casual user. Accessing it and understanding the interrelation between connectivity and content in the World Wide Web is a challenging problem [Botafogo, R. A. & Shneiderman, B. (1991) in Proceedings of Hypertext (Assoc. Comput. Mach., New York), pp. 63–77 and Albert, R. & Barabási, A.L. (2002) Rev. Mod. Phys. 74, 47–97]. We demonstrate here how thematic relationships can be located precisely by looking only at the graph of hyperlinks, gleaning content and context from the Web without having to read what is in the pages. We begin by noting that reciprocal links (colinks) between pages signal a mutual recognition of authors and then focus on triangles containing such links, because triangles indicate a transitive relation. The importance of triangles is quantified by the clustering coefficient [Watts, D. J. & Strogatz, S. H. (1999) Nature (London) 393, 440–442], which we interpret as a curvature [Bridson, M. R. & Haefliger, A. (1999) Metric Spaces of NonPositive Curvature (Springer, Berlin)]. This curvature defines a World Wide Web landscape whose connected regions of high curvature characterize a common topic. We show experimentally that reciprocity and curvature, when combined, accurately capture this metainformation for a wide variety of topics. As an example of future directions we analyze the neural network of Caenorhabditis elegans, using the same methods.
The World Wide Web (WWW) is a graph (1, 2) that is continuously being expanded by an enormous number of independent agents. Millions of users add pages in an uncoordinated way, but in doing so they must obey clear rules on how to address other pages (3). Beyond the information placed by users directly in individual pages, their unified action creates a cooperative metainformation whose location is in the connectivity of the network. By metainformation we mean information that is based solely on the graph structure of the WWW, and which reveals strong contextual grouping. This type of information should be contrasted to that of a very informative but secluded page that is an archive of information (a “wise hermit”). Further on, we will compare our new metainformation to earlier work.
The billions of pages and links of the WWW create a maze of confusing and almost unmanageable complexity. To navigate through it and to find pages of interest, users rely on the help of search engines (4). These send out small, automated programs (robots) that roam the Web, retrieve links, and note keywords as they create a mirror image of the Web that can be stored within one room of storage data, ready for easy access. The Web remains intelligible to its human users precisely because it is constantly analyzed and monitored by these automatic agents. Robot activity dominates Web traffic and accounts for up to 70% of the total queries made at certain sites we studied.§ Because of the breadth of their search, robots obtain a clearer view of the Web than the common user.
It is therefore useful to adopt the perspective of robots, without having to define meaning or content, looking only at the identification tag of pages (URL addresses) and their links (5). To avoid the difficulty of having to know the complete Web graph, we use robots that explore some locally visible pieces that can be found by starting from a given page and progressing along its links. An analogy may be drawn with classical neurobiology (6), where a neuron takes up dye that then traces out its connections.
Defining a Curvature on the Web
Having defined our aim of revealing metainformation in the form of hidden thematic structures, we implement it by a novel combination of a number of key ideas: clustering, colinks, triangles, and curvature. The clustering regroups a “home” page with the pages in its subdirectories. The colinks are then the result of independent agents linking their pages. Triangles capture transitivity, which we measure by the associated notion of curvature. By using curvature, we show that practically all the connective information depends on colinks. Thus, our method reveals a new geometric view of the locus of information in networks, which is not “put there” but “happens” as a collective effect.
In detail, we proceed as follows: The first step of our method is clustering.¶ Once a certain number of pages have been found by our robot, we regroup them into a node by lumping together a “home” page with the pages in its subdirectories. This clustering is useful because (i) information within a given home site belongs to one contextual heading; (ii) the size of the graph is reduced by a factor of 10^{2}10^{3}; and (iii) connective links that arise from physical proximity are eliminated, leaving the more relevant remote links.
Working henceforth exclusively with the clustered graph (like the one on the right of Fig. 1), we introduce the pivotal notion of colink for a reciprocal connection between two nodes, A and B (A points to B and B points to A, the red link in Fig. 1). They are then deemed to be “congruent” (friends, or members in the same interest group).
Colinks are special because, whereas we can point to your page we cannot (in general) cause you to point to ours. Congruence therefore indicates an awareness by both nodes of the other's existence and content, and a recognition of the other's value to their own interests. Because congruence involves mutual recognition, we expect it to be a transitive property, i.e., if sites A and B are congruent and B and C are congruent, then A and C will be congruent with high probability, forming a triangle. Such triangles signal strong cooperative content, and would be extremely rare in any large random graph whose average number of links per node is bounded (7). To quantify the aggregation of triangles with congruent edges, we define the local curvature at a node n by c_{n} = 2t_{n}/((v_{n} − 1)v_{n}). This quantity was termed the clustering coefficient in ref. 8. Here, t_{n} is the number of triangles containing n as a corner, v_{n} is the number of links leaving n (the valence), and (v_{n} − 1)v_{n}/2 is the maximal number of possible triangles. We ignore in this the directionality of oneway links. If the graph is a tree, there are no triangles, and c_{n} = 0 everywhere, whereas for any complete graph (where all nodes are mutually connected), c_{n} = 1 everywhere.
To see that c_{n} is indeed a quantity that can be called a curvature, we regard the simple geometric picture of Fig. 2: One way to detect that we are on a curved surface by using local measurements is to walk a unit distance on a straight line in two different directions separated by a constant angle α. Connecting the two endpoints completes a triangle, and the length of the third edge is determined by the curvature. On surfaces this relation of triangles to curvature is a natural generalization of the law of cosines, relating the angle at the apex and the three sides. For general metric spaces, such as infinite graphs, this relation is the basis of a mathematical definition of curvature (9) obtained by comparing triangles in a continuous metric to their counterparts embedded in standard manifolds.
The average number of triangles c_{n} at a node n is related to the average distance between any two of its nearest neighbors (say n′ and n"). This distance is measured by counting links in the shortest path from one node to another, namely n′ and n". Because for colinks this is either 1 (n′ and n" directly connected, triangle exists) or 2 (no triangle, need to go through node n), c_{n} = 2−〈r(n′,n")〉, where r is the distance. The average 〈⋅〉 is taken over all pairs of nearest neighbors so that the average triangle has sides 1, 1, 〈r〉. In principle, for oneway links the distance r may be a large number and can range all the way to infinity (no connection). Experimentally, we can measure triangles not only in the colinks but also in the oneway links. In that case, we ignore the direction of the link when measuring the distance r, and we shall still use c_{n} as measured by triangles to give us an upper bound on the connectivity.
We are not aware of a definition similar to the one of ref. 9 for finite graphs with discrete metric, but it is still useful to view c_{n} as the curvature of triangles with apex at n (though c_{n} is normalized to take values between 0 and 1),‖ with c_{n} being perhaps closer to a dimension measurement (10). Equipped with a metric (minimal number of hops between nodes) and a curvature, the graph of the WWW can be visualized geometrically. We shall say that it looks like a hyperboloid when it is treelike, with exponential separation of branches that “fan out” (ref. 11; see Fig. 3). In contrast, a highly connected region looks more like a sphere that is “closed” on itself. Alternately, a random walk by the robot in the graph will tend to get trapped somewhat longer in highly curved regions.
Triangles tend to aggregate, creating interest groups of widely varying size (in terms of the number of members), but of small diameter. In general, the border of interest groups is expected to be a sharp interface, but sometimes different interest groups will connect one to another, typically via a single link (e.g., the member of the “butterfly collection” interest group who also belongs to the “origami” interest group). Our definition of curvature implies that such connecting nodes tend to be hubs and are characterized by a low curvature.
The metainformation of the graph has been noted before (12) and has been used for data mining in the influential works (13, 14). These approaches view a link as conferring “authority” and search the WWW for “Authorities” and “Hubs” that point at them. Based on these ideas the successful search engine Google was designed; it also led to the notion of “Community” (15, 16). Similar investigations of the information content of linkages are common in other fields, such as studies of social (17) or paper citation (18) networks. Our approach, however, emphasizes different features. Indeed, note that “authorities” as defined above tend to have low curvature, because belonging to two interest groups basically halves the curvature for the connecting node. On the other hand, nodes that are mutually recognized by peers get high curvature. Thus, curvature captures contextual vicinity rather than notoriety of a set of pages.
Experimental Implementation
We tested our general ideas in several Web crawls, encompassing over 1 million URLs of which over 300,000 were actually accessed. The database is stored for offline analysis, whereas intermediate results are used to direct the further action of the robot. Large numbers of links in a page (indicating the page is a hub) tend to stall the crawl, and we followed at most 35 links per page (neither .cgi nor .jpg was explored). As an aside, we note that .com sites seem to make little contribution to “cooperative content” and connect badly to others.
As the crawl proceeds, clustering according to Fig. 1 is implemented. To that end, links are determined to be local or remote. A local link is basically one inside a given site and any other link is remote. This decision is usually made by looking at the URL and using the known conventions for naming of pages inside a home directory (details can be obtained from the authors at http://mpej.unige.ch/∼eckmann). The algorithm explores a page it encounters by following the local links down to depth d (and perhaps u upward, links like “back to home page”).
Crawls are initialized by choosing a first site (the anchor) and fetching it. The links in a fetched page are identified and entered into the database. Our robot then fetches all the pages to which these links point. After exhausting the search down and up, the database is clustered and the robot is sent out along the first remote link. It explores that node in a similar way, going down and up the newly found local links as before. Once that is done, the robot returns to the first node and follows the next remote link, until all remote links at distance r = 1 from the anchor have been followed. The robot can then be directed to explore r = 2 or sequentially larger distances, if needed.
The breadth of crawls is shown in Table 1, whereas results from specific crawls are shown in Fig. 4. The curvature was calculated for all nodes within a distance r = 1 of the anchor, and for three types of graphs. The “full” graph includes all links that were found in the crawl, the “congruence” graph is composed solely of colinks (the red link of Fig. 1), and the “oneway” subgraph is obtained from the full graph by removing all colinks (leaving only the black links).
Results
Our main result, shown in Fig. 4, is that the average curvatures of the congruence and full graphs are very similar, whereas that of the oneway graph is lower by a factor ranging between 2.5 and 21. This finding means that removal of the colinks results in a dramatic decrease of the curvature. The curvature of the full graph is carried predominantly by triangles containing at least one reciprocal side. This curvature is not produced by triangles with three colinks alone, because of their small relative number, but rather from triangles that have a combination of co and oneway links. To put the results of Fig. 4 in perspective, we note that the density of triangles (and average curvature as well) of a random graph tends to zero with the size of the graph (7).
For all experiments, the distribution of curvatures c among the nodes of the full graphs obeys a power law, c = (2.0 ± 0.2)v^{−1}, where v is the valence of the node. Although the scaling is striking, we are not certain of its origin. It does cause highly connected nodes with high v such as hubs and authorities to have, on average, a low curvature. This finding agrees with our intuition that such nodes contribute perhaps to global connectivity but not to the local interest groups that curvature identifies.
Although we consistently avoided any reference to content of sites in the crawl itself, to check our method a separate, objective criterion was needed. We therefore checked manually whether the colinks reveal contextual linkages, by reading in 784 nodes from our crawls (in most cases checking the name of the node sufficed). Indeed, we found that no less than 75–100% of nodes congruent to the anchor relate to the same topic.
Extensions and Discussion
Going beyond the WWW, we examined the efficiency of curvature as a measure of thematic cohesion by studying three other networks. These are the neuronal network of the nematode Caenorhabditis elegans, the protein–protein interaction network of the yeast Saccharomyces cerevisiae, and the citation network of the mathematical physics archive (http://www.ma.utexas.edu/mp_arc). In each case we used an existing database to access and reconstruct the network, and as a control each had a separate, objective criterion by which related nodes were grouped into “themes.”
Our analysis of the neural network of the worm relies on classic work (19) that used about 8,000 electron microscopy sections to trace directed neuronal connections (R. Durbin, http://elegans.swmed.edu/parts/neurodata.txt). The control classification is the currently accepted association of neurons with organs found by both morphological and connective data (19, 20), where the existence of triangles was indeed noted, though only incidentally.
In Fig. 5 we present the network of the brain of the nematode. We immediately note three connected components of high curvature, colorcoded to display the association with the major neural circuits or organs. The elevated groups are the amphids, the motor neurons of the nerve ring, and “other” sensory neurons of the head. The other circuits are either lowcurvature (egglaying and motor neurons of the ventral cord) or absent from the compilation (tail ganglia). The amphids and nerve ring are particularly well separated into connected components of the congruence graph. Furthermore, we find that interneurons [neurons that act like central processing units (CPU)] carry more curvature than sensory (input) or motor (output) neurons, which agrees with our intuition of how connections should distribute within the brain. Most colinks (61%) connect neurons that are within the same functional circuit (72% for colinks within triangles) whereas most of the oneway links (58%) connect between two different organs.
For the protein–protein interactions of the yeast we took the protein–protein interaction database (http://dip.doembi.ucla.edu/dip/Download.cgi), treating any interaction as a colink (i.e., there are no oneway links). The control was taken from the functional classification given in the Yeast Protein Databank (YPD) (http://www.proteome.com/databases/YPD/YPDcategories/Functional_Categories.html). We found that proteins within the same connected groups of high curvature almost always (77% of the cases) belonged to the same functional grouping of the YPD. At c_{n} ≥ 0.15 there are 34 such groups, that include between 3 to 18 members, with an average of 5 proteins per group (data not shown; see supporting information, which is published on the PNAS web site, www.pnas.org).
In the case of the mathematical physics archive, we used the archive (http://www.ma.utexas.edu/mp_arc) to define a network of citations between authors in the same field. A colink exists if author A cites author B (in any paper within the archive) and author B cites author A in any paper of his/hers, while coauthors are automatically colinked. Note that our definition of colinks is fundamentally different from the cocitations defined in ref. 18, because there no reciprocal recognition exists. The control we used here was more problematic, relying on scanning manually all conference pages listed at the International Mathematical Physics Association homepage, and including an author in a group (e.g., “Quantum mechanics”, “Field theory” or “Statistical mechanics”) if she/he spoke under that topic in any of the conferences. However, we were able to classify only about 400 of the 1,700 authors (because not every author is talking at these conferences). As before, high curvature selected out the groups we found manually, with 5 of the 20 or so topic groups we defined very strongly represented and clearly identified (data not shown; see supporting information). Again, authorities, which in this case are very polyvalent scientists, have lower curvature than specialists in a small field and serve rather as hubs.
Conclusion
Our discovery opens a number of new possibilities for studying highly assembled complex structures in general, perhaps even the brain. First, the introduction of geometry frees us from the subjectivity of contextual concepts, such as meaning, content, and the like. The locality develops a different direction than global notions like scaling in the WWW and the “smallworld” effect. Second, the concept of congruence shows the usefulness of combining geometry with selective action in probing sophisticated structures, including C. elegans. In particular, we have seen how a very simple dynamical rule—the search for congruence—radically changes our global view of the Web, replacing statistical aspects with unsuspected relations that can be found only by looking at combined, independently built pathways in the Web. Our approach can certainly be refined and extended in the Web but should prove useful in many other contexts.
Our study further shows that robots perceive the Web differently than the people who actually write and use the pages. Our robots go back and forth between various sites, gaining a more coherent view of their relations. We have shown that the geometric properties of the space in which they roam and the landscape that they reconstruct reveal new connective metainformation, hidden from the common user.
What can one expect in the future? Other forms of metainformation, not necessarily arising from connectivity, will certainly be found in the WWW and other complex networks. The dynamics of developing interest groups is also an important issue (21) that may involve rapid fusion processes (22). From a mathematical perspective, we have demonstrated the need for concepts of local curvature in graphs. Our definition of curvature seems a useful beginning to elicit Web properties and is easily generalized to balls of radius r = 2 (nextnearest neighbors) or more, as well as to simplices like tetrahedra. More advanced geometric concepts, however, such as measuring noncommutativity in the order of visiting sites, might be needed in the future.
Acknowledgments
We are grateful for numerous discussions with our colleagues in Geneva and Rehovot, and for special help from A. Haefliger and D. Ruelle. We also thank P. Blekken for help with Coin. This work was partially supported by the Fonds National Suisse and the Minerva Foundation (Munich).
Footnotes

↵† To whom reprint requests should be addressed. Email: eckmann{at}physics.unige.ch.

↵§ These findings are based on a random spot check of traffic in three sites: the molecular biology database (www.expasy.ch), the Weizmann Institute (www.weizmann.ac.il), and the mathematical physics archive (www.ma.utexas.edu/mp_arc).

↵¶ Note this is a different kind of cluster than that described by the “clustering coefficient” of Watts and Strogatz (8).

↵‖ A very accessible exposition of these ideas can be found in the Balzan Lectures of M. Gromov: http://www.balzan.it/english/pb1999/gromov/paper.htm.
 Received October 4, 2001.
 Accepted February 14, 2002.
 Copyright © 2002, The National Academy of Sciences
References
 ↵
 ↵
 Broder A,
 Kumar R,
 Naghoul F,
 Raghavan P,
 Rajagopalan S,
 Stata R,
 Tomkins A,
 Weiner J
 ↵
 Barabási AL,
 Albert R
 ↵
 ↵
 Pirolli P,
 Pitkow J,
 Rao R
 ↵
 Cajal R
 ↵
 Bollobas B
 ↵
 Watts D J,
 Strogatz S H
 ↵
 Bridson M R,
 Haefliger A
 ↵
 ↵
 ↵
 Botafogo R A,
 Shneiderman B
 ↵
 Kleinberg J M
 ↵
 Brin S,
 Page L
 ↵
 Gibson D,
 Kleinberg J M,
 Raghavan P
 ↵
 Flake G W,
 Lawrence C,
 Lee Giles C
 ↵
 Wasserman S,
 Faust K
 ↵
 ↵
 White J G,
 Southgate E,
 Thomson J N,
 Brenner S
 ↵
 Wood W B
 ↵
 ↵