User-controlled mapping of significant literatures
Abstract
We apply a version of our web-based literature-mapping system to PNAS for 1971-2002, as indexed by the National Library of Medicine and the Institute for Scientific Information. Given a single input term from a user (a medical subject heading, a cocited author, or a cocited journal), the system, called pnaslink, rapidly displays views in which that term and the other 24 terms that most frequently co-occur with it in a bibliographic database are interrelated in ways suggesting fruitful combinations for document retrieval. The interrelationships are produced by two algorithms, pathfinder networks and Kohonen-style self-organizing maps. pnaslink displays are themselves interactive interfaces that can retrieve documents from digital libraries (e.g., PNAS Online). This style of visualizing knowledge domains is called “localized” because it does not attempt to map the indexing of literatures in full but concentrates on the top terms in an “associative thesaurus” reflecting user interests. It also permits swift remappings, as the user recognizes terms worth pursuing. pnaslink is illustrated with maps drawn from the literature of population genetics. Some comparative and evaluative comments are added, one from a domain expert indicating that the face validity of the system may be tempered by insufficient specificity in the indexing terms being mapped.
We here present two ways of rapidly mapping literatures in terms of selected indexing vocabularies. Both ways are responsive to users, and either can serve as an interface for retrieval of documents from digital libraries. Either can also complement a work that focuses on the structure of a literature, such as a research review (1). Our data are the contents of PNAS for 1971-2002, as described by medical subject headings from the National Library of Medicine (NLM) and by citation indexing from the Institute for Scientific Information (ISI).† Indexing by these organizations typifies the bibliographic control that is extended only to significant, that is, highly valued, literatures. Our software, called pnaslink (available at http://project.cis.drexel.edu/pnas), is designed to amplify such control, by enabling customized browsing on the basis of user input.
Both of our mapping techniques exploit co-occurrences of terms in NLM and ISI bibliographic records. The terms are systematically paired, and their co-occurrences are counted in matrices. Because people can easily assimilate numeric matrices only when they are recast as pictures of some kind (2), one of our techniques transforms the counts into Kohonen-style self-organizing maps (SOMs) (3), and the other transforms them into pathfinder networks (PFNETs) (4). SOMs show frequently co-occurring terms as nodes that are spatially close. PFNETs show them as nodes with explicit ties. The two kinds of maps will be exemplified here with medical subject headings (MeSH) and cocited authors in a specialty of genetics.
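The pairing and counting step can be illustrated with a minimal Python sketch. It is not the noah engine's implementation; the function name `cooccurrence_counts` and the three sample records are hypothetical, chosen only to show how unordered term pairs are tallied per bibliographic record.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how often each unordered pair of terms appears in the same record."""
    counts = Counter()
    for terms in records:
        # Deduplicate and sort so each pair is counted once per record
        # and is always stored in the same (alphabetical) order.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical records, each a list of indexing terms (not real PNAS data).
records = [
    ["Alleles", "Gene Frequency", "Mutation"],
    ["Alleles", "Gene Frequency"],
    ["Evolution", "Mutation"],
]
pair_counts = cooccurrence_counts(records)
print(pair_counts[("Alleles", "Gene Frequency")])  # prints 2
```

The resulting pair counts are the entries of the symmetric co-occurrence matrix from which both map types start.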
Other researchers have visualized bibliographic data with PFNETs and SOMs (2, 5), but we use them to map significant literatures in real time with retrieval capabilities built into the maps (refs. 6 and 7 and cf. ref. 8). The data are initially processed by our noah indexing engine, a specialized database application we designed for fast computations with verbal co-occurrence data (9). With noah, mapping time is determined by the size of the indexing vocabulary, not by the number of documents in the database. In conceptlink, a predecessor of pnaslink, for example, we can almost instantly create maps of MeSH terms from >12 million MEDLINE records. (conceptlink maps the co-occurring MeSH indexing of the journals in NLM's PubMed. It is available at http://project.cis.drexel.edu/conceptlink.) Once the data are indexed, the user can map and manipulate them through a unified web interface.
The maps (Figs. 1 and 2) are based on term counts solely from PNAS records because we were set that task as participants in this colloquium. Elsewhere, we have mapped terms drawn from the NLM and ISI databases in full.‡ However, even by itself PNAS is a major interdisciplinary resource, and we can easily imagine PNAS online or other journal-specific web sites offering domain visualizations like ours for the benefit of users.
In refs. 6 and 7, we discussed authorlink (http://project.cis.drexel.edu/authorlink), the version of our software that is used to map cocited authors from ISI's Arts and Humanities Citation Index for 1988-1997. The present article attests to the quick adaptability of the conceptlink/authorlink software to the authors, journals, and MeSH terms in the PNAS data. This in turn prompts us to offer a rationale for our general approach, the localized mapping of associative thesauri, along with different accounts of the two main algorithms. We describe certain interactive features of our system and conclude with some fresh comparative and evaluative data, including an expert's commentary on pnaslink as applied to the domain of population genetics.
Localized vs. Global Mapping
In their extensive review, Börner et al. (5) emphasize that “painting a big picture” is a main goal in domain mapping. This may lead to a strategy of mapping very large co-occurrence matrices in their entirety. Indeed, system designers have made many significant developments in software for such global portrayals of literatures, e.g., themescape and vxinsight render literatures as landscapes; galaxies and starrynight render them as astral bodies (10-12). Ours, however, is an alternative way of visualizing knowledge domains, the localized mapping. Perhaps the chief difference is that the localized approach relinquishes scope to increase the user's control of the mapping process.
Table 1 helps to sharpen this comparison. In global mapping, system designers present the user with a preformed view, often in 3D, of some sizeable literature. Within the panel of visualization, landscapes invite flyovers; star-fields or other constructs invite flythroughs. In the former, peaks representing major accretions of documents on some subject are likely to exert a powerful pull on the user; in the latter, document points coded as important, e.g., by differences in shape, size, or color, exert a similar pull. Essentially, the user is engaged in old-fashioned browsing, as of book titles in library stacks, but system designers may minimize or even eliminate labeling of objects in the map because labels clutter precious screen space and block the metaphorical presentation (see examples in ref. 12). The user explores the view by “visiting” or “homing in on” objects of interest, rather as in video games, but typically cannot remap the literature in pursuit of some new interest because a new map takes hours of computer time to create.
In contrast, our localized system of mapping more closely resembles online searching. The user starts the process by entering a single term at a web interface. This is consistent with the way most people search the web (13) and is intended to minimize cognitive demands on users. It is true that pnaslink must be entered with MeSH or ISI-style terms instead of whatever word pops into the user's head, but our system includes guides that help one make the proper entries. The system responds to the entry (or “seed”) term by forming a list of the terms that co-occur with it, ranked high to low by frequency. The seed term and its 24 next-highest neighbors are then exhibited as a PFNET or a SOM, which the user can switch between. Each of the two modes of mapping in pnaslink yields different insights into the relations of the indexing terms. Both modes place the user's seed term in the locale of a limited number of other terms that are guaranteed to co-occur with it, thus customizing browsing. Any of these terms, if selected by the user, will be automatically “ANDed” with the seed term in retrieving documents.
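The ranking step that follows entry of a seed term might be sketched as below. This is an illustration under stated assumptions rather than pnaslink's code: the function name `top_neighbors`, the alphabetical tie-breaking rule, and the sample records are all hypothetical.

```python
from collections import Counter


def top_neighbors(seed, records, k=24):
    """Return up to k terms that co-occur with the seed, ranked high to low."""
    counts = Counter()
    for terms in records:
        unique = set(terms)
        if seed in unique:
            # Every other term in the record co-occurs once with the seed.
            counts.update(unique - {seed})
    # Assumed tie-break: alphabetical order among terms with equal counts.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical records (not real PNAS data).
records = [
    ["Alleles", "Gene Frequency", "Mutation"],
    ["Alleles", "Gene Frequency"],
    ["Evolution", "Mutation"],
    ["Gene Frequency", "Evolution"],
]
neighbors = top_neighbors("Gene Frequency", records, k=2)
print(neighbors)  # prints [('Alleles', 2), ('Evolution', 1)]
```

With k = 24, the seed plus the returned terms give the 25 labels that are mapped.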
Mapping only 25 terms at a time is an arbitrary design decision with several advantages. It allows pnaslink to make maps on the fly in seconds. It affords the node labels, the indexing terms, enough room that they have little or no overlap, thus making them and their interrelationships the primary features of the display. It gives the user a rich, but not overwhelming, array of associations to work with. Finally, because of its speed, it permits users to create new maps on the basis of single or combined terms from an old map. Thus, instead of visiting different places in a global visualization, one moves locally from interest to interest by point-and-click remapping (which accords with Hearst's point in ref. 10 that an interactive system should let users change their search strategies as their goals change). One also moves by recognizing terms of interest rather than by having to guess them or look them up in a thesaurus.
The Associative Thesaurus
If the indexing terms used in the mapping are indeed controlled by a formal thesaurus, our SOMs and PFNETs provide an alternative: they display the top listings in what is sometimes called a term's associative thesaurus (2). Formal thesauri are published in hard and soft copy; associative thesauri are created ad hoc within search software. A formal thesaurus, such as NLM's MeSH, brings out a term's standard linguistic features, e.g., its definition, synonyms, hypernyms, and hyponyms. In contrast, an associative thesaurus shows what terms co-occur with it when it has been used to index actual publications (cf. ref. 14).
In NLM's MeSH, for example, the term Anthrax is related to Bacillus anthracis and subordinated to Bacterial Infections and Mycoses. But if Anthrax is mapped in our system, which covers the biomedical literature through 2002, its top co-occurring terms include Postal Service, a connection obviously never to be part of its entry in MeSH. Associative thesauri are shaped by historical contingencies, by what is being written about. That is why they may be useful for online retrieval in ways that formal thesauri are not.
Not all indexing uses subject headings and formal thesauri, of course. ISI's indexing, for example, allows searchers to retrieve the items that cite a given author. From that capability, online searchers with the right software can move to retrieving items that cite pairs of authors jointly. To people literate in a domain, frequently cocited authors may suggest nuances of meaning that are absent in standard subject indexing (for example, articles that cite both Derek de Solla Price and Diana Crane may bear on “invisible colleges” in science even if that phrase does not appear in their bibliographic records). A map of cocited authors is, in effect, an associative thesaurus of authors linked by conjoint use of their works. Again, these linkages may permit useful retrievals that are not otherwise possible (1).
Additional Capabilities
pnaslink can produce maps not only of associated MeSH terms and cocited authors but also of cocited journals. That is, if a user supplies the name of a seed journal, such as Gut or Cell, pnaslink maps the top 24 journals cocited with it in PNAS. Journal maps are most likely to be of interest to professional literature managers, such as serials librarians, whereas maps of MeSH and cocited authors are intended more for users in general.
Guided by our emphasis on user control, we have implemented several interactive functions for pnaslink. For example, the system lets the user regenerate the maps after removing some terms. This is helpful, for example, in journal mapping, when one may want to eliminate omnibus journals like Science and Nature from a map to focus on more specialized titles.
pnaslink also has alternate data models to show term relationships from different perspectives. By default, the seed term is used to generate 24 other terms, but then the counts for these pairs are obtained without reference to their counts with the seed term. However, if the user chooses the “tri-citation” option, the seed term is always required to be present with the other two, and the maps are accordingly different.
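The difference between the two data models can be made concrete with a short sketch. The function `pair_matrix`, its `seed` argument, and the sample records are hypothetical; the sketch assumes only what is stated above, namely that the tri-citation option counts a pair only in records where the seed term is also present.

```python
from itertools import combinations


def pair_matrix(terms, records, seed=None):
    """Symmetric co-occurrence counts among `terms`.

    With seed=None, every record is counted (the default model).
    With a seed, only records that also contain the seed are counted
    (one reading of the "tri-citation" option).
    """
    index = {term: i for i, term in enumerate(terms)}
    n = len(terms)
    matrix = [[0] * n for _ in range(n)]
    for record in records:
        present = set(record)
        if seed is not None and seed not in present:
            continue
        for a, b in combinations(sorted(present.intersection(index)), 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1
    return matrix


# Hypothetical records (not real PNAS data).
records = [
    ["Gene Frequency", "Alleles", "Mutation"],
    ["Alleles", "Mutation"],
    ["Gene Frequency", "Alleles", "Evolution"],
    ["Mutation", "Evolution"],
]
terms = ["Alleles", "Mutation", "Evolution"]
default_counts = pair_matrix(terms, records)
tri_counts = pair_matrix(terms, records, seed="Gene Frequency")
print(default_counts[0][1], tri_counts[0][1])  # prints 2 1
```

In this toy case the Mutation-Evolution pair has a count of 1 by default but 0 under the seed requirement, which is why the two models can yield different maps.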
Throughout the interaction process, the user can directly retrieve documents by subject through PNAS Online. Every time the user clicks on a MeSH term, it is added to a query list. When the user clicks on the find button, a separate window opens to show the documents retrieved from PNAS Online by the terms in the query list. The maps are thus a “live” interface that allows the user to interact with terms to see what documents they yield. (PNAS Online lacks ISI-type indexing, which prevents the cocited author retrieval possible in, e.g., our authorlink system).
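The effect of the query list can be sketched as a Boolean AND over the documents indexed with each selected term. In the system itself, retrieval is delegated to PNAS Online; the in-memory `index`, the document IDs, and the function `retrieve` below are hypothetical stand-ins.

```python
def retrieve(query_terms, index):
    """Return IDs of documents indexed with every term in the query list."""
    postings = [index.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    # Boolean AND: a document must appear under all selected terms.
    return set.intersection(*postings)


# Hypothetical index from term to document IDs (not real PNAS data).
index = {
    "Gene Frequency": {1, 2, 3},
    "Mathematics": {2, 3, 5},
    "Alleles": {3},
}
query_list = ["Gene Frequency", "Mathematics"]
print(sorted(retrieve(query_list, index)))  # prints [2, 3]
```

Each further term added to the query list can only narrow the retrieved set.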
Two Modes of Mapping
PFNETs and SOMs are dimension-reduction techniques that have been used to visualize the structure of literatures for more than a decade. In the context of the movement joining bibliometrics with document retrieval (2, 5, 10), PFNETs have been described by Fowler and colleagues (15-17), McGreevy (18), and Chen (19, 20). Analogous accounts of SOMs have been done by Lin et al. (21), Roussinov and Chen (22), and Chen et al. (23).
PFNETs. Characterizing PFNETs, Börner et al. (5) write, “Pathfinder algorithms take estimates of the proximities between pairs of items as input and define a network representation of the items that preserves only the most important links.” Our input is pairs of terms, and the pairs are linked as output only if their co-occurrence counts are the highest (or tied-highest) in their respective vectors. By emphasizing only the most prominent links, PFNETs reduce the user's cognitive load in interpreting the most important relationships depicted in the map. These relationships in particular are highlighted as potentially fruitful for retrievals.
PFNETs were developed to portray the results of studies in which subjects' judgments of the closest semantic items were represented by the lowest weights. That is, the algorithm selects the lowest-weight (also called minimum-distance or minimum-cost) paths to render the most salient ties. However, in our matrices the closest connections are signaled by the highest co-occurrence counts. The counts therefore require a transformation (subtraction from a constant) to convert them to a distance measure before PFNETs are actually plotted.
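A minimal sketch of such a transformation follows. The text specifies only subtraction from a constant; the particular constant used here (one more than the largest off-diagonal count, so that every distance is positive) is an assumption, as is the function name `counts_to_distances`.

```python
def counts_to_distances(counts):
    """Turn a symmetric co-occurrence matrix into a distance matrix."""
    n = len(counts)
    # Assumed constant: one more than the largest off-diagonal count.
    constant = 1 + max(
        counts[i][j] for i in range(n) for j in range(n) if i != j
    )
    return [
        [0 if i == j else constant - counts[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical counts for three terms.
counts = [
    [0, 5, 1],
    [5, 0, 3],
    [1, 3, 0],
]
distances = counts_to_distances(counts)
print(distances)  # prints [[0, 1, 5], [1, 0, 3], [5, 3, 0]]
```

The highest count (5) becomes the shortest distance (1), so the strongest association is now the lowest-weight link.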
In PFNETs, nodes represent terms, and the importance of links between them is measured by path weights, computed from term co-occurrence counts. The PFNET algorithm compares these weights over both direct (one link) and indirect (multilink) paths between nodes. It retains just those links that constitute minimum-weight paths. Such paths are required not to violate the triangle inequality d(a,c) ≤ d(a,b) + d(b,c), where d is the distance between points a, b, and c. These paths will be direct unless an indirect path is computed to be shorter.
The number of links in a PFNET is controlled by two parameters, r and q. These are set in our software so as to produce the sparsest possible network, which occurs when r equals infinity and q equals n - 1, where n is the number of nodes in the matrix.
The parameter r, which determines how path weights are computed, is lucidly explained by Fowler et al. (17): “Path weight, r, is computed according to the Minkowski r-metric. It is the rth root of the sum of each distance raised to the rth power for all links in a path between two nodes. Although the r-metric is continuously variable, simple interpretations exist only for r = 1 (path weight is the sum of the link weights in the path), r = 2 (path weight is the Euclidean distance), and r = infinity (path weight equals the maximum link weight in the path). One advantage of r = infinity is that one need only assume that the original distance estimates have ordinal properties. Another advantage is that the link structure will be preserved for any monotonic transformation of the data.”§
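The three readily interpreted cases of the r-metric in the quotation can be checked with a few lines of Python. The function name `path_weight` is ours, and the link weights 3 and 4 are arbitrary illustrative values.

```python
import math


def path_weight(link_weights, r):
    """Minkowski r-metric weight of a path, given its link weights."""
    if math.isinf(r):
        # r = infinity: the path weight is the largest link weight.
        return max(link_weights)
    # Otherwise: the r-th root of the sum of the r-th powers.
    return sum(w ** r for w in link_weights) ** (1.0 / r)


links = [3, 4]
print(path_weight(links, 1))         # prints 7.0 (sum of link weights)
print(path_weight(links, 2))         # prints 5.0 (Euclidean distance)
print(path_weight(links, math.inf))  # prints 4 (maximum link weight)
```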
The parameter q limits the number of links in the paths that are examined in the test of the triangle inequality (24); a link is removed if it violates the inequality with respect to any such path. The larger the value of q, the more extensive the triangle inequality constraint, and the more likely it is that a given link will be found to violate the rule and be dropped. If q is one less than the number of nodes, then all of the potential violators are under scrutiny.
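For the settings r = infinity and q = n - 1, the pruning rule can be sketched compactly. The code below is one common way to compute such a network, not necessarily the routine used in pnaslink: it finds, for every pair of nodes, the smallest achievable "largest link" over all paths (a Floyd-Warshall-style pass) and keeps a direct link only if no indirect path beats it. The function name `pfnet_inf` and the sample distances are hypothetical.

```python
def pfnet_inf(dist):
    """Links (i, j) of PFNET(r = infinity, q = n - 1) for symmetric distances."""
    n = len(dist)
    minimax = [row[:] for row in dist]
    # With r = infinity a path weighs as much as its largest link, and with
    # q = n - 1 paths of every possible length are taken into account.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i][k], minimax[k][j])
                if via_k < minimax[i][j]:
                    minimax[i][j] = via_k
    # Keep a direct link only if no indirect path has a smaller weight;
    # tied links are all retained.
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if dist[i][j] <= minimax[i][j]
    ]


# Hypothetical distances among three terms.
distances = [
    [0, 1, 5],
    [1, 0, 3],
    [5, 3, 0],
]
print(pfnet_inf(distances))  # prints [(0, 1), (1, 2)]
```

Here the direct link between nodes 0 and 2 (distance 5) is dropped because the two-link path through node 1 has a maximum link weight of only 3.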
The settings r = infinity and q = n - 1 are widely used in pathfinder research because they tend to produce networks that are highly intelligible simplifications of the data. An algorithm called a spring embedder (25) is used to enhance the maps by minimizing unsightly features such as crossed links and overlapping nodes. The finished map is virtually instantaneous once a seed term is entered.
SOMs. Unlike PFNETs, which explicitly join highly related terms, SOMs render semantic relationships through a distance metaphor. The more frequently co-occurring terms, which presumably have greater mutual relevance, occupy more proximate regions on the map. SOMs are designed to render not just the highest co-occurrence counts between terms, but rather relatively high co-occurrences across groups of terms. They are a softer-focus kind of mapping than PFNETs, but they, too, suggest specific combinations of terms on which the user might want to base retrievals.
The pnaslink algorithm extracts the proximity relations of data in 25 dimensions, one for each of the input terms paired with all others, and seeks to preserve them as closely as possible in 2D. This process of self-organization (also known as unsupervised learning) runs over many iterative cycles. In each iteration, the images of term pairs that are strongly related in the high-dimensional space will be moved closer on the lower-dimensional space until stability is reached.
More specifically, the 2D grid of pnaslink consists of 64 output nodes distributed in an 8-by-8 pattern. Each output node corresponds to a vector of 25 weights that are initially set as small random numbers. Each is also connected to 25 input nodes, and the latter correspond to vectors in the 25-by-25 matrix comprising all possible pairs of a seed term and the 24 terms most frequently co-occurring with it. [There are 25(24)/2 = 300 unique pairs in the matrix, and the main diagonal, consisting of terms paired with themselves, is not used.] This co-occurrence matrix is used to train the SOM.
The account of pnaslink's parent authorlink (6) describes the iterative training process as follows. A row from the co-occurrence matrix “is randomly selected and compared to every output node to determine a winner. Weights of the winning output nodes then are updated so that the next time this input node is presented, this output node will likely be selected again as the winner. In the meantime, nodes surrounding the winning node are similarly adjusted. The number of iterations needed to train a SOM is often determined empirically (in our case, we optimize the number of training cycles to 2,500). After the training, input vectors closest in the input space will map to the same regions in the output map. The regions are delineated by areas of nodes in which the elements with the highest value on the vectors are the same.”¶ SOMs, like PFNETs, usually take only a second or two to produce.
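The training loop described in the quotation can be sketched in plain Python. The grid size (8 by 8) and the number of cycles (2,500) follow the text; the learning-rate schedule, the shrinking neighborhood radius, the random seed, and the toy input rows are assumptions introduced for illustration, and all function names are ours.

```python
import random


def best_matching_unit(weights, x):
    """Grid position of the output node whose weight vector is closest to x."""
    best, best_d2 = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d2 = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best


def train_som(data, rows=8, cols=8, iterations=2500, seed=0):
    """Train a small Kohonen-style SOM; returns weights[row][col] as lists."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Each output node starts with a vector of small random weights.
    weights = [[[rng.uniform(0.0, 0.1) for _ in range(dim)]
                for _ in range(cols)] for _ in range(rows)]
    for t in range(iterations):
        remaining = 1.0 - t / iterations
        rate = 0.5 * remaining                                # assumed decay
        radius = max(1.0, max(rows, cols) / 2.0 * remaining)  # assumed shrinkage
        x = rng.choice(data)  # a randomly selected row of the input matrix
        win_r, win_c = best_matching_unit(weights, x)
        # Move the winner and its grid neighbors toward the input row.
        for r in range(rows):
            for c in range(cols):
                if (r - win_r) ** 2 + (c - win_c) ** 2 <= radius ** 2:
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += rate * (x[k] - w[k])
    return weights


def region_labels(weights):
    """For each node, the index of the largest element of its weight vector."""
    return [[max(range(len(w)), key=w.__getitem__) for w in row]
            for row in weights]


# Toy 3-term input rows scaled to [0, 1] (hypothetical, not PNAS counts).
toy = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.1], [0.0, 0.1, 1.0]]
som = train_som(toy, rows=4, cols=4, iterations=300)
labels = region_labels(som)
```

Contiguous nodes that share a label in `labels` correspond to the regions described in the quotation; with 25 terms and the default 8-by-8 grid, the same functions apply unchanged, only more slowly.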
In interpreting SOMs, points in the same area are held to be closely related. Adjacent areas reflect stronger relationships than nonadjacent areas. Terms in large areas are more influential than terms in small areas.
Examples from Population Genetics
Fig. 1A is a PFNET, and Fig. 1B is a SOM formed with the MeSH term Gene Frequency as the seed. The result is a complex, yet still radically simplified, picture of term relations in population genetics as that subject has developed in PNAS. Fig. 2 repeats the same map types with a cocited author as seed, in this case, the population geneticist Montgomery Slatkin (University of California, Berkeley), a leading researcher in the study of gene frequencies and genetic drift. In Fig. 2A, the author cocitation counts have been toggled on so that they appear above the links, an option not exercised with the term co-occurrence counts in Fig. 1A.
The two map types suggest specific terms from the literature that can be used in document retrieval. The interface of which the maps are part has been cropped away to focus on terms that are related in ways that the literature searcher often does not know in advance. Someone interested in exploring the connection between, say, Gene Frequency and Mathematics or between, say, Slatkin and Luigi Cavalli-Sforza could click on the appropriate labels and retrieve documents in which those particular conjunctions occurred. They would be documents for which Gene Frequency and Mathematics co-occur as subject headings or in which Slatkin is cocited with Cavalli-Sforza. (Further terms may be added at will.)
In Fig. 1A the main nodes in the PFNET are (from left) Alleles, Mutation, Genes (Structural), DNA, and Evolution, a transition from relatively specific to relatively general terms as one moves rightward. The seed term Gene Frequency is seen to be an offshoot of the literature on Alleles. Indeed, if Gene Frequency is required to be present as a third term in all pairings in the map (the tri-citation option mentioned above), the new map has Alleles at the center with 19 of the other terms radiating directly from it.
In the SOM in Fig. 1B, the most central term, the one whose region touches most others, is Mutation. Gene Frequency is placed near the same terms it appeared with in the PFNET, and other connections between the PFNET and the SOM can be traced, but the SOM emphasizes different relations than the PFNET. For example, the two terms for fruit flies appear apart in the PFNET, whereas the SOM brings them together at lower left.
Because Fig. 1 shows term relationships solely within PNAS, the question arises whether a mapping of Gene Frequency would differ markedly across all of the journals covered by NLM's PubMed. The latter mapping is possible through our system conceptlink. It turns out that the two maps have 13 terms in common, which demonstrates the breadth of PNAS in representing topics in genetics. (However, the PFNETs have only four links in common.) Table 2 shows the common and the unique terms. Those unique to PubMed seem more specific and more oriented toward human genetics.
Many of the MeSH terms associated with Gene Frequency in Fig. 1 appear in chapters on population genetics in introductory genetics textbooks, and they are the sort of terms that turn up in textbook glossaries (e.g., Haplotypes, Heterozygotes). Ironically, beginners at the glossary stage may know too little to profit from maps like those in Fig. 1, whereas advanced students and experts may know too much. Asked to comment on Fig. 1 as a domain expert, Slatkin said that the terms and their groupings in the two maps were intelligible, but that the MeSH terms were at such a high level of generality (e.g., Evolution, Mutation, Mathematics) that almost any way of connecting them would make some sense. (He preferred the PFNET's tighter structure to the SOM's for this reason.) He thought only mappings based on a much more specific set of seed terms, e.g., the ecology of a particular species of African millipede, would have much value for him and his students.
This is a criticism with which many people might agree, and progress in bibliographic visualizations like ours may well lie in adding capabilities to map specific natural-language “co-words” from the titles, abstracts, or full texts of documents (8, 26, 27). Possibly the chief beneficiaries of MeSH (or other controlled-vocabulary) mapping will be neither beginners nor subject experts, but “in-between” persons, such as librarians, subject indexers, science writers, journal editors, and teachers as they browse the many research areas to which they come as outsiders.
Slatkin found his own cocited author maps readily interpretable. He was acquainted with every name that appears in Fig. 2. In the PFNET (which he again preferred), he identified the main structural feature, the clusters around himself and Masatoshi Nei, as representing two slightly different subject areas. Both the Nei group and the Slatkin group, he said, have contributed to the literature on genetic flow and population structure, but the Slatkin group has contributed relatively more to the literature on microsatellites (short, repetitive sequences of DNA). Hence, the PFNET was picking up a division he found meaningful.
Many combinations of linked names in Fig. 2A are coherent in the sense that they yield sensible internet retrievals. However, a stricter test for the coherence of a particular domain is whether an expert can rapidly and accurately predict why two authors are linked. Given a random pair from Fig. 2A (Nei and R. R. Sokal), Slatkin guessed that the link between them was caused by frequent cocitation of Sokal's book Biometry with Nei's works on the computation of standard genetic distance. A subsequent retrieval of the articles cociting the pair bore this out.
The SOM in Fig. 2B picks up some of the same dyadic structure as the PFNET, such as the connections between Ohta and Kimura, Tajima and Hudson, Takahata and Griffiths, Cavalli-Sforza and Jorde, and Valdes and Weber (which may reflect coauthorships as well as cocitation ties). Slatkin and Nei remain central figures, but are joined by Avise and Templeton. Interestingly, at the lower left the SOM conjoins Wright, Mayr, and Fisher, who represent the older, pioneering generation in statistical genetics. The SOM algorithm is able to bring this out solely on the basis of their overall cocitation profiles.
Other Reactions to the Map Types
If PFNETs seem directive about term relationships, SOMs are merely suggestive. However, their greater ambiguity is perhaps a virtue. Using authorlink, the forerunner of pnaslink, Buzydlowski (9) found that SOMs outperformed PFNETs in capturing the mental models of 20 experts in selected fields of the humanities. These were SOMs and PFNETs devoted to cocited authors, exactly like those in Fig. 2.
The experts' mental models were elicited by having them sort cards bearing authors' names into intuitively meaningful piles. Their task was to show how they would group, first, the 24 authors most highly cocited with Plato (almost all quite famous) and, second, the 24 authors most highly cocited with an individual author of the expert's choice. The matrices of card-sort groupings were compared with matrices of the groupings produced by PFNET linkages and SOM positionings. For both the Plato trial, which all experts participated in, and the individual-author trials, which were unique to each expert, SOMs agreed with the card-sort data better than PFNETs. In the Plato trial, both SOMs and PFNETs were highly correlated with the pooled card-sort data (SOMs, r = 0.97; PFNETs, r = 0.78), but these correlations were significantly different at P < 0.001. In the individual-author trials, a t test of mean agreement scores favored SOMs significantly at P < 0.01. The experts were nevertheless about equally divided in their preferences for one map type over the other.
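One way such a comparison could be set up is sketched below. The study's exact procedure for building and pooling the matrices is not given here, so everything in the sketch is an assumption: groupings are coded as 1 when two authors fall in the same group and 0 otherwise, and agreement is measured by a Pearson correlation over all author pairs. The author labels and groupings are hypothetical.

```python
from itertools import combinations
from math import sqrt


def co_grouping(items, groups):
    """1 for each pair of items placed in the same group, else 0."""
    group_of = {item: g for g, members in enumerate(groups) for item in members}
    return [1 if group_of[a] == group_of[b] else 0
            for a, b in combinations(items, 2)]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical authors and groupings (not data from the study).
authors = ["A", "B", "C", "D"]
card_sort = [["A", "B"], ["C", "D"]]    # an expert's piles
map_groups = [["A", "B"], ["C"], ["D"]]  # groups read off a map
agreement = pearson(co_grouping(authors, card_sort),
                    co_grouping(authors, map_groups))
print(round(agreement, 3))  # prints 0.632
```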
In other, less formal trials, we have found that some experts object when maps of either type differ from their mental models of how the subject headings or authors in their fields are connected. With respect to this criticism, it should be borne in mind that the maps are pictures of the database. They show term associations that have developed as authors and indexers actually create literatures, in the present case, solely within PNAS, and these will often differ from the terminological hierarchies one finds in individual heads, not to mention textbooks, thesauri, or other databases (compare Table 2). In fact, the maps should be taken as new information, not as “erroneous” attempts to generate preexisting hierarchies from bibliographic data. The ongoing task is to find which types of maps and which types of terms are most useful to particular clienteles.
Footnotes
* To whom correspondence should be addressed. E-mail: whitehd{at}drexel.edu.
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Mapping Knowledge Domains,” held May 9-11, 2003, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.
Abbreviations: NLM, National Library of Medicine; ISI, Institute for Scientific Information; SOM, self-organizing map; PFNET, pathfinder network; MeSH, medical subject headings.
† These data are extracted from Science Citation Index Expanded [Institute for Scientific Information, Inc. (ISI), Philadelphia, PA; Copyright ISI]. All rights reserved. No portion of these data may be reproduced or transmitted in any form or by any means without the prior written permission of ISI.
‡ For restricted access to mapping of ISI's full databases, contact X.L. at xlin{at}drexel.edu or H.D.W.
¶ Quoted from ref. 6, Copyright 2003, with permission from Elsevier.
Copyright © 2004, The National Academy of Sciences







