## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Simplicial closure and higher-order link prediction

Edited by Duncan J. Watts, Microsoft Research, New York, NY, and accepted by Editorial Board Member Donald J. Geman October 12, 2018 (received for review January 13, 2018)

## Significance

Networks provide a powerful abstraction for complex systems throughout the sciences by representing the underlying set of pairwise interactions, but much of the structure within these systems involves interactions that take place among more than two nodes at once. While these higher-order interactions are ubiquitous, an evaluation of the basic properties and organizational principles in such systems is missing. Here we study 19 datasets from biology, medicine, social networks, and the web and characterize how higher-order structure emerges and differs between domains. We then propose a general framework for evaluating higher-order data models based on link prediction, a task in which we seek to predict future interactions from a system’s structure and past history.

## Abstract

Networks provide a powerful formalism for modeling complex systems by using a model of pairwise interactions. But much of the structure within these systems involves interactions that take place among more than two nodes at once—for example, communication within a group rather than person to person, collaboration among a team rather than a pair of coauthors, or biological interaction between a set of molecules rather than just two. Such higher-order interactions are ubiquitous, but their empirical study has received limited attention, and little is known about possible organizational principles of such structures. Here we study the temporal evolution of 19 datasets with explicit accounting for higher-order interactions. We show that there is a rich variety of structure in our datasets but datasets from the same system types have consistent patterns of higher-order structure. Furthermore, we find that tie strength and edge density are competing positive indicators of higher-order organization, and these trends are consistent across interactions involving differing numbers of nodes. To systematically further the study of theories for such higher-order structures, we propose higher-order link prediction as a benchmark problem to assess models and algorithms that predict higher-order structure. We find a fundamental difference from traditional pairwise link prediction, with a greater role for local rather than long-range information in predicting the appearance of new interactions.

Networks are a fundamental abstraction for complex systems and relational data throughout the sciences (1⇓–3). The basic premise of network models is to represent the elements of the underlying system as nodes and to use the links of the network to capture pairwise relationships. In this way, a social network can represent the friendships between pairs of people, a web graph can encode links among web pages or topic categories, and a biological network can represent the interactions among pairs of biological molecules or components (3⇓⇓–6). But much of the structure in these systems involves higher-order interactions between more than two entities at once (7⇓⇓⇓–11): People often communicate or interact in social groups, not just in pairs; associative relations among ideas or topics often involve the intersection of multiple concepts; and joint protein interactions in biological networks are associated with important phenomena (12).

These types of higher-order, group-based interactions are apparent even in the standard genres of datasets used for network analysis. For example, coauthorship networks are built from data in which larger groups write papers together, and similarly, email networks are based on messages that often have multiple recipients. While such higher-order structure is not captured by the topology of a graph, it may be modeled via a collection of formalisms that include set systems (13), hypergraphs (14), simplicial complexes (15), and bipartite affiliation graphs (7, 16). Despite the existence of mathematical formalisms for higher-order structure, there is no unifying study that analyzes the basic higher-order structure of such datasets. This is in sharp contrast to other notions of “higher-order models” generalizing graph data, such as multiplex networks (17) and higher-order Markov chain models (18, 19), which are successful but still rooted in a pairwise representation paradigm. We study the complementary direction of group interactions, as outlined in the examples above, and use the term higher-order model in this sense.

A key reason for the lack of large-scale studies in higher-order models is that data are often collected directly in a network format, thus eliminating higher-order interactions already at the data-collection stage. Another reason is that analyzing higher-order interactions can be computationally challenging for large datasets. Consequently, despite their potential importance, little is known about organizational principles of higher-order structures within real-world datasets. For instance, one question that remains to be answered is whether higher-order interactions enable us to differentiate different kinds of datasets or whether higher-order properties are universal across datasets.

Here, we provide steps in the direction of promoting a broad, rigorous study of higher-order topological interactions across domains. To this end, we study the structure and temporal evolution of 19 datasets from a variety of domains that have higher-order interactions. We find that distinct patterns for different domains are immediately revealed with three-way interaction features that are not available from the graph structure of the networks alone.

Motivated by the importance of triangular structures in network clustering and the theory of triadic closure in social networks (4, 20), we study an extension of this theory via simplicial closure or the way in which groups of nodes evolve until eventually coappearing in a higher-order structure. In this case, we find that strong previous interactions between subsets of a group increase the likelihood of a simplicial closure event, where the nodes appear in a group together. The relative importance of different types of prior interactions depends on the dataset yet remains consistent when considering groups of different sizes for a given dataset. To facilitate future modeling and demonstrate that the higher-order patterns are not simple epiphenomena of the underlying link structure, we introduce a higher-order link prediction problem—the forecasting of future higher-order interactions—as an evaluation framework for models and algorithms that aim to predict the emergence of higher-order structure from existing data.

## Structural Analysis of Higher-Order Networks

We assembled a diverse collection of 19 datasets, recording the timestamped interactions of groups of entities. Thus, each dataset is a set of timestamped sets of nodes. We call each set of nodes a simplex, and the nodes in each simplex take part in a shared interaction at a given timestamp (Fig. 1*A*). For example, in a coauthorship network, a simplex corresponds to a set of authors publishing an article at a given time.

Formally, each dataset consists of N timestamped simplices, *k* − 1)-simplex in algebraic topology, and the set of all its pairs is called a *k*-clique in graph theory.] This set-based representation provides a natural format for datasets from a range of domains. We briefly describe our datasets below (see *SI Appendix* for more complete descriptions).

• Coauthorship data (coauth-DBLP; coauth-MAG-History; coauth-MAG-Geology): Nodes are authors and a simplex is a publication; DBLP spans over 80 years and the other two datasets span about 200 years.

• Online tagging data (tags-stack-overflow; tags-math-sx; tags-ask-ubuntu): Nodes are tags (annotations) and a simplex is a set of tags for a question on online Stack Exchange forums; the data contain the complete history of the forums.

• Online thread participation data (threads-stack-overflow; threads-math-sx; threads-ask-ubuntu): Nodes are users and a simplex is a set of users answering a question on a forum; again, the data contain the complete history of the forum.

• Drug networks from the National Drug Code Directory (NDC-classes): Nodes are class labels (e.g., serotonin reuptake inhibitor) and a simplex is the set of class labels applied to a drug (all applied at one time). (NDC-substances): Nodes are substances (e.g., testosterone) and a simplex is the set of substances in a drug; datasets include the complete history of the directory.

• US. Congress data [congress-committees (21); congress-bills (22)]: Nodes are members of Congress and a simplex is the set of members in a committee or cosponsoring a bill; the committees dataset spans 1989–2003 and the bills dataset spans 1973–2016.

• Email networks [email-Enron (23); email-Eu (24)]: Nodes are email addresses and a simplex is a set consisting of all recipient addresses on an email along with the sender’s address; email-Enron spans most of the duration of a company’s lifetime, and email-Eu spans over 2 years.

• Contact networks [contact-high-school (25); contact-primary-school (26)]: Nodes are persons and a simplex is a set of persons in close proximity to each other.

• Drug use in the Drug Abuse Warning Network (DAWN): Nodes are drugs and a simplex is the set of drugs reportedly used by a patient before an emergency department visit.

• Music collaboration (music-rap-genius): Nodes are rap artists; simplices are sets of rappers collaborating on songs.

To provide uniformity across datasets, we restrict to simplices consisting of at most 25 nodes. This is relevant to, e.g., the coauthorship data in which large consortia of hundreds of authors collaborate on a single paper. However, such events are rare and not relevant for our analysis. Table 1 lists some summary statistics of the datasets. The number of unique simplices appearing in the data is minuscule compared with the total number of possible simplices. For example, in the dataset with the smallest number of nodes (email-Enron, 143 nodes), there are nearly 500 million possible simplices of size at most 5, whereas only 1,542 unique simplices appear in the dataset. On the other hand, in most datasets, the number of unique simplices is within an order of magnitude of the number of pairs of nodes that coappear in some simplex (edges in the projected graph; discussed in the next section).

### Higher-Order Features Reveal Rich Structural Diversity.

Our data representation distinguishes between the observations of different kinds of k-way interactions between a set of entities. Stated differently, unlike in a graph representation, we do not break down each simplex into a set of (induced) pairwise interactions. Although the specific representation is not essential provided the information of the group interaction is faithfully encoded, it is convenient to think of our data as an abstract simplicial complex as depicted in Fig. 1*B*.

The simple encoding of the observed information as a graph is called the projected graph. Formally, in the projected graph, two nodes are joined by an edge of weight w if they coappear in w simplices (Fig. 1*C*). A k-clique in the projected graph is a set of nodes among which an edge is present between all pairs. k cliques appear if (*i*) the k nodes were all part of some simplex or (*ii*) each pair was part of some simplex, although all k were never part of the same simplex. In the former case, we say the k nodes form a closed clique, while in the latter case we say they form an open clique.

We first study the occurrence of open and closed 3-cliques or triangles (Fig. 2). This is the simplest higher-order structure present in our datasets that is not captured by a graph. Furthermore, triangles are one of the most important structural patterns in network analysis (4, 8, 27). As discussed above, there are two types of triangles which cannot be distinguished by the weighted projected graph alone. In a closed triangle, all three nodes have coappeared in at least one simplex. Formally,

Every simplex with at least three nodes directly creates a closed triangle, while open triangles appear coincidental. Moreover, larger simplices lead to many closed triangles: For instance, a k-node simplex contributes *B*–*E*).

While the distribution of simplex sizes is broadly similar in most datasets (Fig. 2*A*), jointly analyzing the edge density in the projected graph with the fraction of triangles that are open reveals a rich landscape of datasets (Fig. 2*B*): (*i*) low density with a small fraction of triangles open (coauthorships and music collaboration), (*ii*) low density with a large fraction of triangles open (Stack Exchange threads), (*iii*) high density with a large fraction of triangles open (Stack Exchange tags, contact, bill cosponsorship), and (*iv*) high density with a medium fraction of triangles open (email, Congress committee membership, NDC substances and classes). These results are not skewed by large simplices—the landscape is broadly preserved when restricting to the three-node simplices (Fig. 2*D*).

Measuring average unweighted degree along with fraction of open triangles also reveals substantial diversity, and datasets from the same domain continue to exhibit similar features (Fig. 2*C*). Restricting the data to only three-node simplices, we find a near-linear relationship between the fraction of open triangles and the log of the average degree (Fig. 2*E*). A linear model of the data in Fig. 2*E* has *D*. This suggests that larger simplices bring diversity to the data.

### Higher-Order Egonet Features Discriminate System Domains.

The structural diversity of the datasets is also present at the local level of egonets (1-hop neighborhoods of nodes), and local statistics can identify the “system domain” of datasets. By system domain, we simply mean the categories identified in Fig. 2 that correspond to datasets recorded from the same kind of system. Our collection of datasets has five clear system domains with at least two datasets each: coauthorship, online tags, online thread coparticipation, email, and proximity contact. Using a multinomial logistic regression model to determine system domain with the fraction of triangles that are open and log of the average degree as covariates reveals clustering structure of the system domains (Fig. 3). This simple model can predict system domain with nearly 75% accuracy, compared with approximately 21% accuracy with random guessing. The prediction accuracy provides evidence that there are different organizational mechanisms at play locally for different systems. In conjunction with the structure illustrated in Fig. 2, this suggests that there is not a single “universal” setting of values for simplicial network statistics; the context underlying the network matters, but within a given context the parameters are quite stable.

We also trained models with the log of the edge density as a covariate, in addition to the log of the average degree and the fraction of triangles that are open; model accuracy mildly increased from 75% to 78% (Table 2). However, discarding the log of the average degree as a covariate decreases model accuracy to 60%, and including only edge density and average degree without the fraction of triangles that are open decreases model accuracy to 50%. The accuracy numbers are guides in how to model higher-order interaction data. For example, we conclude that the fraction of triangles that are open—a network statistic that relies on knowledge of the higher-order structure in the dataset—is a valuable covariate for identifying system domains. Thus, simple higher-order interactions should be used when analyzing or modeling such data. Furthermore, the average degree tends to be more valuable than edge density when considering local organizational mechanisms.

### A Simple Generative Model for Open and Closed Triangles.

We have now seen that there is diversity in datasets from global network statistics and that local statistics reveal system domains of the networks. We now provide a simple generative model of simplices that helps describe how diversity in the datasets might arise. The model uses the hypothesis that three-node simplices form independently with a fixed probability. While extreme, this hypothesis indeed leads to diversity in the fraction of triangles that are open. To see this, suppose that a dataset consists only of three-node simplices on n nodes, and any set of three nodes

There are two asymptotic regimes here depending on the value of b. If

We can also use the above procedure to construct datasets with a smaller edge density, while keeping the average degree fixed by patching together c replicates of one of these random datasets; this creates a dataset with c times as many nodes, but the same average degree. More formally, if a dataset with n nodes has average degree d and edge density ρ, then the union of c copies of this dataset has *B* and *D*, but this does not imply that our data were generated by this model.

## Temporal Dynamics and Simplicial Closure Events

The above analysis already reveals useful information about the organization of closed and open triangles, and studying the temporal dynamics of the networks in detail offers additional insights. A possible hypothesis for strong prevalence of open triangles would be temporal asynchrony in link creation. For example, consider three Congresspersons u, v, and w in the committee membership dataset, where u is in one committee with v and in another committee with w. If u is not reelected, there will be no opportunity for the triple of nodes to form a closed triangle, as u has effectively become inactive. An open triangle may still form if v and w are on the same committee in a future Congress. However, we find that temporal asynchrony does not explain most open triangles. Depending on the dataset, the three edges in 61.1–97.4% of open triangles have an overlapping period of activity (including 89.5% for Congress committees; *SI Appendix*).

Regardless of how open triangles are created, the three associated nodes may of course appear together in a simplex in the future as the network evolves. Deviating from our above simple model of independent creation of closed triangles, we find that many newly formed simplices in our data consist of k nodes that had previously constituted an open k-clique in the projected graph. We say that the appearance of a new simplex containing these k nodes is an instance of a simplicial closure event, i.e., the conversion of an open structure to a closed one, as illustrated in Fig. 1*D*. [Here we are building on terminology for datasets of static sets of simplices (28). The term “simplicial closure” also appears in the combinatorial topology literature but with a different meaning (29).] In the following, we investigate the simplicial closure mechanism as an organizational principle for higher-order interactions.

### Simplicial Closure on Triangles Reveals Competing Features.

Although conceptually similar, three nodes participating in a simplicial closure event is distinct from the well-known phenomenon of triadic closure events in social networks (4). A triadic closure event modifies the structure of the underlying pairwise interactions, whereas a simplicial closure event adds a new higher-order interaction without necessarily changing the pairwise structure of the projected graph.

Any induced subgraph on three nodes in the weighted projected graph can change several times before the three nodes appear in a simplex together, i.e., go through a simplicial closure event (Fig. 5). We call this the “lifecycle” of the triple of nodes. There are two changes that a triple of nodes can undergo during its lifecycle before a simplicial closure event. First, a new pairwise link can be added between two nodes u and v. This corresponds to an increase in density in this induced subgraph; e.g., the introduction of the drug Promacta adds an edge in Fig. 5*B*. Second, the projected graph edge weight between nodes u and v can increase, which we interpret as an increase in tie strength. For instance, in Fig. 5*C*, the tie strength between Gucci Mane and Young Thug increases after they collaborate on “Fell.” To simplify our analysis, we differentiate only between weak ties corresponding to a single interaction (*A*).

To get a first impression of the magnitude of these events, we examine the lifecycle of every triple of nodes that becomes an open or closed triangle in the coauth-MAG-History dataset (Fig. 5*A*). In this dataset, a closed triangle is more likely to have come from a configuration with exactly two strong tie edges (3,171 cases) than from an open triangle (328 + 779 + 722 + 285 = 2,114 cases). Most closed triangles are formed by nodes that had no previous interaction (2,732,839 cases); however, since the graph is sparse, the fraction of triples of nodes with no prior engagement that go through a simplicial closure event is small (*SI Appendix*). Additionally, if three nodes induce an open triangle with only weak ties at some point in time, then the three nodes are more likely to gain a strong tie before closure (445 cases) than to close directly from that state (328 cases).

We also analyze the probability of a simplicial closure event conditioned on the state of the three nodes in its lifecycle. To do so, we split each dataset based on the temporal order of appearance of the simplices into a training set, consisting of the first 80% of the simplices (in time) and a test set of the remaining 20% of the simplices. Formally, if *SI Appendix* contains all of the simplicial closure event probabilities).

We highlight four important findings. First, the simplicial closure event probability typically increases with additional edges (Fig. 6*A*). In other words, as the edge density of the subgraph induced by the three nodes increases, the probability of a simplicial closure event increases. We formally test this by comparing the closure probability of a fixed weighted induced subgraph configuration and the same configuration with an additional unit-weight edge for all suitable cases. The latter has a statistically significant larger simplicial closure event probability in 102 of 113 cases over all datasets and pairs of configurations, whereas the less dense structure is never significantly more likely to close (*Materials and Methods*). (Our goal here is to illustrate general trends rather than to find a single statistically significant result.) This result is consistent with both theoretical (4) and empirical (30) studies of dyadic link formation in social networks. However, several of our datasets are not social networks.

Second, the probability of a simplicial closure event typically increases with tie strength (Fig. 6*B*). We test the effect of tie strength by comparing the closure probability of a fixed weighted induced subgraph containing at least one weak tie and the same configuration where the weak tie is converted to a strong tie. Increasing the tie strength significantly increases the probability of a simplicial closure event in 82 of 113 cases over all datasets and significantly decreases the closure probability in just 6 of 113 cases (

Third, neither edge density nor tie strength dominates the likelihood of simplicial closure events (Fig. 6*C*). In the coauthorship and Congress datasets, an open triangle composed of three weak ties is more likely to close than a three-node subgraph with just two strong ties. The reverse is true for the stack exchange tags and stack exchange threads datasets. Overall, the open triangle of weak ties is significantly more likely to close than the three nodes with two strong ties in 4 of 19 datasets, whereas the opposite is true in 6 of 19 datasets (

Fourth, the results reveal varying closure dynamics over the dataset domains. In human social interactions, simplicial closure events appear to be driven by a topological form of triadic closure: Mutual acquaintance between all of the nodes in a set increases the probability of a joint interaction. In contrast, simplicial closure events in the discussion platform networks resemble transitive closure: Once there is a sufficiently strong co-occurrences of tags, they become likely to be used together.

A possible concern with our analysis is that we measured closure probabilities only at one point in time for each dataset. Furthermore, while some of our datasets represent a complete history of the network (tags, threads, NDC) and some span a long duration of time (coauthorship, music, congress-bills), a few contain only a slice of the underlying network’s dynamics (email-Eu, contact). However, we find that the closure probabilities and the results on edge density and tie strength are consistent at different points in time (*SI Appendix*).

### Simplicial Closure Properties Extend Beyond Triangles.

All four of the above findings hold for simplicial closure events on four nodes, so our results are not limited to structure on three nodes (Fig. 6 *D*–*F*). Now, a simplicial closure event is all four nodes appearing in a simplex, and tie strength is measured on three-node simplices, i.e., how often the three-node subsets of a four-node structure have appeared together in a simplex (0, “open”; 1, “weak”; or at least two times, “strong”).

To measure the effect of edge density, we compare the closure probability of a configuration consisting of a fixed number of edges to the closure probability of the same configuration with an additional edge, keeping the tie strengths fixed (Fig. 6*D* shows one such comparison). In 180 of 228 applicable comparisons over all datasets, the closure probability significantly increases with the edge density and significantly decreases in only 2 cases (*E* shows a case where the tie strength increases from open to weak). The closure probability significantly increases with simplicial tie strength in 26 of 38 cases for three-edge configurations, 31 of 38 cases for four-edge configurations, 77 of 114 cases for five-edge configurations, and 177 of 359 cases for six-edge configurations, compared with a significant decrease in closure probability in just 2 of 38, 1 of 38, 1 of 114, and 4 of 359 cases (

There is also tension between the influence of sparser configurations with strong ties and that of denser configurations with weak ties. Fig. 6*F* shows one such comparison. In this case, three of five datasets for which edge density is significantly more indicative than tie strength in the three-node comparison of Fig. 6C, edge density is also significantly more important in the four-node case (

## Higher-Order Link Prediction

Thus far, we have shown that higher-order interactions provide a rich source of additional information beyond traditional network modeling. Our analysis leaves open many questions, such as the development of better mechanistic models for the emergence of these interactions. To facilitate this process, we propose an analog of link prediction for higher-order structure.

### Model Evaluation Framework.

The basic premise in link prediction—whether pairwise or higher order—is to use structural network properties up to some time t to predict the appearance of new interactions after t. In traditional network analysis, link prediction is a cornerstone problem and a highly successful evaluation framework for comparing different models via a well-calibrated prediction task (32, 33). Specifically, link prediction examines data that evolve over time and sees how well a given model predicts the appearance of new links—for example, new coauthorships appearing in a coauthor network or new messages between pairs of people in an email network.

In this context, a model is interpreted broadly and may be mechanistic [e.g., preferential attachment (34)], statistical [e.g., probabilistic hierarchical models (35)], or implicitly encapsulated by a principled heuristic algorithm. For instance, personalized PageRank is a model capturing the fact that a large number of walks between two nodes drive up the connection probability between them (32). A key advantage of link prediction as an evaluation framework is precisely that it can handle these various kinds of models. This holds even in the absence of a likelihood expression, which would be required for a more standard statistical evaluation of goodness of fit. While ultimately we may want to arrive at a generative, causal description of the emergence of higher-order patterns, the flexibility of link prediction enables us to probe the importance of features of the network data in a simple manner without having to create a formal statistical model.

Link prediction has proved valuable for methodological reasons and also in concrete applications. Methodologically, asking whether one model is better than another at predicting new links provides a data-driven way of assessing the effectiveness of the models (32, 36, 37). Link prediction also has a number of direct applications that cut across disciplines, including predicting friendships in social networks (38), inferring new relationships between genes and diseases (39), and suggesting novel connections in the scientific community (40).

Link prediction is also used within model selection tools for evaluating community detection algorithms (41, 42). In these cases, link prediction may be interpreted as the smallest possible test for the fit of a model as we need to predict only one edge at a time. However, if one were to consider all edges in a cross-validation assessment, good link prediction performance indicates a good model fit for other structures in the data. Our higher-order link prediction task probes a larger set of features, in that it requires us to be able to predict more aspects of the data (any higher-order interaction, in principle).

For simplicity of presentation and scalability reasons, we predict simplicial closure events on triples of nodes. Thus, the higher-order link prediction problem examined here is predicting which triples of nodes that have not yet appeared in a simplex together will be a subset of some simplex in the future. Our above analysis suggests that open triangles or triples of nodes with strong ties are the most likely to close in the future. For our experiments, we predict which open triangles will go through a simplicial closure event in the future. Thus, this is a problem completely ignored by traditional link prediction, which would just view the triangle as already part of the graph. From a computational view, this restriction also makes it feasible to enumerate all open structures upon which the algorithms will make a prediction, using only modest computational resources. Thus, we avoid a common problem in link prediction of how to pare down an enormous candidate set of potential links, which itself is an active research topic (43, 44).

### Simple Local Features Predict Well.

We first split the data into training (first 80% of simplices in time) and test (final 20%) sets. Then, we evaluated the prediction performance of several models (several inspired from classical link prediction) on each dataset by the area under the precision-recall curve (AUC-PR) metric (Table 3). We use random scores as a baseline, which, with respect to AUC-PR, corresponds to the proportion of open triangles in the training set that go through a simplicial closure event in the test set.

We compare eight models here and provide additional comparisons in *SI Appendix*. Three are heuristics based on our finding that tie strength is indicative of closure; these are the harmonic, geometric, and arithmetic means of the three edge weights in the open triangle. Two more are based on the Adamic–Adar model (45) and the preferential attachment model. The latter has been suggested as a growth mechanism of coauthorship networks (20, 34). Two are based on longer path counts (Katz and personalized PageRank), which are models known for providing good prediction in dyadic link prediction (32). Finally, we use a supervised logistic regression model based on features from the other models.

No single model performs the best over all datasets, but our proposed baseline algorithms can achieve much better performance than randomly guessing which open triangles go through a simplicial closure event. In the threads datasets, we achieve between one and two orders of magnitude performance improvements with the harmonic and geometric means, which indicates that local tie strength is relatively more important for these datasets than for others. The absolute performance of the algorithms is far from perfect (*SI Appendix*), as the higher-order link prediction is challenging. This finding is consistent with recent research on subgraph prediction in pairwise networks (46). However, our goal here is to identify some of the important structural features of the problem, rather than to predict with perfect accuracy.

The harmonic and geometric means of edge weights perform well across many datasets, which further highlights the importance of tie strength in predicting simplicial closure events. This finding is fundamentally different from traditional link prediction with pairwise interactions (i.e., for the edges in a graph). In traditional link prediction, a key principle is that it is valuable to use information contained in paths of nontrivial length between two nodes u and v for predicting a link between them—for example, PageRank and Katz measures are effective (32, 33). In this sense, higher-order link prediction is fundamentally more local in its overall structure. This arises from the ability of a k-tuple of nodes, for

The arithmetic mean performs the worst of the three means in all but one dataset. We further analyze the performance of edge weight means using the generalized mean with parameter p as score functions: *i*) unimodal in p, (*ii*) maximized for *iii*) better for *C*). Therefore, a single strong edge could provide the signal for closure, in which case a larger p is a better score function.

The supervised learning approach also performs well broadly, especially in the larger datasets such as the coauthorship datasets, which have sufficient training data to learn a good model. However, even when including the features of the other models, the method does not always perform the best. This is likely a case of overfitting (47). In the case of the congress-bills data, the supervised method captures a unique feature of this dataset—nodes appearing in fewer simplices are more likely to go through a simplicial closure event. This is possibly due to the ambition of junior Congresspersons. The fact that combinations of features prove effective in many domains highlights the richness of the underlying problem, and the array of methods and findings presented here can guide progress on better models.

## Discussion

The dyadic network modeling paradigm has been successful but fails to capture natural higher-order interactions. Here, we established the foundation for analyzing the basic structure of temporal networks with higher-order structure. We found rich structural variety in our datasets in terms of the fraction of triangles that are open, the average degree, and the edge density. Local statistics at the level of egonets can identify system domain, which suggests that these features are key to the organizing principles of the systems. Recent research shows the small fraction of triangles that are open in coauthorship networks (28); our results are consistent but reveal that open triangles are extremely common in other domains. Prior research has also identified the distinction between open and closed triangles when projecting bipartite networks but has not studied the idea of simplicial closure events (7, 48).

We found that common principles from dyadic network evolution also hold for higher-order structure; namely, tie strength and edge density are positive indicators of simplicial closure events among sets of three and four nodes. However, there is tension between these features—the more influential feature depends on the dataset, suggesting different mechanisms for simplicial closure events. For example, edge density matters more in human interaction, but tie strength matters more for tagging on online discussion platforms.

Higher-order link prediction provides a general methodology for evaluating models in any data where higher-order structure evolves over time, such as predicting which sets of authors will write a paper together or which sets of people will appear as joint recipients on an email. We anticipate that higher-order link prediction will validate emerging higher-order network modeling techniques, such as multipartite networks (49), metapaths (50), and embeddings (51), and connect to ideas in computational topology, such as random walks on simplicial complexes (52, 53). Related higher-order models for different data (18, 19) can also use higher-order link prediction for model evaluation. For example, in the absence of temporal information, higher-order link prediction could be used to find missing data, similar to how dyadic link prediction can find missing data in static networks (35). Our higher-order link prediction framework also provides a way to study more sophisticated models where the underlying network is also dynamic, e.g., with arrival and departure of nodes. Specifically, such models should be able to predict higher-order links.

Our prediction problem examined a structure that is not even considered in traditional network analysis, where no distinction is made between open and closed triangles. From this setup, we found that simple local measures (generalized means of edge weights) are effective predictors. This finding differs from traditional link prediction, where long paths are important (32), and suggests that the temporal evolution of higher-order network data is fundamentally different from dyadic network evolution.

## Materials and Methods

### System Domain Prediction from Egonet Statistics.

We computed (*i*) the fraction of open triangles, (*ii*) the log of the average degree in the projected graph, and (*iii*) the log of edge density in the projected graph of 100 egonets sampled uniformly at random (without replacement) from all egonets containing at least one open or closed triangle in each of 13 datasets categorized as coauthorship, stack exchange tags, stack exchange threads, email, or contact. Using 80 samples from each of the 13 datasets as training data, we trained an

### Hypothesis Testing for Simplicial Closure Event Probabilities.

Let

### Data and Software.

Data collection details are in *SI Appendix*. Software is available at https://github.com/arbenson/ScHoLP-Tutorial. Datasets have been deposited in the GitHub repository, https://github.com/arbenson/ScHoLP-Data.

## Acknowledgments

We thank Mason Porter and Peter Mucha for providing the Congress committees dataset. We thank Paul Horn, Gabor Lippner, and Jarosław Błasiok for helpful discussion. This research was supported in part by a Simons Investigator Award. A.R.B. received funding from NSF Award DMS-1830274. R.A. was supported in part by a Google scholarship and a Facebook scholarship. A.J. received funding from the Vannevar Bush Fellowship from the office of the Secretary of Defense. M.T.S. received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant 702410.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: kleinber{at}cs.cornell.edu.

Author contributions: A.R.B., R.A., M.T.S., A.J., and J.K. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. D.J.W. is a guest editor invited by the Editorial Board.

Data deposition: Datasets have been deposited in the GitHub repository, https://github.com/arbenson/ScHoLP-Data. The software has been deposited in the GitHub repository, https://github.com/arbenson/ScHoLP-Tutorial.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1800683115/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵
- ↵
- Easley D,
- Kleinberg J

- ↵
- ↵
- ↵
- Deane CM,
- Salwiński Ł,
- Xenarios I,
- Eisenberg D

- ↵
- ↵
- Newman MEJ,
- Watts DJ,
- Strogatz SH

- ↵
- Milo R, et al.

- ↵
- Ugander J,
- Backstrom L,
- Marlow C,
- Kleinberg J

- ↵
- Benson AR,
- Gleich DF,
- Leskovec J

- ↵
- ↵
- ↵
- Frankl P

- ↵
- Berge C

- ↵
- Hatcher A

- ↵
- ↵
- ↵
- Xu J,
- Wickramarathne TL,
- Chawla NV

- ↵
- ↵
- Newman MEJ

- ↵
- Porter MA,
- Mucha PJ,
- Newman MEJ,
- Warmbrand CM

- ↵
- Fowler JH

- ↵
- Klimt B,
- Yang Y

- ↵
- Paranjape A,
- Benson AR,
- Leskovec J

- ↵
- ↵
- ↵
- Kossinets G,
- Watts DJ

- ↵
- Patania A,
- Petri G,
- Vaccarino F

- ↵
- Bertrand G

- ↵
- Leskovec J,
- Backstrom L,
- Kumar R,
- Tomkins A

- ↵
- Backstrom L,
- Huttenlocher D,
- Kleinberg J,
- Lan X

- ↵
- ↵
- ↵
- ↵
- ↵
- Grover A,
- Leskovec J

- ↵
- Santolini M,
- Barabási AL

- ↵
- Backstrom L,
- Leskovec J

- ↵
- ↵
- Tang J,
- Wu S,
- Sun J,
- Su H

- ↵
- Ghasemian A,
- Hosseinmardi H,
- Clauset A

- ↵
- Kawamoto T,
- Kabashima Y

- ↵
- Ballard G,
- Kolda TG,
- Pinar A,
- Seshadhri C

- ↵
- Sharma A,
- Seshadhri C,
- Goel A

- ↵
- Adamic LA,
- Adar E

- ↵
- Meng C,
- Mouli SC,
- Ribeiro B,
- Neville J

*AAAI Conference on Artificial Intelligence*. Available at https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16941. Accessed October 24, 2018. - ↵
- Friedman J,
- Hastie T,
- Tibshirani R

*The Elements of Statistical Learning*, Springer Series in Statistics (Springer, New York), Vol 1. - ↵
- ↵
- Lind PG,
- Herrmann HJ

- ↵
- Sun Y,
- Han J,
- Aggarwal CC,
- Chawla NV

- ↵
- Goyal P,
- Ferrara E

- ↵
- Mukherjee S,
- Steenbergen J

- ↵
- Parzanchevski O,
- Rosenthal R

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Physical Sciences
- Computer Sciences