# Block models and personalized PageRank


Edited by Ronald L. Graham, University of California, San Diego, La Jolla, CA, and approved November 14, 2016 (received for review July 12, 2016)

## Significance

Methods based on PageRank have been fundamental to work on identifying communities in networks, but, to date, there has been little formal basis for the effectiveness of these methods. We establish a surprising connection between the personalized PageRank algorithm and the stochastic block model for random graphs, showing that personalized PageRank, in fact, provides the optimal geometric discriminant function for separating the communities in stochastic block models over a wide class of functions. Building on this result, we develop stronger classifiers that, although scalable, are competitive with computationally much more demanding methods such as belief propagation.

## Abstract

Methods for ranking the importance of nodes in a network have a rich history in machine learning and across domains that analyze structured data. Recent work has evaluated these methods through the “seed set expansion problem”: given a subset of nodes in a network, identify the larger community to which those nodes belong.

The challenge of contextually ranking nodes in a network has emerged as a problem of canonical significance in many domains, with a particularly rich history of study in social and information networks (1–4). An active line of recent work has focused on the problem of “seed set expansion” in networks (5–11), a fundamental version of node ranking with the following natural definition.

In the seed set expansion problem, we are given a graph together with a small “seed set” of nodes drawn from a latent community, and the goal is to identify the rest of that community.

A recent focus in the work on this problem has been the power of approaches based on random-walk methods, including versions of “personalized PageRank” (11–13) and physical analogs based on the heat equation for graphs (14, 15). These techniques can be viewed as operating on the following quantities: for each node, the probability that a random walk of a given length starting from the seed set ends at that node; these are the node's “landing probabilities.”
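As a concrete illustration (a sketch of our own, not code from the paper), the landing probabilities for a seed set can be computed by repeated multiplication with the row-stochastic transition matrix; the function name and the dense-matrix representation are illustrative choices:

```python
import numpy as np

def landing_probabilities(A, seed, k_max):
    """k-step landing probabilities of a random walk started at the seed set.

    A is a dense adjacency matrix; seed is a list of seed-node indices.
    Returns a (k_max, n) array whose row k-1 holds the k-step probabilities.
    """
    n = A.shape[0]
    deg = A.sum(axis=1)
    W = A / deg[:, None]          # row-stochastic transition matrix
    p = np.zeros(n)
    p[seed] = 1.0 / len(seed)     # uniform start on the seed set
    probs = []
    for _ in range(k_max):
        p = p @ W                 # one random-walk step
        probs.append(p.copy())
    return np.array(probs)
```

Each row of the result is a probability distribution over nodes, so the rows of the output sum to one.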

These methods are elegant in their formulation and have also been shown to be both quite powerful and scalable (15, 17, 18). At the same time, the success of these methods has left open a number of very basic questions. In particular, if we think of the “landing probabilities” at a node as a set of basic features, then personalized PageRank and the heat kernel each rank nodes by a particular fixed linear combination of these features, and it is natural to ask what justifies those particular combinations.

Motivations for the specific form of these two scores have come from several domains. These include the “random surfer model” for PageRank (1) consisting of a mixture of random-walk steps and random jumps, as well as results connecting both PageRank and heat kernel quantities to bounds on sparse cuts (19–22) and regularized solutions to the min-cut problem (23). These latter methods suggest optimization formulations different from our results here; we discuss the comparison in more detail in *SI Appendix*. Finally, previous work has also shown that the two-step landing probability can, by itself, be an effective basis for seed set expansion.
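Both scores can be sketched as fixed linear combinations of the landing probabilities: PageRank applies geometric weights and the heat kernel applies Poisson weights. These weight formulas are the standard ones; the helper names are our own:

```python
import numpy as np
from math import exp, factorial

def ppr_weights(alpha, k_max):
    # Geometric weights of personalized PageRank: (1 - alpha) * alpha^k.
    return np.array([(1 - alpha) * alpha**k for k in range(1, k_max + 1)])

def heat_kernel_weights(t, k_max):
    # Poisson weights of the heat kernel: e^{-t} * t^k / k!.
    return np.array([exp(-t) * t**k / factorial(k) for k in range(1, k_max + 1)])

def combined_score(landing_probs, weights):
    # Per-node score: a linear combination of the k-step landing
    # probabilities (rows of landing_probs, one row per step count).
    return weights @ landing_probs
```

The question the paper pursues is precisely whether these particular weightings, among all possible ones, are the right choice.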

Is there a principled reason why the expressions for PageRank or the heat kernel represent the “right” way to combine the information coming from random walks, or could there be better approaches? Furthermore, is there a formal framework available for deriving or at least motivating effective ways of combining random walk probabilities? Given the diverse and important applications where PageRank and heat kernel methods have seen successes, we consider a broader examination of the space of methods for combining available random walk information, appreciating that the approaches in existing work are simply particular points in that space.

The key observation we pursue in this work is that a basic model of separable structure in graphs known as the “stochastic block model” (SBM) (26) can be used to model the presence of a seed set in the graph, allowing us to derive principled methods for ranking nodes in the space of landing probabilities. We focus on a two-block SBM, where one block of nodes corresponds to the community of a labeled seed set, whereas the other block corresponds to its complement, the remainder of the graph. In this setting, the problem of finding the hidden community of interest has a correct answer with respect to an underlying graph generation process, and hence methods for combining landing probabilities of random walks can be evaluated according to their ability to find this answer.

For this two-block SBM, we first derive the centroids, for each block, in the space of landing probabilities. Studying this space, we make the surprising observation that the optimal hyperplane for performing a linear sweep between the two centroids concentrates, for large graphs, on the weights of personalized PageRank, for a specific choice of the PageRank parameter corresponding to parameters of the SBM. This connection between personalized PageRank and SBMs is a bridge between two otherwise disconnected literatures and gives a strong motivation for using personalized PageRank in contextual ranking and seed set expansion problems.

Beyond simple linear discriminant rules, we observe that block models can be used to propose more advanced scoring methods in the space of landing probabilities, and our analysis points to important ways in which personalized PageRank can be strengthened. Although its geometric orientation is optimal with respect to the landing probability centroids, personalized PageRank does not account for the variance or covariance of these landing probabilities (e.g., how the two-hop landing probabilities from a given seed correlate with the three-hop probabilities from that seed). We derive weights that correctly incorporate these variances and covariances and show that for the SBM benchmark, this richer family of measures significantly outperforms personalized PageRank.

## Discriminant Functions for SBMs

The stochastic block model (26), also known as the planted partition model (27, 28), is a distribution over graphs that generalizes the Erdős–Rényi random graph model: nodes are partitioned into blocks, and each edge appears independently with a probability that depends only on the blocks of its two endpoints.
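For concreteness, a minimal sampler for the two-block case might look as follows; the function, its equal-block convention, and its parameter names are our own illustration, not the paper's code:

```python
import numpy as np

def sample_two_block_sbm(n, p_in, p_out, rng=None):
    """Sample an undirected two-block SBM on n nodes with equal-size blocks.

    Nodes 0..n//2-1 form block 0, the rest block 1; within-block edges
    appear with probability p_in, between-block edges with probability
    p_out. Returns the symmetric 0/1 adjacency matrix and block labels.
    """
    rng = np.random.default_rng(rng)
    labels = np.repeat([0, 1], [n // 2, n - n // 2])
    # Edge probability depends only on whether the endpoints' blocks match.
    prob = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # independent coin flips
    A = (upper | upper.T).astype(int)                 # symmetrize, no self-loops
    return A, labels
```

With `p_in > p_out`, the model is assortative: nodes connect more densely within their block than across blocks, which is the regime of interest for seed set expansion.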

An SBM is thus completely described by the number of nodes, the assignment of nodes to blocks, and the matrix of within- and between-block edge probabilities.

We will first place particular focus on two-block models (C = 2), where one block corresponds to the community containing the seed set and the other to the remainder of the graph.

To perform our classifications, we will focus on two particular classes of discriminant functions: geometric discriminant functions and Fisherian discriminant functions. “Geometric discriminant functions” perform a linear sweep through the feature space along the direction from one centroid to the other, whereas “Fisherian discriminant functions” additionally account for the covariance structure of the features around those centroids.
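A minimal sketch of the geometric rule (an illustration of our own, with our own function name): score each node's landing-probability feature vector by its signed projection onto the direction between the two centroids, so that nodes past the midpoint toward the in-class centroid score positively:

```python
import numpy as np

def geometric_discriminant_scores(features, centroid_in, centroid_out):
    """Score points by their projection onto the line between two centroids.

    features: (n, k) array of per-node landing-probability features.
    Higher scores indicate closer alignment with the in-class centroid.
    """
    w = centroid_in - centroid_out           # direction of the linear sweep
    midpoint = (centroid_in + centroid_out) / 2
    return (features - midpoint) @ w         # signed projection onto w
```

Ranking nodes by this score and sweeping a threshold recovers the family of linear classifiers whose optimal orientation the paper analyzes.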

### Geometric Discriminant Functions.

Recall that

Let

### PageRank and SBMs: Two Identical Communities.

We now establish an asymptotic equivalence between personalized PageRank and geometric classification of SBMs in the space of landing probabilities, the main theoretical result of our work. We begin by stating and proving our results for the special case of an SBM with two identical communities. We then state the more general connection between personalized PageRank and the geometric discriminant weight vector for SBMs with C blocks, with proofs in *SI Appendix*.

**Proposition 1.** *Let G be an n-node graph generated from a two-block SBM with equally sized communities, with within-block edge probability p and between-block edge probability q. Let w be the geometric discriminant weight vector in the space of landing probabilities (1-step through k-step, k fixed) between the empirical block centroids.*

*For any* *, there exists an* *sufficiently large such that* *with probability at least* *, where:*

*(a)* *is a vector with the* *coordinate specified by* *, where* *and *

*with initial conditions*

*and*

*(b) This recurrence relation can be solved exactly, leading to a closed-form expression for* *:**and thus the geometric discriminant weight vector* *is asymptotically equivalent to* *for* *.*

The above proposition relies on the following lemma.

**Lemma 1** (Concentration for C = 2 Identical Blocks).

*Let *.

*For any *

*where*

*and* B_k *are the solutions to the matrix recurrence relation*

*with*

*and* B_0 = 0.

A proof of the lemma is given in *SI Appendix*. With the lemma as given, we now prove Proposition 1.

*Proof (of Proposition 1)*. First, we will use the lemma to show that the coordinates of the weight vector

As a result, whenever this containment holds we have that

This expression can be simplified by recasting the central quantities in terms of the geometric discriminant weight vector

This expression furnishes us with a coordinate-wise bound for each of the

We have established the asymptotic equivalence between the geometric discriminant weight vector and Eq. **1**. Next, we identify a closed-form expression for the diagonalization of Eq. **1** and then simplify.

Finally, we have that

Personalized PageRank with

A few remarks are in order. First, the factor

The simple expression

### PageRank and SBMs: *C* Blocks.

We now state a second proposition that offers a more general connection between the geometric discriminant weight vector and personalized PageRank for arbitrary SBMs with C blocks; the proof is given in *SI Appendix*.

**Proposition 2.** *Let *.

*For any *

*(a) *.

*(b) If* D *is as defined above, the linear homogeneous matrix recurrence and the closed-form solution for* Ψ *follow from diagonalizing the corresponding 2 × 2 matrix.*

*(c) Assuming that the blocks are identically distributed, the expression simplifies, in which case the geometric discriminant weight vector is asymptotically equivalent to personalized PageRank.*

As with Proposition 1, this proposition makes heavy use of a concentration lemma analogous to Lemma 1, stated and proven as Lemma 2 in *SI Appendix*. We also use an additional lemma, Lemma 3, also stated and proven in *SI Appendix*. The result establishes ways in which the optimality of personalized PageRank holds even when the two communities have nontrivial substructure.

A major consequence of Proposition 2 is that the asymptotically optimal geometric discriminant function for an arbitrary SBM, without any specification of balanced block sizes or expected degrees, can be derived from an uncomplicated recurrence determined by the model parameters.

In Fig. 1, we see that the asymptotic centroids show near-perfect agreement with the empirical centroids for an example block model (parameters given in *SI Appendix*). We also see that the empirical variance of the coefficients can be highly nonuniform, with one-step landing probabilities exhibiting much greater variance than landing probabilities after subsequent steps. This nonuniform variance motivates our next approach, where we explicitly consider these heterogeneous variances, as well as covariances between the landing probabilities of different step counts.

### Fisherian Discriminant Functions.

The above geometric approach is a special case of the more general probabilistic approach to deriving discriminant functions proposed by Fisher, and we will now derive such functions that consider both the centroids and covariances of the sets of landing probabilities.

A Fisherian discriminant function captures the first two moments (mean and variance) of the landing probabilities for each class (the in-class and the out-class). The classes are described by multivariate Gaussians fit to these moments.

Following a standard derivation from quadratic discriminant analysis, Eq. **11** provides a quadratic discriminant function for ranking in-class membership in a manner that accounts for the covariance structure of the different landing probabilities (e.g., how the landing probability at a node after two steps covaries with its landing probability after three steps).
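A hedged sketch of such a quadratic rule, written as the standard QDA log-likelihood ratio between two Gaussians (our own notation and function; the paper's Eq. **11** uses model-specific means and covariances):

```python
import numpy as np

def quadratic_discriminant_scores(features, mu_in, cov_in, mu_out, cov_out):
    """Log-likelihood-ratio scores under two multivariate Gaussians.

    Standard QDA rule: score(x) = log N(x; mu_in, cov_in)
                                - log N(x; mu_out, cov_out).
    Covariance matrices are assumed nonsingular.
    """
    def log_gauss(X, mu, cov):
        d = X - mu
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        # Quadratic form d^T cov^{-1} d, evaluated row-wise.
        quad = np.einsum('ij,jk,ik->i', d, inv, d)
        return -0.5 * (quad + logdet + mu.size * np.log(2 * np.pi))

    return log_gauss(features, mu_in, cov_in) - log_gauss(features, mu_out, cov_out)
```

When the two covariance matrices are equal, the quadratic terms cancel and the rule collapses to a linear discriminant, which is the connection back to covariance-adjusted linear scoring.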

If we assume a common covariance for the two classes, Eq. **11** reduces to the linear form of Eq. **13**. Discriminant rules of the form of Eq. **13**, rather than Eq. **12** (let alone quadratic functions with the form of Eq. **11**), are appropriate for graphs where the structure can reasonably be motivated as coming from an SBM with two classes of blocks.

Finally, we note that all of the above theoretical derivations assume known parameters; in *SI Appendix*, we describe methods that can learn the parameters of an SBM from data, informing the choice of discriminant weights.

## Computational Results

We now evaluate our optimal geometric and Fisherian approaches to seed set expansion on graphs that have been generated by SBMs, where ground truth community structure is defined from the generative process (*Materials and Methods*). Exact recovery of planted partitions has been shown to be possible in regimes where the two blocks are well separated (where the within-block and between-block edge probabilities differ sufficiently).

Fig. 2*A* shows the Pearson correlation between the scores produced by each method and the ground-truth block memberships.

In Fig. 2 *B* and *C*, we see cumulative recall curves for an SBM graph with equal block sizes and two choices of parameters (details in *SI Appendix*). The personalized PageRank classification is ranked according to Eq. **12**, whereas Lin-SBMRank and Quad-SBMRank have the forms of Eqs. **13** and **11**, respectively. The recall curves measure the recall of the classification methods seeded with a single-node seed set and attempting to return the seed's block. In *SI Appendix*, we present further evaluations of the methods, varying the average degree relative to the number of nodes.
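The cumulative recall curves used here can be computed from any ranking as follows (a small utility of our own, not from the paper): sort nodes by score and track what fraction of the true block has been recovered at each list depth.

```python
import numpy as np

def cumulative_recall(scores, in_class):
    """Recall of the in-class nodes as a function of ranked-list depth.

    scores: per-node scores (higher = ranked earlier).
    in_class: boolean mask marking the ground-truth block members.
    Returns an array r where r[i] is the fraction of in-class nodes
    recovered within the top i+1 ranked nodes.
    """
    order = np.argsort(-scores)              # descending by score
    hits = np.asarray(in_class, dtype=float)[order]
    return np.cumsum(hits) / hits.sum()
```

A perfect classifier's curve reaches 1.0 at a depth equal to the block size; curves that rise faster indicate better seed set expansion.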

Our interest is in classification in the space of landing probabilities. Belief Propagation, which does not operate in this framework, has been recently shown to achieve correlated classification up to the resolution limit (35–37), as have modified spectral methods applied to the nonbacktracking Laplacian (38) and the Bethe Hessian (39). We thus take Belief Propagation as a benchmark for optimal performance, but at the cost of high computational complexity. For a more comprehensive discussion of Belief Propagation, see *SI Appendix*. The walks underlying our landing probabilities are standard random walks, and it is interesting to consider whether landing probabilities of alternative random walks (e.g., nonbacktracking ones) may yield even better approaches.

## Discussion

This work contributes a principled motivation for personalized PageRank’s success in contextually ranking nodes in graphs. Personalized PageRank and heat kernel methods both rank using linear discriminant functions in the space of random walk landing probabilities. We show that personalized PageRank in fact arises as the optimal geometric discriminant function in the space of landing probabilities for classifying nodes in a hidden seed community in an SBM. Building on this connection between SBMs and personalized PageRank, we develop more complex covariance-adjusted linear and quadratic approaches to classification in the space of landing probabilities. These classifiers dramatically outperform personalized PageRank and heat kernel methods for recovering seed sets in SBMs.

The connection between personalized PageRank and SBMs is surprising, and we see it pointing toward a wide range of related research questions. Our work here has been focused on theoretical results and synthetic networks, but an interesting further direction would be to explore our covariance-adjusted methods on datasets arising in the real world. In *SI Appendix*, we illustrate an initial step in this direction, evaluating the methods on two datasets with a two-block structure: a network of political blogs from two parties and a network of web hosts from two different domains.

There is a wide range of further theoretical questions as well. Can the recent rigorous results for the resolution limit of SBMs (36) provide insights into a broader class of contextual ranking problems? Is there an alternative classification algorithm (or alternative graph model) where heat kernel methods emerge as optimal? Can other new or existing machine-learning or ranking methods be motivated through principled analyses of structured models? There are also a host of further questions that would serve to improve the details of the specific approach that we outline here. Can the joint distribution of random walk landing probabilities be modeled more explicitly than by a multivariate Gaussian (approximating just the first two moments)? The potential application of our quadratic discriminant classifier to diverse contextual ranking problems suggests revisiting the broad range of applied problems where PageRank has found success.

## Materials and Methods

Our evaluations examine the Pearson correlation between method scores and ground-truth block labels; details are given in *SI Appendix*.

## Acknowledgments

This work was supported, in part, by a Simons Investigator Award and grants from the Army Research Office, Google, and Facebook.

## Footnotes

¹To whom correspondence should be addressed. Email: kleinber{at}cs.cornell.edu.

Author contributions: I.M.K., J.U., and J.K. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1611275114/-/DCSupplemental.

## References

1. Page L, Brin S, Motwani R, Winograd T
2. 
3. Gupta P, et al.
4. Gleich DF
5. Andersen R, Lang KJ
6. Bagrow JP
7. Mehler A, Skiena S
8. Riedy J, Bader DA, Jiang K, Pande P, Sharma R
9. Whang JJ, Gleich DF, Dhillon IS
10. Yang J, Leskovec J, *Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics* (Association for Computing Machinery, New York), Article No. 3
11. Kloumann IM, Kleinberg JM
12. Haveliwala TH
13. Jeh G, Widom J
14. Chung F
15. Kloster K, Gleich DF
16. 
17. Kamvar SD, Haveliwala TH, Manning CD, Golub GH
18. Bahmani B, Chowdhury A, Goel A
19. Lovász L, Simonovits M
20. Andersen R, Chung F, Lang K
21. Chung F
22. Spielman DA, Teng SH
23. Mahoney MW, Orecchia L
24. Zhou H, Woodruff D
25. Tsourakakis C
26. Holland PW, Laskey KB, Leinhardt S
27. Condon A, Karp RM
28. Dyer ME, Frieze AM
29. Erdős P, Rényi A
30. 
31. Chung FRK
32. Peng R, Sun H, Zanetti L
33. McSherry F
34. Becchetti L, Clementi AEF, Natale E, Pasquale F, Trevisan L
35. Decelle A, Krzakala F, Moore C, Zdeborová L
36. Mossel E, Neeman J, Sly A
37. Zhang P, Moore C
38. Krzakala F, et al.
39. Saade A, Krzakala F, Zdeborová L

## Article Classifications

- Physical Sciences
- Computer Sciences