
# Accurate and scalable social recommendation using mixed-membership stochastic block models

Edited by Edoardo Airoldi, Harvard University, and accepted by Editorial Board Member Adrian E. Raftery October 18, 2016 (received for review April 21, 2016)

## Significance

Recommendation systems are designed to predict users’ preferences and provide them with recommendations for items such as books or movies that suit their needs. Recent developments show that some probabilistic models for user preferences yield better predictions than latent feature models such as matrix factorization. However, it has not been possible to use them on large real-world datasets because they are not computationally efficient. We have developed a rigorous probabilistic model that outperforms leading approaches for recommendation and whose parameters can be fitted efficiently with an algorithm whose running time scales linearly with the size of the dataset. This model and inference algorithm open the door to more realistic approaches to recommendation and to other problems where matrix factorization is currently used.

## Abstract

With increasing amounts of information available, modeling and predicting user preferences—for books or articles, for example—are becoming more important. We present a collaborative filtering model, with an associated scalable algorithm, that makes accurate predictions of users’ ratings. Like previous approaches, we assume that there are groups of users and of items and that the rating a user gives an item is determined by their respective group memberships. However, we allow each user and each item to belong simultaneously to mixtures of different groups and, unlike many popular approaches such as matrix factorization, we do not assume that users in each group prefer a single group of items. In particular, we do not assume that ratings depend linearly on a measure of similarity, but allow probability distributions of ratings to depend freely on the user’s and item’s groups. The resulting overlapping groups and predicted ratings can be inferred with an expectation-maximization algorithm whose running time scales linearly with the number of observed ratings. Our approach enables us to predict user preferences in large datasets and is considerably more accurate than the current algorithms for such large datasets.

- recommender systems
- stochastic block model
- collaborative filtering
- social recommendation
- scalable algorithm

The goal of recommender systems is to predict what movies we are going to like, what books we are going to purchase, or even who we might be interested in dating. The rapidly growing amount of data on item reviews, ratings, and purchases from a growing number of online platforms holds the promise to facilitate the development of more informed models for recommendation. At the same time, however, it poses the challenge of developing algorithms that can handle such large amounts of data accurately and efficiently.

A plausible expectation when developing recommendation algorithms is that similar users relate to similar items in similar ways; e.g., they purchase similar items and give the same item similar ratings. This means that we can use the rating history of a set of users to make recommendations, even without knowing anything about the characteristics of users or items. This is the basic underlying assumption of collaborative filtering, one of the most common approaches in recommender systems (1). However, most research in recommender systems has focused on the development of scalable algorithms, often at the price of implicitly using models that are overly simplistic or unrealistic. For example, matrix factorization and latent feature approaches assume that users and items live in an abstract low-dimensional space, but whether such a space is expressive enough to accommodate the rich variety of user behaviors is rarely discussed. As a result, many current approaches have significantly lower accuracies than inference approaches based on models of user preferences that are socially more realistic (2). On the other hand, these more realistic approaches do not scale well with dataset size, which makes them impractical for large datasets.

Here, we develop a model and algorithm for predicting user ratings based on explicit probabilistic hypotheses about user behavior. As in some previous approaches, we assume that there are groups of users and of items and that the rating a user assigns to an item is determined probabilistically by their group memberships. However, we do not assign users and items to a single group; instead, we allow each user and each item to belong to mixtures of different groups (3, 4). In addition, unlike standard matrix factorization, we do not assume that ratings depend linearly on a measure of similarity between users and items; instead, we allow each pair of groups to have any probability distribution of ratings. We combine these elements to form a generative model, which assigns a precise probability to each possible rating. Fortunately, the inference problem for this model can be solved very efficiently: We give an expectation-maximization algorithm whose running time, per iteration, scales linearly with the number of observed ratings, and which converges rapidly.

We show that our approach consistently outperforms state-of-the-art recommendation algorithms, often by a large margin. In addition, our probabilistic predictions are better calibrated to real data in the frequentist sense (5), generating distributions of ratings that are statistically similar to real data. Moreover, because our model has a clear probabilistic interpretation, it can deal naturally with some situations that are challenging for other approaches, such as the cold start problem. We argue that our approach may also be suitable for other areas where matrix factorization is increasingly used, such as image reconstruction, textual data mining, cluster analysis, or pattern discovery (6–10).

## A Mixed-Membership Block Model with Metadata

Our approach begins with the mixed-membership stochastic block model (MMSBM), which has been used to model networks. As in the original MMSBM (3) and related models (11), we assume that each node in the bipartite graph of users and items belongs to a mixture of groups. However, unlike refs. 3 and 11, we do not assume that these group memberships affect the presence or absence of a link. Instead, we take the set of links as given and attempt to predict the ratings. We do this with an MMSBM-like model where the rating a user gives an item is drawn from a probability distribution that depends on their group memberships.

First we set down some notation. We have $N$ users and $M$ items, and a set $R^O$ of observed ratings $r_{ui} \in S$, where $S$ is the discrete set of possible rating values (e.g., one to five stars).

Our generative model for the ratings is as follows. There are $K$ groups of users and $L$ groups of items, and for each pair of groups $(k,l)$ there is a probability distribution $p_{kl}(r)$ over ratings, normalized so that $\sum_{r \in S} p_{kl}(r) = 1$.

To model mixed group memberships, each user $u$ has a vector $\theta_u$ whose component $\theta_{uk}$ is the probability that $u$ belongs to group $k$, and each item $i$ has an analogous vector $\eta_i$; these vectors are normalized, $\sum_k \theta_{uk} = \sum_l \eta_{il} = 1$. The probability that user $u$ gives item $i$ rating $r$ is then

$$\Pr[r_{ui} = r] = \sum_{k,l} \theta_{uk}\,\eta_{il}\,p_{kl}(r). \tag{1}$$

Abbreviating all these parameters as $\theta$, $\eta$, and $p$, the likelihood of the observed ratings is

$$P(R^O \mid \theta, \eta, p) = \prod_{(u,i) \in R^O} \sum_{k,l} \theta_{uk}\,\eta_{il}\,p_{kl}(r_{ui}). \tag{2}$$

As we discuss below, we infer the values of the parameters $\theta$, $\eta$, and $p$ that maximize this likelihood and use them to predict the unobserved ratings.
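Concretely, the predicted distribution for a user–item pair is a mixture over pairs of groups, weighted by the user’s and item’s memberships. The following sketch (with arbitrary dimensions and randomly generated parameters standing in for fitted ones) evaluates such a prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: N users, M items, K user groups, L item
# groups, and ratings taking S = 5 possible values.
N, M, K, L, S = 4, 3, 2, 2, 5

# Normalized mixed-membership vectors: each row sums to 1.
theta = rng.dirichlet(np.ones(K), size=N)   # user memberships
eta = rng.dirichlet(np.ones(L), size=M)     # item memberships
# p[k, l] is a probability distribution over the S possible ratings.
p = rng.dirichlet(np.ones(S), size=(K, L))

def rating_distribution(u, i):
    """Pr[r_ui = r] = sum over (k, l) of theta_uk * eta_il * p_kl(r)."""
    return np.einsum("k,l,kls->s", theta[u], eta[i], p)

dist = rating_distribution(0, 0)
print(dist)  # a proper probability distribution over the 5 ratings
```

Because each factor in the mixture is a normalized distribution, the result is automatically a valid distribution over the rating values.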

Our work differs from previous work on collaborative filtering in several ways. First, unlike matrix factorization approaches such as ref. 12 or their probabilistic counterparts (13–15), we do not think of the ratings as continuous variables that depend linearly on a measure of similarity between users and items; instead, we treat them as discrete values whose probability distributions can depend freely on the groups of the user and the item.

Second, we do not assume that the matrices $p_{kl}(r)$ have any particular structure; in particular, we do not assume that each group of users has a single corresponding group of items that it prefers, as linear models implicitly do.

Third, unlike some approaches that use inference methods similar to ours (17), as stated above, our goal is not to predict the existence of links. In particular, we do not assume that users see only movies (say) that they like, and we do not treat missing links as zeros or low ratings. To put this differently, we are not trying to complete the rating matrix with a low-rank approximation; we take the set of observed links as given and predict only the ratings on the pairs we wish to query.

As we describe below, our model also has the advantage of being mathematically tractable. It yields a highly efficient expectation-maximization algorithm for fitting the parameters: The time each iteration takes is linear in the number of users, items, and observed links. As a result, we are able to handle large datasets and achieve a higher accuracy than standard methods. We also show that the probabilistic predictions made by our model are well calibrated in the frequentist sense (5), producing distributions of ratings statistically similar to real data.

## Scalable Inference of Model Parameters

In most practical situations, marginalizing exactly over the group membership vectors is computationally intractable. Instead, we look for point estimates of the parameters $\theta$, $\eta$, and $p$ that maximize the likelihood of the observed ratings.

In particular, we use a classic variational approach (*Materials and Methods*) to obtain the following equations for the model parameters that maximize the likelihood:

$$\theta_{uk} = \frac{\sum_{i \in \partial u} \sum_{l} \omega_{ui}(k,l)}{d_u}, \tag{3}$$

$$\eta_{il} = \frac{\sum_{u \in \partial i} \sum_{k} \omega_{ui}(k,l)}{d_i}, \tag{4}$$

$$p_{kl}(r) = \frac{\sum_{(u,i) \in R^O \,\mid\, r_{ui}=r} \omega_{ui}(k,l)}{\sum_{(u,i) \in R^O} \omega_{ui}(k,l)}. \tag{5}$$

Here $\partial u$ is the set of items rated by user $u$, $d_u = |\partial u|$ is her degree (and analogously for items), and $\omega_{ui}(k,l)$ is the estimated probability that the rating $r_{ui}$ is due to $u$ and $i$ belonging to groups $k$ and $l$, respectively:

$$\omega_{ui}(k,l) = \frac{\theta_{uk}\,\eta_{il}\,p_{kl}(r_{ui})}{\sum_{k',l'} \theta_{uk'}\,\eta_{il'}\,p_{k'l'}(r_{ui})}. \tag{6}$$

These equations can be solved with an EM algorithm. Starting with an estimate of $\theta$, $\eta$, and $p$, we (*i*) (expectation step) use Eq. **6** to compute $\omega_{ui}(k,l)$ for each observed rating and (*ii*) (maximization step) use Eqs. **3**–**5** to compute new estimates of $\theta$, $\eta$, and $p$; we iterate until the parameters converge.

The number of parameters and terms in the sums in Eqs. **3**–**6** grows only linearly with the number of users, items, and observed ratings. As the set of observed ratings $R^O$ grows, the running time per iteration therefore scales linearly with $|R^O|$.
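The EM iteration of Eqs. **3**–**6** is straightforward to implement. The sketch below runs it on small synthetic data (the dimensions and ratings are arbitrary stand-ins, not any of the paper’s datasets); each pass over the observed ratings costs time proportional to their number:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K, L, S = 30, 20, 3, 3, 5

# Synthetic observed ratings: (user, item, rating index in 0..S-1).
obs = [(rng.integers(N), rng.integers(M), rng.integers(S)) for _ in range(300)]

# Random normalized initial parameters.
theta = rng.dirichlet(np.ones(K), size=N)   # user memberships
eta = rng.dirichlet(np.ones(L), size=M)     # item memberships
p = rng.dirichlet(np.ones(S), size=(K, L))  # per-group-pair rating distributions

def log_likelihood():
    return sum(np.log(theta[u] @ p[:, :, r] @ eta[i]) for u, i, r in obs)

ll_start = log_likelihood()
eps = 1e-12  # guards against division by zero for unobserved users/items
for _ in range(50):
    acc_theta = np.zeros_like(theta)
    acc_eta = np.zeros_like(eta)
    acc_p = np.zeros_like(p)
    for u, i, r in obs:
        # Expectation step (Eq. 6): responsibilities omega_ui(k, l).
        omega = theta[u][:, None] * eta[i][None, :] * p[:, :, r]
        omega /= omega.sum()
        # Accumulate the sums appearing in Eqs. 3-5.
        acc_theta[u] += omega.sum(axis=1)
        acc_eta[i] += omega.sum(axis=0)
        acc_p[:, :, r] += omega
    # Maximization step: normalize the accumulators.
    theta = acc_theta / (acc_theta.sum(axis=1, keepdims=True) + eps)
    eta = acc_eta / (acc_eta.sum(axis=1, keepdims=True) + eps)
    p = acc_p / (acc_p.sum(axis=2, keepdims=True) + eps)

print(ll_start, log_likelihood())  # the log-likelihood does not decrease
```

Note that each iteration touches every observed rating exactly once, which is the source of the linear scaling discussed above.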

## Results

### The MMSBM Predicts Ratings Accurately.

We test the performance of our algorithm on six datasets: the MovieLens 100K and 10M datasets; Yahoo! Songs; Amazon Books (18, 19); and the LibimSeTi.cz dating agency (20), which (because it is primarily heterosexual) we split into two datasets, consisting of males rating females and vice versa. These datasets are diverse in the types of items, the sizes of their sets of users and items, and the numbers of observed ratings.

We compare our algorithm to four benchmark algorithms (see *Supporting Information*, *Benchmark Algorithms*): a baseline naive algorithm that assigns to each test rating the empirical distribution of ratings observed in the training set, the item–item algorithm, matrix factorization (MF), and mixed-membership matrix factorization (MMMF).

For our algorithm, we set the numbers of groups to $K = L = 10$. Because iterating Eqs. **3**–**6** can lead to different fixed points depending on the initial conditions, we perform 500 independent runs. We average the predicted probability distribution of ratings over the resulting fixed points, because we find they typically have comparable likelihood values (Fig. S3).

We can translate the resulting probability distribution of ratings into a single predicted rating by choosing an estimator; which one is optimal depends on the loss function or, equivalently, on the measure of accuracy. We focus on two measures: For each algorithm, we define the accuracy as the fraction of ratings that are predicted exactly, and we also measure the mean absolute error (MAE). The optimal estimators for these two measures are the mode and the median of the distribution, respectively.
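For a single discrete distribution, the two estimators can be computed as follows (the distribution itself is a made-up example):

```python
import numpy as np

ratings = np.array([1, 2, 3, 4, 5])
# A hypothetical predicted distribution over the five ratings.
prob = np.array([0.4, 0.1, 0.1, 0.2, 0.2])

# Mode: maximizes the probability that the prediction is exactly correct.
mode = ratings[np.argmax(prob)]

# Median: minimizes the expected absolute error.
cdf = np.cumsum(prob)
median = ratings[np.searchsorted(cdf, 0.5)]

print(mode, median)  # 1 2
```

The example is chosen so that the two estimators disagree: a large spike at rating 1 fixes the mode, while the bulk of the probability mass pulls the median toward the middle of the scale.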

We find that in most datasets our approach outperforms the item–item algorithm, matrix factorization (MF), and MMMF (Fig. 1). Indeed, the accuracy, i.e., the fraction of exactly correct predictions, of the MMSBM is significantly higher than that of MF and MMMF for all of the datasets we tested and higher than the item–item algorithm in five of six datasets, the only exception being Amazon Books. The MAE of the MMSBM is the best (lowest) in four of the six datasets; item–item produces smaller MAE in Amazon Books; and item–item, MF, and MMMF produce smaller MAE in MovieLens 10M.

Interestingly, our approach produces results that are almost identical to those of the unmixed, fully marginalized SBM (2) for the two examples for which inference with the SBM is feasible. In particular, we achieve the same accuracy with an algorithm whose running time scales linearly with the size of the dataset, whereas inference for the SBM requires expensive sampling over group assignments and quickly becomes infeasible as datasets grow.

### The MMSBM Generalizes Matrix Factorization.

MF is one of the most successful and popular approaches to collaborative filtering, both in its classical (12) and its probabilistic form (13–17). However, as discussed, our MMSBM gives more accurate ratings, often by a large margin. Here, we propose an explanation for this improvement.

We start by giving an interpretation of MF as a special case of the MMSBM. In its simplest form, MF assumes that the expected rating that user $u$ gives item $i$ is the dot product of two low-dimensional latent feature vectors, one for the user and one for the item.

If we set the number of item groups equal to the number of user groups, $L = K$, interpret the normalized feature vectors as membership vectors, and require the matrix of expected ratings $\bar r_{kl} = \sum_r r\, p_{kl}(r)$ to be diagonal, then the expected rating in the MMSBM reduces to a dot product of the two membership vectors, recovering MF. In other words, MF implicitly assumes that each group of users has a single corresponding group of items that it prefers.

Our MMSBM relaxes these assumptions by allowing the distribution of ratings to be given by arbitrary matrices $p_{kl}(r)$, rather than forcing the matrix of expected ratings to be diagonal or the ratings to depend linearly on user–item similarity.

Moreover, the generality of the MMSBM allows it to account for many of the features of real ratings. For instance, different groups of users have different distributions of ratings: Users in one group may give mostly high ratings, whereas users in another group may spread their ratings across the whole scale, and the MMSBM can represent both behaviors for the same groups of items.

One can also compare our MMSBM to MMMF, because both attempt to take the mixed-membership nature of users and items into account. However, the analogy is not perfect: MMMF models ratings as the sum of an MF term and a correction that uses mixed group memberships that are unrelated to the feature vectors (22). Although this is an improvement over MF, it does not fundamentally remove the assumption that each group of users has a corresponding group of items that it prefers. Indeed, our numerical results show that the performance of MMMF is fairly close to that of MF in the datasets we considered.

### The MMSBM Makes Well-Calibrated Probabilistic Predictions.

Finally, our approach directly yields probabilistic predictions of the ratings, i.e., probability distributions over the discrete set of possible rating values, rather than single point estimates. This allows us to ask whether these predicted distributions are well calibrated, i.e., statistically consistent with the observed ratings (5).

As we show in Fig. 3, the predictions of the MMSBM are indeed probabilistically and marginally well calibrated. Thus, in addition to giving accurate ratings in the sense of the MAE and the probability the rating is exactly correct, the MMSBM generates predictions that are statistically similar to real data, indicating that it captures the stochastic nature of the rating process.

Because MF and MMMF produce Gaussian distributions of real-valued ratings, to perform analogous calibration experiments we transform their predictions into a discrete probability distribution by integrating over the real numbers closest to each allowed rating (*Supporting Information*). For instance, if the allowed ratings are the integers 1–5, the probability assigned to the rating 3 is the Gaussian mass on the interval [2.5, 3.5).
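A minimal sketch of this discretization, assuming integer ratings 1–5 and a Gaussian prediction with a hypothetical mean and SD, is:

```python
import math

def discretize_gaussian(mu, sigma, ratings=(1, 2, 3, 4, 5)):
    """Turn a Gaussian prediction N(mu, sigma^2) into a distribution over
    discrete ratings by integrating over the interval closest to each one."""
    def cdf(x):  # Gaussian CDF via the error function
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    probs = []
    for r in ratings:
        lo = -math.inf if r == ratings[0] else r - 0.5
        hi = math.inf if r == ratings[-1] else r + 0.5
        probs.append(cdf(hi) - cdf(lo))
    return probs

probs = discretize_gaussian(mu=3.2, sigma=0.8)
print([round(q, 3) for q in probs])  # sums to 1; mass concentrated near 3
```

The end intervals extend to infinity so that the probability mass outside the rating scale is absorbed into the extreme ratings and the discrete distribution still sums to one.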

### The MMSBM Provides a Principled Method to Deal with the Cold Start Problem.

Because the parameters of the MMSBM have a precise probabilistic interpretation, it can naturally deal with situations that are challenging for other algorithms. An example of this is the “cold start” problem, where we need to predict ratings for users or items for which we do not have training data (14, 23, 24).

In the MMSBM, the group membership vectors have a precise probabilistic meaning, so for a new user the natural prediction is obtained by assigning her the average membership vector of the observed users, which is our best estimate for a user about whom we know nothing (and analogously for new items).

In Fig. 4 we show that, in cold start situations, the MMSBM outperforms the other algorithms in most cases. MMSBM is always more accurate than MF and MMMF (although in one case the difference is not significant). In all but one case, the MMSBM is also more accurate than an algorithm that assigns the most common rating to an item. In terms of mean absolute error, our approach is more accurate than MF and MMMF in four of five datasets (in one, not significantly) and more accurate than using the most common rating in four of five cases.

Note that none of these approaches takes metadata on users or items into account, which is a standard approach to the cold start problem. For instance, one could assume that a new user will behave similarly to others of the same age, gender, etc. (Fig. S5) and compute the average membership vector over these users. We performed experiments restricting the average to users with the same gender and/or age, but we found that this did not significantly improve the performance (Fig. S6).
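Both the plain average and the metadata-restricted average reduce to simple operations on the fitted membership vectors; the sketch below uses randomly generated vectors and genders as stand-ins for fitted values:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
n_users = 100
# Hypothetical fitted membership vectors and user genders.
theta = rng.dirichlet(np.ones(K), size=n_users)
gender = rng.choice(["M", "F"], size=n_users)

# Cold start: assign a new user the average membership vector, which is
# still a valid (normalized) mixture over groups.
theta_cold = theta.mean(axis=0)

# Metadata variant: average only over users sharing the new user's gender.
theta_cold_f = theta[gender == "F"].mean(axis=0)

print(theta_cold, theta_cold_f)  # both rows remain normalized
```

Because averaging preserves normalization, the cold-start vector can be plugged directly into the prediction formula with no further adjustment.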

## Discussion

We have shown that the MMSBM with its associated expectation-maximization algorithm is an accurate and scalable method to predict user–item ratings in a variety of contexts. It significantly outperforms other algorithms, including MF and MMMF, in most of the datasets we considered, both maximizing the probability that the predicted rating is exactly correct and minimizing the mean absolute error.

Additionally, because the model and its parameters are readily interpretable, it can be extended to (and performs well in) situations that are challenging for other approaches, such as a cold start where no prior information is available about a new user or item; one could also consider extensions of the model that take into account metadata for users (e.g., age and gender) and/or items (e.g., genre), analogous to unmixed stochastic block models with node metadata (25).

Finally, because the MMSBM assigns a probability to each possible rating, it is amenable to frequentist calibration, and we found that its predictions are in fact statistically similar to real data as measured by probabilistic and marginal calibration (5). We believe that this performance is due to the fact that the MMSBM is a more expressive generalization of matrix factorization, allowing each pair of user and item groups to have an arbitrary probability distribution of ratings. Matrix factorization is a widely used tool with many applications beyond recommendation; given our findings, it may make sense to use the MMSBM in those other applications as well.

## Materials and Methods

We maximize the likelihood in Eq. **2** as a function of the parameters $\theta$, $\eta$, and $p$. For each observed rating we introduce an auxiliary distribution $\omega_{ui}(k,l)$ over pairs of groups and apply Jensen's inequality to obtain a lower bound on the log-likelihood.

Here, the bound is tight when $\omega_{ui}(k,l)$ equals the posterior probability that the rating $r_{ui}$ is due to $u$ and $i$ belonging to groups $k$ and $l$, which gives Eq. **6** for the expectation step.

For the maximization step, we derive update equations for the parameters by setting to zero the derivatives of the lower bound in Eq. **8**. Including Lagrange multipliers for the normalization constraints, we obtain Eqs. **3**–**5**.

## Scalability with the Size of the Dataset

## Datasets

We perform experiments on six different datasets: the MovieLens 100K and 10M datasets, Yahoo! Songs, Amazon Books, and the LibimSeTi.cz dating agency. We split the LibimSeTi.cz dataset into two datasets: women rating men (W-M) and men rating women (M-W). We neglected the links of women rating women and men rating men, because these links constituted only 1% of the dataset. In Table S1, we show the characteristics of each dataset in terms of the scale of ratings and the numbers of users, items, and observed ratings.

## Benchmark Algorithms

### Naive Model.

As a baseline for comparison, we consider a naive model. Its prediction for every rating is the empirical distribution of the ratings observed in the training set.

### Item–Item.

The item–item algorithm uses the cosine similarity between items, based on the vectors of ratings each item has received from all users. To predict the rating a user gives an item, it takes a similarity-weighted average of the ratings that user has given to the most similar items.
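A minimal sketch of this scheme, on a made-up rating matrix where 0 marks an unobserved rating:

```python
import numpy as np

# Hypothetical user-by-item rating matrix; 0 means "not rated".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 4, 4],
], dtype=float)

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom > 0 else 0.0

def predict(u, i):
    """Similarity-weighted average of user u's ratings of the other items."""
    num = den = 0.0
    for j in range(R.shape[1]):
        if j != i and R[u, j] > 0:
            s = cosine_sim(R[:, i], R[:, j])  # similarity between items i, j
            num += s * R[u, j]
            den += abs(s)
    return num / den if den > 0 else 0.0

print(round(predict(1, 1), 2))  # predicted rating of item 1 by user 1
```

A production implementation would restrict the sum to the top-$k$ most similar items and adjust for the fact that unrated entries are not truly zero, but the core weighted-average computation is as above.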

### MF.

One of the most widely used recommendation algorithms is MF (12, 26). As in the block model, the intuition behind matrix factorization is that there should be some latent features that determine how a user rates an item. However, MF uses linear algebra to reduce the dimensionality of the problem. Specifically, it assumes that the matrix of ratings can be approximated by the product of two low-rank matrices of user features and item features.

We then assume that some noise and/or bias has been applied to this low-rank product and fit the feature matrices by minimizing the regularized sum of squared errors over the observed ratings.

To perform the calibration analysis we need to compute the ratings’ probability distribution. Because in MF we obtain the prediction by minimizing the sum of quadratic errors, this is equivalent to obtaining the maximum-likelihood estimators of the parameters when ratings are subject to Gaussian noise of fixed variance (Eqs. **S2** and **S3**).

In this case, the probability of each rating is obtained by integrating the Gaussian density over the interval of real values closest to that rating.

### MMMF.

The MMMF algorithm combines matrix factorization with a mixed-membership contextual bias. In MMMF, users and items are endowed with both latent feature vectors (as in MF) and discrete mixed-membership group vectors that contribute a bias to the predicted rating (22).

In ref. 22 the authors consider two different MMMF models that differ in how the contextual bias is built. The topic-indexed bias (TIB) model assumes that the contextual bias decomposes into a latent user bias and a latent item bias, each indexed by the group and topic drawn for that user–item pair.

We use their code implementation (https://code.google.com/archive/p/m3f/).

The probability distribution of ratings for the MMMF follows a procedure similar to that for MF. In this case, the probabilistic interpretation arises directly from Eq. **S6**: The deviation from the predicted rating is assumed to be Gaussian, and the probability of each discrete rating is obtained by integrating over the interval of real values closest to it.

### SBM.

The SBM (28–30) assumes that the probability that two nodes form a link between them, such as a relationship between actors in a social network, depends on the groups to which they belong. Analogously, the SBM recommender algorithm (2) assumes that the probability of a rating depends only on the (single) groups of the corresponding user and item. The prediction (Eq. **S8**) also averages over all assignments of nodes to groups, weighted by their likelihood, which requires sampling and does not scale to large datasets.

## Performance as a Function of the Number of Groups

## Comparison of Maximum-Likelihood Predictions vs. Predictions Averaged over All Sampled Solutions

The solutions obtained from the sampling set of 500 independent initial conditions are different but typically similar to each other, as shown in Fig. S3 (*Left* column). Therefore, it is not clear that the solution with the maximum likelihood over the whole set should be singled out as the best one. This is different from what one would expect in well-behaved physical systems, where typically one solution is much better than the others. As a result, we built our predictions by averaging over the different maximum-likelihood solutions: For each solution we get a probability distribution of ratings for each user–item pair, and we average these distributions over all solutions in the set.
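Averaging the distributions (rather than the point predictions) over runs is a one-line operation; the sketch below uses random distributions as stand-ins for the fixed points of independent EM runs:

```python
import numpy as np

rng = np.random.default_rng(3)
S = 5
n_runs = 500

# Hypothetical predicted rating distributions for one user-item pair,
# one per independently initialized EM run.
run_predictions = rng.dirichlet(np.ones(S) * 20, size=n_runs)

# Average the distributions, not the point estimates, over all runs.
avg = run_predictions.mean(axis=0)
predicted_rating = 1 + int(np.argmax(avg))  # mode of the averaged distribution
print(avg.round(3), predicted_rating)
```

Averaging distributions preserves normalization, and any single-rating estimator (mode, median) can then be read off the averaged distribution.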

## Top-Membership Scores Coverage for MF and MMSBM

As explained in the main text, both MF and the MMSBM have mixing or membership vectors as their main parameters. Differences in the distributions of these parameters could contribute to the differences in accuracy we observe. Indeed, flat distributions would mean that there are truly no strong group-membership patterns, and we would expect poor performance. To investigate whether this could be the cause of the different performances of MF and the MMSBM, we compute the probability distributions of the membership vector scores for both algorithms; we define the MF membership vector of a user as her normalized vector of features.

## Comparison of the Social Trends Found with Different Approaches

In a collaborative-filtering approach to recommendation, the assumption is that one can predict user affinities to query items based on the affinities of similar users to those items. Then, a plausible hypothesis is that similar users are likely to share some demographic characteristics such as gender or age. In particular, from the different user representations in each of the models in the main text we can investigate the social and psychological processes that determine user behaviors. To illustrate this idea, we analyze the user profiles in the MovieLens 100K dataset, which lists the age and gender of each user.

Specifically, we compare the user profiles of pairs of users under each representation: the membership vectors in the MMSBM and the *K*-dimensional feature vectors in MF. The users’ profiles in a user–user approach would be analogous to the item’s vector of ratings in the item–item algorithm; that is, each user is represented by an *M*-dimensional vector whose entries are the rating values he or she gives to each movie, or zero otherwise.

Fig. S5 shows that, independently of the model we use, when we divide users according to gender, pairs of male users have more similar profiles than pairs of female users or male–female pairs. However, the approaches differ when we combine gender and age to define user groups: Whereas our MMSBM and the user–user approach suggest that pair similarities decrease with age, MF shows the opposite trend.

## Users’ Cold Start Approach Using Gender and Age Similarities

Here we investigate whether the similarities found between certain classes of users (Fig. S5) can be leveraged to provide better predictions in cold start situations. For our analysis we used the MovieLens 100K dataset, for which we know the gender and age of each user. We have no cold start users in the MovieLens 100K dataset. Nonetheless, we simulate cold start situations by removing all ratings of a number of users from the training set and making predictions on those users. In particular, we perform leave-one-out cross-validation experiments over 50 males and 50 females in the dataset.

Our approach in the main text was to assign to any new user the average membership vector computed over all observed users.

Here, we consider two approaches. First, we use two distinct group membership vectors, one averaged over the observed male users and one averaged over the observed female users, and we assign to each cold start user the vector that corresponds to his or her gender.

Second, for each cold start user, we use a weighted average over observed users, where the weight is the similarity between the queried user and the known users in terms of their age and gender.

## Acknowledgments

We thank C. Shalizi for helpful comments and suggestions. We also thank L. Mackey for providing us with his code and assistance for the MMMF algorithm. This work was supported by a James S. McDonnell Foundation Research Award (to R.G. and M.S.-P.), Spanish Ministerio de Economía y Competitividad Grants FIS2013-47532-C3 (to A.G.L., R.G., and M.S.-P.) and FIS2015-71563-ERC (to R.G.), European Union Future and Emerging Technologies (FET) Grant 317532 [multilevel complex networks and systems (MULTIPLEX), to R.G. and M.S.-P.], the John Templeton Foundation (C.M.), and the Army Research Office (ARO) under Contract W911NF-12-R-0012 (to C.M.).

## Footnotes

^{1}To whom correspondence should be addressed. Email: antonia.godoy@urv.cat.

Author contributions: A.G.-L., R.G., C.M., and M.S.-P. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. E.A. is a Guest Editor invited by the Editorial Board.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1606316113/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

1. Su X, Khoshgoftaar TM.
2.
3.
4. Peixoto TP.
5. Gneiting T, Balabdaoui F, Raftery AE.
6. Cemgil AT.
7.
8. Ding C, He X, Simon HD.
9. Kim J, Park H.
10. Brunet JP, Tamayo P, Golub TR, Mesirov JP.
11.
12.
13. Meeds E, Ghahramani Z, Neal RM, Roweis ST.
14. Salakhutdinov R, Mnih A.
15. Shan H, Banerjee A.
16. Ekstrand MD, Ludwig M, Konstan JA, Riedl JT. *Proceedings of the Fifth ACM Conference on Recommender Systems* [Association for Computing Machinery (ACM), New York], pp 133–140.
17. Gopalan P, Hofman JM, Blei DM.
18. McAuley J, Targett C, Shi Q, van den Hengel A.
19. McAuley J, Pandey R, Leskovec J.
20. Brozovsky L, Petricek V.
21. Sarwar B, Karypis G, Konstan J, Riedl J.
22. Mackey L, Weiss D, Jordan MI.
23. Schein AI, Popescul A, Ungar LH, Pennock DM.
24. Park ST, Chu W.
25. Newman MEJ, Clauset A.
26. Paterek A.
27. Gardner W.
28. Holland PW, Laskey KB, Leinhardt S.
29.
30. Guimerà R, Sales-Pardo M.
