## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# The impossibility of low-rank representations for triangle-rich complex networks

Edited by Mark E. J. Newman, University of Michigan, Ann Arbor, MI, and accepted by Editorial Board Member Peter J. Bickel February 1, 2020 (received for review June 26, 2019)

## Significance

Our main message is that the popular method of low-dimensional embeddings provably cannot capture important properties of real-world complex networks. A widely used algorithmic technique for modeling these networks is to construct a low-dimensional Euclidean embedding of the vertices of the network, where proximity of vertices is interpreted as the likelihood of an edge. Contrary to common wisdom, we argue that such graph embeddings do not capture salient properties of complex networks. We mathematically prove that low-dimensional embeddings cannot generate graphs with both low average degree and large clustering coefficients, which have been widely established to be empirically true for real-world networks. This establishes that popular low-dimensional embedding methods fail to capture significant structural aspects of real-world complex networks.

## Abstract

The study of complex networks is a significant development in modern science, and has enriched the social sciences, biology, physics, and computer science. Models and algorithms for such networks are pervasive in our society, and impact human behavior via social networks, search engines, and recommender systems, to name a few. A widely used algorithmic technique for modeling such complex networks is to construct a low-dimensional Euclidean embedding of the vertices of the network, where proximity of vertices is interpreted as the likelihood of an edge. Contrary to the common view, we argue that such graph embeddings do not capture salient properties of complex networks. The two properties we focus on are low degree and large clustering coefficients, which have been widely established to be empirically true for real-world networks. We mathematically prove that any embedding (that uses dot products to measure similarity) that can successfully create these two properties must have a rank that is nearly linear in the number of vertices. Among other implications, this establishes that popular embedding techniques such as singular value decomposition and node2vec fail to capture significant structural aspects of real-world complex networks. Furthermore, we empirically study a number of different embedding techniques based on dot product, and show that they all fail to capture the triangle structure.

- graph embeddings
- graph representations
- low-dimensional embeddings
- low-rank representations
- singular value decomposition

Complex networks (or graphs) are a fundamental object of study in modern science, across domains as diverse as the social sciences, biology, physics, computer science, and engineering (1⇓–3). Designing good models for these networks is a crucial area of research, and also affects society at large, given the role of online social networks in modern human interaction (4⇓–6). Complex networks are massive, high-dimensional, discrete objects, and are challenging to work with in a modeling context. A common method of dealing with this challenge is to construct a low-dimensional Euclidean embedding that tries to capture the structure of the network (see ref. 7 for a recent survey). Formally, we think of the n vertices as vectors

The most important method to get such embeddings is the singular value decomposition (SVD) or other matrix factorizations of the adjacency matrix (8). Recently, there has also been an explosion of interest in using methods from deep neural networks to learn such graph embeddings (9⇓⇓–12) (refer to ref. 7 for more references). Regardless of the specific method, a key goal in building an embedding is to keep the dimension d small—while trying to preserve the network structure—as the embeddings are used in a variety of downstream modeling tasks such as graph clustering, nearest-neighbor search, and link prediction (13). Yet a fundamental question remains unanswered: To what extent do such low-dimensional embeddings actually capture the structure of a complex network?

These models are often justified by treating the (few) dimensions as “interests” of individuals, and using similarity of interests (dot product) to form edges. Contrary to the dominant view, we argue that low-dimensional embeddings are not good representations of complex networks. We demonstrate mathematically and empirically that they lose local structure, one of the hallmarks of complex networks. This runs counter to the ubiquitous use of SVD in data analysis. The weaknesses of SVD have been empirically observed in recommendation tasks (14⇓–16), and our result provides a mathematical validation of these findings.

Let us define the setting formally. Consider a set of vectors *Alternate Models*. Matrix factorization is a popular method to obtain such a vector representation: The original adjacency matrix A is “factorized” as

Two hallmarks of real-world graphs are 1) sparsity, where the average degree is typically constant with respect to n, and 2) triangle density, where there are many triangles incident to low-degree vertices (5, 20⇓–22). The large number of triangles is considered a local manifestation of community structure. Triangle counts have a rich history in the analysis and algorithmics of complex networks. Concretely, we measure these properties simultaneously as follows.

**Definition 1.** *For parameters* *and* *, a graph* G *with* n *vertices has a* *-triangle foundation if there are at least* *triangles contained among vertices of degree, at most,* c*. Formally, let* *be the set of vertices of degree, at most,* c*. Then, the number of triangles in the graph induced by* *is at least*

Typically, we think of both c and Δ as constants. We emphasize that n is the total number of vertices in G, not the number of vertices in S (as defined above). Refer to real-world graphs in Table 1. In Fig. 1, we plot the value of c vs. Δ. (Specifically, the y axis is the number of triangles divided by n.) This is obtained by simply counting the number of triangles contained in the set of vertices of degree, at most, c. Observe that, for all graphs, for

Our main result is that any embedding of graphs that generates graphs with

**Theorem 1.** *Fix* *. Suppose the expected number of triangles in* *that only involve vertices of expected degree* c *is at least* *. Then, the rank of* V *is at least*

Equivalently, graphs generated from low-dimensional embeddings cannot contain many triangles only on low-degree vertices. We point out an important implication of this theorem for Stochastic Block Models. In this model, each vertex is modeled as a vector in **Theorem 1** implies that d must be

## Empirical Validation

We empirically validate the theory on a collection of complex networks detailed in Table 1. For each real-world graph, we compute a 100-dimensional embedding through SVD (basically, the top 100 singular vectors of the adjacency matrix). We generate 100 samples of graphs from these embeddings, and compute their c vs. Δ plot. This is plotted with the true c vs. Δ plot. (To account for statistical variation, we plot the maximum value of Δ observed in the samples, over all graphs. The variation observed was negligible.) Fig. 1 shows such a plot for a physics coauthorship network. More results are given in *SI Appendix*.

Note that this plot is significantly off the mark at low degrees for the embedding. Around the lowest degree, the value of Δ (for the graphs generated by the embedding) is two to three order of magnitude smaller than the original value. This demonstrates that the local triangle structure is destroyed around low-degree vertices. Interestingly, the total number of triangles is preserved well, as shown toward the right side of each plot. Thus, a nuanced view of the triangle distribution, as given in **Definition 1**, is required to see the shortcomings of low dimensional embeddings.

## Alternate Models

We note that several other functions of dot product have been proposed in the literature, such as the softmax function (10, 12) and linear models of the dot product (7). **Theorem 1** does not have direct implications for such models, but our empirical validation holds for them as well. The embedding in **Theorem 1** uses the *truncated dot product* (TDP) function *Alternate Graph Models*.)

For each of these models (softmax, LRDP, and LRHP), we perform the same experiment described above. Fig. 1 also shows the plots for these other models. Observe that none of them capture the low-degree triangle structure, and their Δ values are all two to three orders of magnitude lower than the original.

In addition (to the extent possible), we compute vector embeddings from a recent deep learning-based method [node2vec (12)]. We again use all of the edge probability models discussed above, and perform an identical experiment (in Fig. 1, these are denoted by “n2v”). Again, we observe that the low-degree triangle behavior is not captured by these deep learned embeddings.

## Broader Context

The use of geometric embeddings for graph analysis has a rich history, arguably going back to spectral clustering (23). In recent years, the Stochastic Block Model has become quite popular in the statistics and algorithms community (17), and the Random Dot Product Graph model is a generalization of this notion [refer to recent surveys (19, 24)]. As mentioned earlier, **Theorem 1** brings into question the standard uses of these methods to model social networks. The use of vectors to represent vertices is sometimes referred to as *latent space models*, where geometric proximity models the likelihood of an edge. Although dot products are widely used, we note that some classic latent space approaches use Euclidean distance (as opposed to dot product) to model edge probabilities (25), and this may avoid the lower bound of **Theorem 1**. Beyond graph analysis, the method of Latent Semantic Indexing also falls in the setting of **Theorem 1**, wherein we have a low-dimensional embedding of “objects” (like documents), and similarity is measured by dot product (https://en.wikipedia.org/wiki/Latent_semantic_analysis).

## High-Level Description of the Proof

In this section, we sketch the proof of **Theorem 1**. The sketch provides sufficient detail for a reader who wants to understand the reasoning behind our result, but is not concerned with technical details. We will make the simplifying assumption that all *Dealing with Varying Lengths* provides intuition on how we can remove this assumption. The full details of the proof are given in *Proof of Theorem 1*.

First, we lower-bound L. By Cauchy–Schwartz,

The expected number of triangles in

The expected number of triangles is at least

**Lemma 1** (Rank lemma). *Consider any square matrix* *. Then*

We plug these bounds into the rank lemma. We use the fact that

### Dealing with Varying Lengths.

The math behind Eq. **4** still holds with the right approximations. Intuitively, the existence of at least

The rank lemma is the main technical tool that formalizes this intuition. When vectors are of varying length, the primary obstacle is the presence of extremely long vectors that create triangles. The numerator in the rank lemma sums

But, because the vectors inhabit low-dimensional space, the long vectors from different clusters interact with each other. We prove a “packing” lemma (**Lemma 5**) showing that there must be many large positive dot products among a set of extremely long vectors. Thus, many of the corresponding vertices have large degree, and triangles incident to these vertices do not contribute to low-degree triangles. Operationally, the main proof uses the packing lemma to show that there are few long vectors. These can be removed without affecting the low-degree structure. One can then perform a binning (or “rounding”) of the lengths of the remaining vectors, to implement the proof described in the above section.

## Proof of Theorem 1

For convenience, we restate the setting. Consider a set of vectors

Let

### The Basic Tools.

We now state some results that will be used in the final proof. **Lemma 2** is an existing result. For all other statements, the proofs are provided in *SI Appendix*.

**Lemma 2.** *[Rank lemma* (26)*] Consider any square matrix* *. Then*

**Lemma 3.** *Consider a set of s vectors* *in*

**Lemma 4.** *Any graph with* h *vertices and maximum degree* b *has an independent set of at least*

**Proposition 1.** *Consider the distribution* *. Let* *denote the degree of vertex*

A key component of dealing with arbitrary-length vectors is the following dot product lemma. This is inspired by results of Alon (27) and Tao (28), who get a stronger lower bound of *absolute values* of the dot products.

**Lemma 5.** *Consider any set of* *unit vectors* *in* *. There exists some* *such that*

### The Main Argument.

We prove by contradiction. We assume that the expected number of triangles contained in the set of vertices of expected degree, at most, c is at least

The overall proof can be thought of in three parts.

*Part 1, remove extremely long vectors:* Our final aim is to use the rank lemma (**Lemma 2**) to lower bound the rank of V. The first problem we encounter is that extremely long vectors can dominate the expressions in the rank lemma, and we do not get useful bounds. We show that the number of such long vectors is extremely small, and they can be removed without affecting too many triangles. In addition, we can also remove extremely small vectors, since they cannot participate in many triangles.

*Part 2, find a “core” of sufficiently long vectors that contains enough triangles:* The previous step gets a “cleaned” set of vectors. Now, we bucket these vectors by length. We show that there is a large bucket, with vectors that are sufficiently long, such that there are enough triangles contained in this bucket.

*Part 3, apply the rank lemma to the “core”:* We now focus on this core of vectors, where the rank lemma can be applied. At this stage, the mathematics shown in *High-Level Description of the Proof* can be carried out almost directly.

Now for the formal proof. For the sake of contradiction, we assume that

**Part 1: Removing extremely long (and extremely short) vectors**

We begin by showing that there cannot be many long vectors in

**Lemma 6.** *There are, at most,* *vectors of length at least*

*Proof*. Let L be the set of “long” vectors, those with a length of at least

Formally, for any edge **Lemma 4**, H contains an independent set I of a size of at least **Lemma 5** to this sequence, we deduce the existence of

Denote by

**Proposition 2.** *The expected degree of every vertex in* *is, at most,* c*, and the expected number of triangles in* G *is at least*

*Proof*. Since removal of vectors can only decrease the degree, the expected degree of every vertex in

We start with the first type. By **Lemma 6**, there are, at most, **Proposition 1**, the expected degree squares is, at most,

Consider any triple of vectors

Thus, the expected number of triangles in

**Part 2: Finding core of sufficiently long vectors with enough triangles**

For any integer r, let

**Proposition 3.** *The expected number of triangles in* *is at least*

*Proof*. The total number of vectors in **Proposition 1** and linearity of expectation, the expected sum of squares of degrees of all vectors in **Proposition 2**) and the expected number of triangles incident to vectors in

We now come to an important proposition. Because the expected number of triangles in

**Proposition 4.**

*Proof*. Suppose not. Then every vector in **Proposition 1**, this is, at most, **Proposition 3**. □

**Part 3: Applying the rank lemma to the core**

We are ready to apply the rank bound of **Lemma 2** to prove the final result. The following lemma contradicts our initial bound on the rank d, completing the proof. We will omit some details in the following proof, and provide a full proof in *SI Appendix*.

**Lemma 7.**

*Proof*. It is convenient to denote the index set of **Lemma 2**,

We lower-bound the numerator.*SI Appendix*. The main upshot is that we can prove

Crucially, by **Proposition 4**,

□

## Details of Empirical Results

### Data Availability.

The datasets used are summarized in Table 1. We present here four publicly available datasets from different domains. The ca-HepPh is a coauthorship network, Facebook is a social network, and cit-HepPh is a citation network, all obtained from the SNAP graph database (29). The String_hs dataset is a protein–protein interaction network obtained from ref. 30. (The citations provide the link to obtain the corresponding datasets.)

We first describe the primary experiment, used to validate **Theorem 1** on the SVD embedding. We generated a d-dimensional embedding for various values of d using the SVD. Let G be a graph with the

From the spectral embeddings, we generate a graph from

### Triangle Distributions.

To generate Figs. 1 and 2, we calculated the number of triangles incident to vertices of different degrees in both the original graphs and the graphs generated from the embeddings. Each of the plots shows the number of triangles in the graph on the vertical axis and the degrees of vertices on the horizontal axis. Each curve corresponds to some graph, and each point

### Alternate Graph Models.

We consider three other functions of the dot product, to construct graph distributions from the vector embeddings. Details on parameter settings and the procedure used for the optimization are given in *SI Appendix*.

#### LRDP.

We consider the probability of an edge

*LRHP*.

This is inspired by linear models used on low-dimensional embeddings (7). Define the Hadamard product

#### Softmax.

This is inspired by low-dimensional embeddings for random walk matrices (10, 12). The idea is to make the probability of edge

#### node2vec experiments.

We also applied node2vec, a recent deep learning-based graph embedding method (12), to generate vector representations of the vertices. We use default parameters to run node2vec. (More details are provided in *SI Appendix*.) The node2vec algorithm tries to model the random walk matrix associated with a graph, not the raw adjacency matrix. The dot products between the output vectors

In Figs. 1 and 2, we show results for all of the datasets. We note that, for all datasets and all embeddings, the models fail to capture the low-degree triangle behavior.

### Degree Distributions.

We observe that the low-dimensional embeddings obtained from SVD and TDP can capture the degree distribution accurately. In Fig. 3, we plot the degree distribution (in loglog scale) of the original graph with the expected degree distribution of the embedding. For each vertex i, we can compute its expected degree by the sum

## Acknowledgments

C.S. acknowledges the support of NSF Awards CCF-1740850 and CCF-1813165, and ARO Award W911NF1910294.

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: sesh{at}ucsc.edu.

Author contributions: C.S., A. Sharma, A. Stolman, and A.G. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission. M.E.J.N. is a guest editor invited by the Editorial Board.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1911030117/-/DCSupplemental.

- Copyright © 2020 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY).

## References

- ↵
- S. Wasserman,
- K. Faust

- ↵
- ↵
- D. Easley,
- J. Kleinberg

- ↵
- A. L. Barabasi,
- R. Albert

- ↵
- ↵
- D. Chakrabarti,
- C. Faloutsos

- ↵
- W. L. Hamilton,
- R. Ying,
- J. Leskovec

- ↵
- A. Ahmed,
- N. Shervashidze,
- S. Narayanamurthy,
- V. Josifovski,
- A. J. Smola

- ↵
- S. Cao,
- W. Lu,
- Q. Xu

- ↵
- B. Perozzi,
- R. Al-Rfou,
- S. Skiena

- ↵
- J. Tang et al.

- ↵
- A. Grover,
- J. Leskovec

- ↵@twittereng, Embeddings@twitter. https://blog.twitter.com/engineering/en_us/topics/insights/2018/embeddingsattwitter.html.
- ↵
- B. Bahmani,
- A. Chowdhury,
- A. Goel

- ↵
- P. Gupta et al.

- ↵
- I. M. Kloumann,
- J. Ugander,
- J. Kleinberg

- ↵
- ↵
- S. J. Young,
- E. R. Scheinerman

- ↵
- A. Athreya et al.

- ↵
- A. Sala et al.

- ↵
- C. Seshadhri,
- T. G. Kolda,
- A. Pinar

- ↵
- N. Durak,
- A. Pinar,
- T. G. Kolda,
- C. Seshadhri

- ↵
- M. Fiedler

- ↵
- E. Abbe

- ↵
- ↵
- ↵
- N. Alon

- ↵
- T. Tao

- ↵
- J. Leskovec

- ↵STRING Consortium, String database. http://version10.string-db.org/. Accessed 1 November 2019.
- J. Tang,
- J. Zhang,
- L. Yao,
- J. Li,
- L. Zhang,
- Z. Su

- L. Yao,
- J. Li,
- L. Zhang,
- Z. Su

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Physical Sciences
- Computer Sciences