# Randomized approximate nearest neighbors algorithm


Contributed by Peter Wilcox Jones, May 15, 2011 (sent for review January 24, 2011)

## Abstract

We present a randomized algorithm for the approximate nearest neighbor problem in *d*-dimensional Euclidean space. Given *N* points {*x*_{j}} in ℝ^{d}, the algorithm attempts to find *k* nearest neighbors for each of *x*_{j}, where *k* is a user-specified integer parameter. The algorithm is iterative, and its running time requirements are proportional to *T*·*N*·(*d*·(log *d*) + *k*·(*d* + log *k*)·(log *N*)) + *N*·*k*^{2}·(*d* + log *k*), with *T* the number of iterations performed. The memory requirements of the procedure are of the order *N*·(*d* + *k*). A by-product of the scheme is a data structure, permitting a rapid search for the *k* nearest neighbors among {*x*_{j}} for an arbitrary point *y* ∈ ℝ^{d}. The cost of each such query is proportional to *T*·(*d*·(log *d*) + log(*N*/*k*)·*k*·(*d* + log *k*)), and the memory requirements for the requisite data structure are of the order *N*·(*d* + *k*) + *T*·(*d* + *N*). The algorithm utilizes random rotations and a basic divide-and-conquer scheme, followed by a local graph search. We analyze the scheme’s behavior for certain types of distributions of {*x*_{j}} and illustrate its performance via several numerical examples.

In this paper, we describe an algorithm for finding approximate nearest neighbors (ANNs) in *d*-dimensional Euclidean space for each of *N* user-specified points {*x*_{j}}. For each point *x*_{j}, the scheme produces a list of *k* “suspects” that have high probability of being the *k* closest points (nearest neighbors) in the Euclidean metric. Those of the suspects that are not among the “true” nearest neighbors are close to being so.

We present several measures of performance (in terms of statistics of the *k* chosen suspected nearest neighbors), for different types of randomly generated datasets consisting of *N* points in ℝ^{d}. Unlike other ANN algorithms that have been recently proposed (see, e.g., ref. 1), the method of this paper does not use locality-sensitive hashing. Instead we use a simple randomized divide-and-conquer approach. The basic algorithm is iterated several times and then followed by a local graph search.

The performance of any fast ANN algorithm must deteriorate as the dimension *d* increases. Although the running time of our algorithm only grows as *d*· log *d*, the statistics of the selected approximate nearest neighbors deteriorate as the dimension *d* increases. We provide bounds for this deterioration (both analytically and empirically), which occurs reasonably slowly as *d* increases. Although the actual estimates are fairly complicated, it is reasonable to say that in 20 dimensions the scheme performs extremely well, and the performance does not seriously deteriorate until *d* is approximately 60. At *d* = 100, the degradation of the statistics displayed by the algorithm is quite noticeable.

An outline of our algorithm is as follows:

1. Choose a random rotation, acting on ℝ^{d}, and rotate the *N* given points.
2. Take the first coordinate and divide the dataset into two boxes, where the boxes are divided by finding the median in the first coordinate.
3. On each box from step 2, repeat the subdivision on the second coordinate, obtaining four boxes in total.
4. Repeat this on coordinates 3, 4, etc., until each of the boxes has approximately *k* points.
5. Do a local search on the tree of boxes to obtain approximately *k* suspects for each point *x*_{j}.
6. Iterate the above procedure *T* times, and for each point *x*_{j}, select from the *T*·*k* suspects the *k* closest discovered points for *x*_{j}.
7. Perform a local graph search on the collections of suspects obtained in step 6 (we call this local graph search “supercharging”). Among the *k*^{2} “candidates” obtained from the local graph search, select the best *k* points and declare these “the suspected approximate nearest neighbors,” or suspects.

The data structure generated by this algorithm allows one to find, for a new data point **y**, the *k* suspected approximate nearest neighbors in the original dataset. This search is quite rapid, as we need only follow the already generated tree structure of the boxes, obtained in the steps listed above. One can easily see that the depth of the binary tree, generated by steps 1 through 4, is log_{2}(*N*/*k*). This means that we can use the *T* trees generated and then pass to step 7.

Almost all known techniques for solving ANN problems use tree structures (see, e.g., refs. 1–3). Two apparently unique features of our method are the use of fast random rotations (step 1), and the local graph search (step 7), which dramatically increases the accuracy of the scheme. We use the fast Fourier transform to generate our random rotations, and this accounts for the factor of log *d* that appears in the running time. Our use of random rotations replaces the usual projection argument used in other ANN algorithms, where one projects the data on a random subspace. As far as we know, the use of fast rotations for applications of this type appears in ref. 3 (see ref. 4 and the references therein for a brief history). The use of random rotations (as in our paper) or random projections (as used elsewhere in ANN algorithms) takes advantage of the same underlying phenomenon; namely the Johnson–Lindenstrauss lemma. (The JL lemma roughly states that projection of *N* points on a random subspace of dimension *C*(*ε*)·(log *N*) has expected distortion 1 + *ε*; see, e.g., ref. 5.) We have chosen to use random rotations in place of the usual random projections generated by selecting random Gaussian vectors. The fast random rotations require *O*(*d*·(log *d*)) operations, which is an improvement over methods using random projections (see refs. 6 and 7).

The *N* × *k* lookup table arising in step 7 is the adjacency matrix of a graph whose vertices are the points {*x*_{j}}. In step 7 we perform a depth one search on this graph and obtain ≤ *k* + *k*^{2} candidates (of whom we select the suspects). This accounts for the factor of *k*^{2} in the running time. Because of degradation of the running time, we have chosen not to perform searches of depth greater than one.

The algorithm has been tested on a number of artificially generated point distributions. Results of some of those tests are presented below.

The paper is organized as follows. In the first section, we summarize the mathematical and numerical facts to be used in subsequent sections. In the second section, we describe the randomized approximate nearest neighbors algorithm (RANN) and analyze its cost and performance. In the third section, we illustrate the performance of the algorithm with several numerical examples.

## Mathematical Preliminaries

In this section, we introduce notation and summarize several facts to be used in the rest of the paper.

### Degree of Contact.

Suppose that *L* > 0 is a positive integer. Suppose further that

[1] **σ** = (σ(1),…,σ(*L*)),  **μ** = (μ(1),…,μ(*L*))

are two words of symbols +, - of length *L*. We define the degree of contact Con(**σ**, **μ**) between **σ** and **μ** to be the number of positions at which the corresponding symbols are different. In other words,

[2] Con(**σ**, **μ**) = #{*l* ∈ {1,…,*L*} : σ(*l*) ≠ μ(*l*)}.

In a mild abuse of notation, we say that two disjoint sets *A*_{σ} and *A*_{μ} (or their elements) have degree of contact *j* if Con(**σ**, **μ**) = *j*. For example, *x* and *y* have degree of contact 1 if *x* ∈ *A*_{σ}, *y* ∈ *A*_{μ}, and **σ**, **μ** differ at precisely one symbol.
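Concretely, the degree of contact is the number of mismatched symbols (a Hamming distance between sign words). A minimal sketch in Python (the function name `contact` is ours, for illustration only):

```python
def contact(sigma, mu):
    """Degree of contact Con(sigma, mu) of Eq. 2: the number of positions
    at which two +/- words of equal length L differ."""
    assert len(sigma) == len(mu)
    return sum(1 for a, b in zip(sigma, mu) if a != b)
```

For instance, `contact("+-+", "+++")` equals 1, so the boxes indexed by these two words have degree of contact one.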

### Pseudorandom Orthogonal Transformations.

In this section, we describe a fast method (presented in refs. 6 and 7) for the generation of random orthogonal transformations and their application to arbitrary vectors.

Suppose that *d*, *M*_{1}, *M*_{2} > 0 are positive integers. We define a pseudorandom *d*-dimensional orthogonal transformation Θ as a composition of *M*_{1} + *M*_{2} + 1 linear operators

[3] Θ = (*Q*_{M₁+M₂} ∘ *P*_{M₁+M₂}) ∘ ⋯ ∘ (*Q*_{M₁+1} ∘ *P*_{M₁+1}) ∘ *F*^{(d)} ∘ (*Q*_{M₁} ∘ *P*_{M₁}) ∘ ⋯ ∘ (*Q*_{1} ∘ *P*_{1}).

The linear operators *P*_{j} with *j* = 1,…,*M*_{1} + *M*_{2} are defined in the following manner: We generate permutations *π*_{1},…,*π*_{M₁+M₂} of the numbers {1,…,*d*}, uniformly at random and independent of each other. Then for all *x* ∈ ℝ^{d}, we define *P*_{j}(*x*) by the formula

[4] *P*_{j}(*x*) = (*x*_{π_j(1)},…,*x*_{π_j(d)}).

In other words, *P*_{j} permutes the coordinates of the vector **x** according to *π*_{j}. *P*_{j} can be represented by a *d* × *d* matrix, defined by the formula

[5] (*P*_{j})_{k,l} = 1 if *l* = *π*_{j}(*k*), and 0 otherwise,

for *k*, *l* = 1,…,*d*. Obviously, the operators *P*_{j} are orthogonal.

The linear operators *Q*_{j} with *j* = 1,…,*M*_{1} + *M*_{2} are defined as follows: We construct (*d* - 1)·(*M*_{1} + *M*_{2}) independent pseudorandom numbers, *θ*_{j}(1),…,*θ*_{j}(*d* - 1) with *j* = 1,…,*M*_{1} + *M*_{2}, uniformly distributed in (0,2*π*). Then we define the auxiliary linear operator *Q*_{j,k} for *k* = 1,…,*d* - 1 to be the planar rotation of the coordinates *k* and *k* + 1 by the angle *θ*_{j}(*k*). In other words, for all *x* ∈ ℝ^{d},

[6] (*Q*_{j,k}(*x*))_{k} = cos(*θ*_{j}(*k*))·*x*_{k} + sin(*θ*_{j}(*k*))·*x*_{k+1},  (*Q*_{j,k}(*x*))_{k+1} = -sin(*θ*_{j}(*k*))·*x*_{k} + cos(*θ*_{j}(*k*))·*x*_{k+1},

and the rest of the coordinates of *Q*_{j,k}(**x**) coincide with those of **x**. We define *Q*_{j} by the formula

[7] *Q*_{j} = *Q*_{j,d-1} ∘ *Q*_{j,d-2} ∘ ⋯ ∘ *Q*_{j,1}.

Obviously, the operators *Q*_{j} are orthogonal.

*x*The linear operator is defined as follows: First suppose that *d* is even and that *d*_{2} = *d*/2. We define the *d*_{2} × *d*_{2} discrete Fourier transform matrix *T* by the formula [8]where *k*,*l* = 1,…,*d*_{2} and . The matrix *T* represents a unitary operator . We then define the one-to-one linear operator by the formula [9]for all . Eventually, we define *F*^{(d)} by the formula [10]for even *d*. If *d* is odd, we define *F*^{(d)}** x** for all by applying

*F*

^{(d-1)}to the first

*d*- 1 coordinates of

**and leaving its last coordinate unchanged. Obviously, the operators**

*x**T*,

*Z*, defined by refs.

**8**and

**9**, respectively, preserve the norm of any vector . Therefore,

*F*

^{(d)}is a real orthogonal transformation .

The cost of the generation of a random permutation (see, e.g., ref. 8) is *O*(*d*) operations. The cost of the application of each *P*_{j} to a vector is obviously *d* operations due to Eq. **4**.

The cost of generation of *d* - 1 uniform random variables is *O*(*d*) operations. Also, the cost of application of each *Q*_{j} to a vector is *O*(*d*) operations due to Eq. **7**.

Finally, the cost of the fast discrete Fourier transform is *O*(*d*· log *d*) operations, and the cost of the application of *F*^{(d)} to a vector is *O*(*d*· log *d*) operations due to Eqs. **8**–**10**.

Thus the cost of the generation of Θ defined via Eq. **3** is

[11] *O*(*d*·(*M*_{1} + *M*_{2}) + *d*·(log *d*)).

Moreover, the cost of application of Θ to a vector is also given by Eq. **11**.

It was observed that if *M*_{1} and *M*_{2} in Eq. **3** are chosen such that *M*_{1} + *M*_{2} ∼ log *d*, then the distribution of Θ is close to the uniform distribution on the group of orthogonal transformations ℝ^{d} → ℝ^{d}.
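The construction above can be sketched in a few lines of NumPy. The helper names, the random number generator, and the exact stage at which *F*^{(d)} is inserted into the composition are our choices for illustration; the sketch assumes *d* ≥ 2 and only aims to exhibit an orthogonal (norm-preserving) map built from permutations, Givens rotations, and a unitary FFT:

```python
import numpy as np

def pseudorandom_rotation(d, m, rng):
    """Pseudorandom orthogonal map on R^d: m stages, each a random
    coordinate permutation P_j followed by a chain of d-1 random planar
    (Givens) rotations Q_j, with the FFT-based orthogonal map F^(d)
    applied once in the middle.  Returns a function x -> Theta(x)."""
    perms = [rng.permutation(d) for _ in range(m)]
    angles = [rng.uniform(0.0, 2.0 * np.pi, d - 1) for _ in range(m)]

    def apply_f(x):
        # F^(d): pack coordinate pairs into complex numbers, apply a
        # unitary FFT of size d//2, unpack; odd d fixes the last entry.
        n = len(x) - (len(x) % 2)
        d2 = n // 2
        w = np.fft.fft(x[:d2] + 1j * x[d2:n]) / np.sqrt(d2)
        out = x.copy()
        out[:d2], out[d2:n] = w.real, w.imag
        return out

    def theta(x):
        x = np.asarray(x, dtype=float).copy()
        for j in range(m):
            x = x[perms[j]]                      # permutation P_j
            c, s = np.cos(angles[j]), np.sin(angles[j])
            for k in range(d - 1):               # Givens chain Q_j
                xk, xk1 = x[k], x[k + 1]
                x[k], x[k + 1] = c[k] * xk + s[k] * xk1, -s[k] * xk + c[k] * xk1
            if j == m // 2:                      # insert F^(d) once, mid-composition
                x = apply_f(x)
        return x

    return theta
```

Each application costs *O*(*d*·*m* + *d*· log *d*) operations, so with *m* ∼ log *d* the total is *O*(*d*· log *d*), in agreement with Eq. **11**.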

The use of the Hadamard matrix (without 2 × 2 rotations) appears in a related problem studied by Ailon and Liberty in ref. 9.

## Randomized Approximate Nearest Neighbors Algorithm

In this section, we describe the nearest neighbor problem and present a fast randomized algorithm for its solution.

### The Nearest Neighbor Problem.

Suppose that *d* and *k* < *N* are positive integers and suppose that

[12] *B* = {*x*_{1},…,*x*_{N}} ⊂ ℝ^{d}

is a collection of *N* points in ℝ^{d}. We are interested in finding the *k* nearest neighbors of each point *x*_{i}. The distance between the points is the standard Euclidean distance.

For each *x*_{i}, one can compute in a straightforward manner the distances to the rest of the points and thus find the nearest neighbors. However, the total cost of the evaluation of the distances alone is *O*(*d*·*N*^{2}), which makes this naive approach prohibitively expensive when *N* is large. We propose a faster approximate algorithm for the solution of this problem.
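For reference, the naive quadratic scan that the algorithm is designed to avoid can be written in a few lines of NumPy (a sketch; the name `knn_brute_force` is ours):

```python
import numpy as np

def knn_brute_force(points, k):
    """Indices of the k true nearest neighbors of each point, found by
    scanning all N^2 pairwise squared distances (O(d*N^2) work)."""
    sq = np.sum(points ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T  # ||x_i - x_j||^2
    np.fill_diagonal(d2, np.inf)      # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]
```

This direct scan is also what we use below, on small subsets *V*_{i}, once the candidate sets have been narrowed down.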

### Informal Description of the Algorithm.

#### Initial Selection.

The key idea of our algorithm is the following simple (and well known) observation: Suppose that for each *x*_{i} we have found a small subset *V*_{i} of *B* such that a point inside *V*_{i} is more likely to be among the *k* nearest neighbors of *x*_{i} than a point outside *V*_{i}. Then it is reasonable to look for the nearest neighbors of each *x*_{i} only inside *V*_{i} and not among all the points. The nearest neighbors of *x*_{i} in *V*_{i}, which can be found by direct scanning, are referred to as its suspected approximate nearest neighbors, or suspects, as opposed to the true nearest neighbors {*x*_{t(i,j)}}.

Of course, many of the *k* true nearest neighbors of *x*_{i} might not be among its suspects. However, one can reselect *V*_{i} to obtain another list of *k* suspects of *x*_{i}. The initial guess is improved by taking the “best” *k* points out of the two lists. This scheme is iterated to successively improve the list of suspects of each *x*_{i}.

The performance of the resulting iterative randomized algorithm admits the following crude analysis: Suppose that the size of *V*_{i} is *α*·*N*, with *α* ≪ 1. Suppose also that the number of the true nearest neighbors of *x*_{i} inside *V*_{i} is roughly *β*·*k*, with *α* < *β* < 1. If the choice of *V*_{i} is fairly random, then order *O*(1/*β*) iterations of the algorithm are required to find most of the true nearest neighbors of each *x*_{i}. Temporarily neglecting the cost of the construction of *V*_{i}, this results in *O*((*α*/*β*)·*d*·*N*^{2}) operations instead of *O*(*d*·*N*^{2}) operations for the naive algorithm. If *α* ≪ *β*, the improvement can be substantial.

Our construction of the *V*_{i}’s is based on geometric considerations. First, we shift all of the points to place their center of mass at the origin and apply a random orthogonal linear transformation on the resulting collection. Then, we choose a real number *y*(1) such that the first coordinate of half of the points in *B* is less than *y*(1). In other words,

[13] #{*x* ∈ *B* : *x*(1) < *y*(1)} = ⌊*N*/2⌋.

We divide all the points into two disjoint sets

[14] *B*_{-} = {*x* ∈ *B* : *x*(1) < *y*(1)},  *B*_{+} = {*x* ∈ *B* : *x*(1) ≥ *y*(1)}.

Obviously, the sizes of *B*_{-} and *B*_{+} are the same if *N* is even and differ by one if *N* is odd. Next, we set *y*_{+}(2) to be a real number such that the second coordinate of half of the points in *B*_{+} is less than *y*_{+}(2), i.e.,

[15] #{*x* ∈ *B*_{+} : *x*(2) < *y*_{+}(2)} = ⌊#*B*_{+}/2⌋.

We split *B*_{+} into two disjoint sets *B*_{+-} and *B*_{++} by the same principle, e.g.,

[16] *B*_{+-} = {*x* ∈ *B*_{+} : *x*(2) < *y*_{+}(2)},  *B*_{++} = {*x* ∈ *B*_{+} : *x*(2) ≥ *y*_{+}(2)}.

We construct *B*_{--} and *B*_{-+} in a similar fashion by using a real number *y*_{-}(2) such that the second coordinate of half of the points in *B*_{-} is less than *y*_{-}(2), i.e.,

[17] #{*x* ∈ *B*_{-} : *x*(2) < *y*_{-}(2)} = ⌊#*B*_{-}/2⌋.

Each of the four boxes *B*_{--}, *B*_{-+}, *B*_{+-}, *B*_{++} contains either ⌊*N*/4⌋ or ⌊*N*/4⌋ + 1 points. Then we repeat the subdivision by splitting each of the four boxes into two by using the third coordinate, and so on. We proceed until we end up with a collection of 2^{L} boxes {*B*_{σ}} with between *k* and 2*k* points in each box. The indices **σ** are words of symbols +,- of length *L* (see Eq. **1**), where *L* is a positive integer defined via the inequality

[18] *k*·2^{L} ≤ *N* < *k*·2^{L+1}.

Obviously, the sets {*B*_{μ}} constitute a complete binary tree of length *L*, whose nodes are indexed by words **μ** of symbols +,- of length up to *L*. The set *B* is at the root of this tree, the sets *B*_{-} and *B*_{+} are at the second level, and so on. At the last level of the tree, there are 2^{L} boxes. The depth of the tree equals *L* + 1.

The notion of degree of contact (Eq. **2**) extends to the collection {*B*_{σ}} of boxes. Suppose that *x*_{i} is in *B*_{σ}. Obviously, the higher the degree of contact of two boxes *B*_{σ} and *B*_{μ} is, the less likely a point of *B*_{μ} is to be among the *k* nearest neighbors of *x*_{i}. Motivated by this observation, we define the set *V*_{i} as

[19] *V*_{i} = *B*_{σ} ∪ ⋃_{Con(σ,μ)=1} *B*_{μ}.

In other words, *V*_{i} is the union of the box *B*_{σ} containing *x*_{i} and the *L* boxes whose degree of contact with *B*_{σ} is one. Thus for each *i* = 1,…,*N*, the set *V*_{i} contains about *k*·(*L* + 1) points.
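The subdivision and the selection of *V*_{i} can be sketched as follows: each point receives a word of length *L* (one symbol per level of the tree, here encoded as 0/1 instead of -/+), and *V*_{i} collects the points whose words are at degree of contact at most one from that of *x*_{i}. This is an illustration only; the function names, and the choice to cycle through coordinates in case *L* exceeds *d*, are ours:

```python
import numpy as np

def build_boxes(points, k):
    """Assign each point a +/- word of length L, encoded as 0/1, where L
    satisfies k*2^L <= N < k*2^(L+1) (Eq. 18), by recursive median
    splits on successive coordinates.  Returns an N x L code array."""
    n, d = points.shape
    L = 0
    while k * 2 ** (L + 1) <= n:
        L += 1
    codes = np.zeros((n, L), dtype=int)
    cells = [np.arange(n)]                      # point indices in each box
    for level in range(L):
        coord = level % d                       # cycle if L exceeds d
        new_cells = []
        for idx in cells:
            med = np.median(points[idx, coord])
            plus = points[idx, coord] >= med    # '+' half of the split
            codes[idx[plus], level] = 1
            new_cells.extend([idx[~plus], idx[plus]])
        cells = new_cells
    return codes

def neighborhood(codes, i):
    """Indices of V_i (Eq. 19): points whose box has degree of contact
    zero or one with the box containing point i."""
    return np.where(np.sum(codes != codes[i], axis=1) <= 1)[0]
```

For *N* = 64 Gaussian points and *k* = 8, for instance, this yields *L* = 3, leaf boxes of 8 points each, and sets *V*_{i} of 8·(3 + 1) = 32 points.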

#### Supercharging.

In the previous subsections, we have described an iterative scheme for the selection of suspects (suspected approximate nearest neighbors) for each of the points *x*_{i} in *B*. Suppose now that after *T* iterations of this scheme, the list *x*_{s(i,1)},…,*x*_{s(i,k)} of *k* suspects of each point *x*_{i} has been generated. This list can be improved by a procedure we call supercharging.

The idea of supercharging is based on the following observation: A true nearest neighbor of *x*_{i}, missed by the scheme described above, might be among the suspects of one of *x*_{s(i,1)},…,*x*_{s(i,k)}. This leads to the following obvious procedure.

For each *x*_{i}, we denote by *A*_{i} the list of suspects of all of *x*_{s(i,1)},…,*x*_{s(i,k)}. *A*_{i} contains roughly *k*^{2} points, with possible repetitions. We compute the square of the distances from *x*_{i} to each point in *A*_{i} and find the *k* nearest neighbors *x*_{t(i,1,A_i)},…,*x*_{t(i,k,A_i)} of *x*_{i} in *A*_{i}. Then we declare the (updated) suspects of *x*_{i} to be the best *k* points out of the two lists {*x*_{s(i,j)}} and {*x*_{t(i,j,A_i)}}.

In other words, supercharging is a depth one search on the directed graph, whose vertices are the points {*x*_{i}} and whose *N* × *k* adjacency matrix is the suspects’ indices {*s*(*i*,*j*)}, with *i* = 1,…,*N* and *j* = 1,…,*k*.
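Supercharging as a depth-one search can be sketched as follows (NumPy; the function name and array layout are our choices, and the sketch assumes the *k* suspects of each point are distinct and exclude the point itself):

```python
import numpy as np

def supercharge(points, suspects):
    """Depth-one search on the suspects graph: for each point i, pool its
    own k suspects with the suspects of those suspects (about k + k^2
    candidates) and keep the k closest distinct points."""
    n, k = suspects.shape
    out = np.empty_like(suspects)
    for i in range(n):
        cand = np.unique(np.concatenate((suspects[i],
                                         suspects[suspects[i]].ravel())))
        cand = cand[cand != i]                  # a point is not its own neighbor
        d2 = np.sum((points[cand] - points[i]) ** 2, axis=1)
        out[i] = cand[np.argsort(d2)[:k]]       # best k of the candidates
    return out
```

Since the suspects of *x*_{i} are themselves among the candidates, the update can only improve (never worsen) the list, at the *O*(*k*^{2}) per-point cost discussed above.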

#### Overview.

We conclude this section with a list of the principal steps of the algorithm. Given the collection of points *x*_{1},…,*x*_{N} in ℝ^{d}, we perform the following operations:

1. Subtract from each *x*_{i} the center of mass of the collection.
2. Choose a random orthogonal linear transformation Θ, and replace *x*_{i} with Θ(*x*_{i}) for all *i* = 1,…,*N*.
3. Construct 2^{L} boxes {*B*_{σ}} as described above.
4. For each *x*_{i}, define the set *V*_{i} via Eq. **19**.
5. Update the suspects *x*_{s(i,1)},…,*x*_{s(i,k)} of *x*_{i} by using its true nearest neighbors in *V*_{i}.
6. Repeat steps 2–5 *T* times.
7. For each *x*_{i}, perform supercharging.

#### Query for a new point.

Suppose that we are given a new point *y* ∈ ℝ^{d}, and we need to find its *k* nearest neighbors in *B* = {*x*_{1},…,*x*_{N}}. In this subsection, we describe a rapid procedure to find *k* approximate nearest neighbors of **y**. This procedure uses the following information, available on the *j*th iteration of the algorithm, for *j* = 1,…,*T*:

- The orthogonal linear transformation Θ^{(j)}, generated on the *j*th iteration of the algorithm.
- The collection of boxes {*B*_{σ}^{(j)}}, generated on the *j*th iteration of the algorithm.

To find *k* approximate nearest neighbors of the new point **y** among the points of *B*, we perform the following operations: First, we apply Θ^{(1)} on **y**, where Θ^{(1)} is the orthogonal linear transformation of the first iteration of the algorithm. The resulting vector is denoted by *y*^{(1)}; in other words,

[20] *y*^{(1)} = Θ^{(1)}(*y*).

Next, in the collection of boxes {*B*_{σ}^{(1)}}, generated on the first iteration of the algorithm, we find the box that has degree of contact zero with *y*^{(1)}. Note that if **y** had belonged to *B* in the first place, then on the first iteration of the algorithm *y*^{(1)} would have belonged to this box.

This box contains between *k* and 2*k* points. Suppose that *A*^{(1)} is the union of the boxes having degree of contact zero or one with it (see Eq. **19**). Recall that *A*^{(1)} has about (*L* + 1)·*k* points, where *L* is defined via formula **18**.

We define the set *A*^{(2)} in a similar manner, by using the data of the second iteration of the algorithm. We apply the orthogonal transformation Θ^{(2)} of the second iteration on *y*^{(1)} to obtain *y*^{(2)}, i.e.,

[21] *y*^{(2)} = Θ^{(2)}(*y*^{(1)}) = Θ^{(2)}(Θ^{(1)}(*y*)),

due to Eq. **20**. In the boxes of the second iteration, we find the box having degree of contact zero with *y*^{(2)}. Similar to *A*^{(1)}, the set *A*^{(2)} has about (*L* + 1)·*k* points.

We repeat this procedure to construct the sets *A*^{(j)} for *j* = 3,4,…,*T*, where *T* is the number of the iterations of the algorithm. Each *A*^{(j)} contains roughly (*L* + 1)·*k* points.

Finally, we define the set *A* to be the union of all the sets *A*^{(j)}; in other words,

[22] *A* = *A*^{(1)} ∪ *A*^{(2)} ∪ ⋯ ∪ *A*^{(T)}.

The set *A* contains about *T*·*k*·(*L* + 1) points. The *k* nearest neighbors of **y** inside *A* are declared to be the approximate nearest neighbors of **y** inside *B*. We note that to construct *A* we need to store the corresponding data on each iteration of the algorithm.

Once the *k* suspects of **y** in *B* have been found, they can be improved by performing supercharging on **y** only.

*y*### Cost Analysis.

In this subsection, we analyze the cost of the algorithm in terms of number of operations. Also, we analyze the memory requirements of the algorithm. We recall that *x*_{1},…,*x*_{N} is a collection of *N* points in ℝ^{d} and *N* ≈ *k*·2^{L}.

- Centralizing the points costs *O*(*N*·*d*).
- Generating a pseudorandom orthogonal transformation and applying it to *N* points costs *O*(*N*·*d*·(log *d*)) (see Eq. **11**).
- Constructing the binary tree of boxes of depth *L* = log_{2}(*N*/*k*) costs *O*(*N*·*L*).
- The suspects of each point *x*_{i} are selected by computing the distances from *x*_{i} to the *k*·(*L* + 1) points in *V*_{i} (see Eq. **19**) and finding the *k* minimal distances. Thus the cost of this step is *O*(*N*·*L*·*k*·(*d* + log *k*)).
- In supercharging, about *k*^{2} points are scanned to update the suspects of each point. Thus the cost of supercharging is *O*(*N*·*k*^{2}·(*d* + log *k*)).

We conclude that the total cost of the algorithm is

[23] *O*(*T*·*N*·(*d*·(log *d*) + *L*·*k*·(*d* + log *k*)) + *N*·*k*^{2}·(*d* + log *k*)),

where *T* is the number of iterations and *L* = log_{2}(*N*/*k*). We observe that for fixed dimension *d* and number of required nearest neighbors *k*, the cost is *O*(*T*·*N*· log *N*), as opposed to the *O*(*N*^{2}) of the naive approach. Also, the cost of supercharging is quadratic in the number of nearest neighbors for fixed dimension *d* and number of points *N*, which makes supercharging expensive relative to a single iteration of the principal part of the algorithm even for moderate *k*.

A query for a new point **y** consists of performing all the steps of the algorithm on one point only. Consequently, because of Eq. **23**, the total cost of a query for a new point is

[24] *O*(*T*·(*d*·(log *d*) + log(*N*/*k*)·*k*·(*d* + log *k*))).

To determine the memory requirements of the algorithm, we observe that to store *N* points in ℝ^{d} we need *O*(*N*·*d*) memory, to store the indices of the *k* nearest neighbors of each of the *N* points we need *O*(*N*·*k*) memory, and to store a tree of boxes (with the corresponding orthogonal transformation) we need *O*(*N*) memory.

We must distinguish between two cases. In the first case, given *N* points {*x*_{i}} in , we are interested in finding *k* nearest neighbors for each *x*_{i} only. In other words, no query for a new point will ever be requested. Then, the tree of boxes can be destroyed after each iteration. Hence, in this case the algorithm uses *O*(*N*·(*d* + *k*)) memory. In other words, in this case the memory requirements are minimal, in the sense that most of the memory is spent on the storage of input and output of the algorithm only.

In the second case, we know in advance that queries for new points will be requested. Therefore we need to store *T* trees of boxes and the corresponding orthogonal transformations. Thus, when queries for new points are allowed, the total memory requirements are of the order

[25] *O*(*N*·(*d* + *k*) + *T*·(*d* + *N*)).

### Performance Analysis.

In this subsection, we discuss the performance analysis of the randomized approximate nearest neighbor algorithm in the case of standard Gaussian distribution of the points.

To be more specific, we consider the collection of *N* independent standard Gaussian *d*-dimensional random vectors *x*_{1},…,*x*_{N}, where the number of points is given by the formula

[26] *N* = *k*·2^{L},

for some positive integer *L* > 0. We recall that for each point *x*_{i}, the algorithm approximates the *k* true nearest neighbors {*x*_{t(i,j)}} of *x*_{i} by *k* suspects {*x*_{s(i,j)}}. In order to analyze the quality of this approximation, we introduce a number of statistical quantities. First, we define the average square of the distance from *x*_{i} to its *k* true nearest neighbors by the formula

[27] (*d*_{i}^{true})^{2} = (1/*k*)·∑_{j=1}^{k} ‖*x*_{i} - *x*_{t(i,j)}‖^{2}.

Next, we define the average square of the distance from *x*_{i} to its *k* suspects by the formula

[28] (*d*_{i}^{algo})^{2} = (1/*k*)·∑_{j=1}^{k} ‖*x*_{i} - *x*_{s(i,j)}‖^{2}.

Finally, we define the proportion of the true nearest neighbors of *x*_{i} among its suspects by the formula

[29] prop_{i} = #({*t*(*i*,1),…,*t*(*i*,*k*)} ∩ {*s*(*i*,1),…,*s*(*i*,*k*)})/*k*.

We analyze a single iteration of the algorithm (selection of suspects without supercharging) by evaluating the expected values of Eqs. **27**–**29** numerically (see Theorems 1, 2, and 3 below). On the other hand, the performance of the algorithm can be studied empirically by running the algorithm on artificially generated sets of points.
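Given the index lists of true neighbors and suspects, the three statistics can be computed directly (a NumPy sketch with our own naming):

```python
import numpy as np

def nn_statistics(points, true_nn, suspects):
    """Per-point statistics of Eqs. 27-29: average squared distance to
    the k true neighbors, the same for the k suspects, and the fraction
    of true neighbors recovered among the suspects."""
    n, k = true_nn.shape
    d_true = np.array([np.mean(np.sum((points[t] - points[i]) ** 2, axis=1))
                       for i, t in enumerate(true_nn)])
    d_algo = np.array([np.mean(np.sum((points[s] - points[i]) ** 2, axis=1))
                       for i, s in enumerate(suspects)])
    prop = np.array([len(set(t) & set(s)) / k
                     for t, s in zip(true_nn, suspects)])
    return d_true, d_algo, prop
```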

In the rest of this subsection, we outline the results of the analysis (see ref. 10 for details). We start by introducing some notation.

Suppose that *d* > 0 is a positive integer, and *λ*,*α*,*β* > 0 are positive real numbers. The distributions *χ*^{2}(*d*), *χ*^{2}(*d*,*λ*), and *B*(*α*,*β*) (the chi-square, noncentral chi-square, and beta distributions, respectively) are defined by their probability density functions, given respectively via the formulas

[30] *f*_{χ²(d)}(*x*) = *x*^{d/2-1}·*e*^{-x/2}/(2^{d/2}·Γ(*d*/2)),

[31] *f*_{χ²(d,λ)}(*x*) = (1/2)·*e*^{-(x+λ)/2}·(*x*/*λ*)^{d/4-1/2}·*I*_{d/2-1}(√(*λ*·*x*)),

where *I*_{ν} denotes the modified Bessel function of the first kind, and

[32] *f*_{B(α,β)}(*x*) = *x*^{α-1}·(1 - *x*)^{β-1}/*B*(*α*,*β*).

Suppose that *a* > 0 is a positive real number. Suppose also that *d* ≥ *L* > 0 are positive integers, and *y* ∈ ℝ^{L} is a vector all of whose coordinates are positive. We define three auxiliary functions via the formulas [33]–[35], and the function *G*_{a} via the formula [36] (the explicit expressions may be found in ref. 10). The following theorem provides an analytical formula for the expectation of (*d*^{true})^{2} (see Eq. **27**):

**Theorem 1.** Suppose that *d*,*k*,*N* > 0 are positive integers. Suppose further that (*d*^{true})^{2} is defined by Eq. **27**. Then its expectation is given by the formula [37], where the functions *f*_{χ²(d)} and *f*_{B(i,N-i)} are defined respectively by Eqs. **30** and **32**, and the remaining function is the inverse of the cumulative distribution function (cdf) of *χ*^{2}(*d*,*λ*) (see Eq. **31**).

The following theorem provides an approximation to the expectation of (*d*^{algo})^{2} (see Eq. **28**). The error of this approximation is verified via numerical experiments (see ref. 10 for more details).

**Theorem 2.** Suppose that *k* > 0 and *d* ≥ *L* > 0 are positive integers, and *N* is a positive integer defined via Eq. **26**. We define the real number (*d*^{appr})^{2} by the formula [38], where the first function is defined via Eq. **30**, the second is the inverse of *G*_{a} defined via Eq. **36**, the set in question is a subset of the *d*-dimensional hypersphere of a certain radius, defined by the formula [39], and the average (Avg) is taken with respect to the (*d* - 1)-dimensional area. Then, [40]. In other words, Eq. **40** holds if we fix *d*,*L* and let *k* → ∞.

The following theorem provides an approximation to the expectation of the proportion prop (see Eq. **29**). The error of this approximation is verified via numerical experiments (see ref. 10 for more details).

**Theorem 3.** Suppose that *k* > 0 and *d* ≥ *L* > 0 are positive integers, and *N* is a positive integer defined via Eq. **26**. We define the real number *P*_{appr} by the formula [41], where the first function is defined via Eq. **30**, the second is the inverse of the cdf of *χ*^{2}(*d*,*λ*), defined via Eq. **31**, the function *G*_{a} is defined via Eq. **36**, the set in question is defined via Eq. **39**, and the Avg is taken with respect to the (*d* - 1)-dimensional area. Then, [42], where the real random variable prop_{N} is defined via Eq. **29**. In other words, Eq. **42** holds if we fix *d*,*L* and let *k* → ∞.

## Numerical Results

This section has two principal purposes. First, we numerically evaluate the expectations of Eqs. **27**–**29** by using Theorems 1–3 above. Second, we demonstrate the performance of the algorithm empirically by running it on sets of points, generated according to the Gaussian distribution. The choice of uniform distribution on the hypercube [0,1]^{d} or the Hamming cube {0,1}^{d} instead of Gaussian results in very similar performance (see ref. 10 for results and details).

The algorithm has been implemented in FORTRAN (Lahey 95 Linux version). The numerical experiments have been carried out on a Lenovo laptop computer, with DualCore CPU 2.53 GHz and 2.9GB RAM.

### Experiment 1.

In this experiment, we choose *k* = 30, and for *L* = 10, 15, 20, 25 set *N* via Eq. **26**. Then, for different values of *d* between *L* and 110, we approximate the expectations of Eqs. **27**–**29** via the numerical evaluation of Eqs. **37**, **38**, and **41**, respectively. The approximations are denoted by *d*_{num}^{true}, *d*_{num}^{algo}, *P*_{num}, respectively, and are accurate to roughly two decimal digits. Also, we compute the ratio [43] of *d*_{num}^{algo} to *d*_{num}^{true}. The results of experiment 1 are shown in Fig. 1. In Fig. 1 (*Left*), we plot this ratio as a function of the dimensionality *d*, for each value of *L*. In Fig. 1 (*Right*), we plot *P*_{num} as a function of the dimensionality *d*, for each value of *L* (the number of points *N* is determined via Eq. **26**). This quantity estimates the average proportion of the true nearest neighbors among the suspects (found by one iteration of RANN without supercharging).

Several observations can be made from experiment 1 (see Fig. 1) and from more detailed experiments by the authors.

- The ratio defined via Eq. **43** slowly increases with the number of points *N*, for fixed values of the number *k* of nearest neighbors and dimensionality *d* (see Fig. 1, *Left*). In other words, the performance of RANN deteriorates with the number of points, in terms of this ratio. However, as the number of points is multiplied by 2, this ratio grows by a factor of less than 1.1 for most values of *d* in Fig. 1 (*Left*).
- The ratio actually decreases with dimensionality *d*, for fixed *k* and *N* (see Fig. 1, *Left*). For example, for *N* ≈ 10^{6}, this ratio decays from the value of about 1.57 at *d* = 20 to roughly 1.3 at *d* = 110. This observation is not surprising, because the number of points within the same fixed relative distance (e.g., twice the distance to true nearest neighbors) grows with the dimensionality.
- The proportion of true nearest neighbors among suspects decays with *d* for fixed *k*, *N*, as expected (see Fig. 1, *Right*). However, even for *d* = 40 and *N* ≈ 10^{6}, RANN correctly finds about 2.7% of true nearest neighbors on merely one iteration without supercharging. In other words, after 50 iterations without supercharging RANN will correctly detect about [44] 1 - (1 - 0.027)^{50} ≈ 75% of the true nearest neighbors, for *d* = 40 and *N* ≈ 10^{6} (see also experiment 2 below).

### Experiment 2.

In this experiment, we run RANN with various parameters on different sets of points and report on the resulting statistics.

We choose the dimensionality *d* and the number of points *N*. Then we choose the number of nearest neighbors *k*, the number of iterations of the algorithm *T* > 0, and whether to perform the supercharging or not. Next, we generate *N* independent and identically distributed standard Gaussian vectors *x*_{1},…,*x*_{N} in .

We run RANN on *x*_{1},…,*x*_{N}. For each *x*_{i}, RANN finds its *k* suspects, *x*_{s(i,1)},…,*x*_{s(i,k)}. Also, for all *i* = 1,…,2,000, we find the list *x*_{t(i,1)},…,*x*_{t(i,k)} of *k* true nearest neighbors of *x*_{i}, by direct scanning. Then, we compute the quantities (*d*_{i}^{true})^{2}, (*d*_{i}^{algo})^{2}, prop_{i}, defined via Eqs. **27**–**29**, respectively, for *i* = 1,…,2,000. We define the average *d*_{true} by the formula [45]. The averages *d*_{algo}, prop_{algo} are defined and computed by the same token.

Generation of the points and invocation of RANN are repeated 20 times, to obtain 20 values of each of the averages defined via Eq. **45**. Then, we define the sample mean of *d*_{true} via the formula [46]. The sample means of *d*_{algo} and prop_{algo} are defined in a similar way. Finally, we compute the ratio [47] ratio_{smpl} of the sample means of *d*_{algo} and *d*_{true}. The parameters for experiment 2 were chosen in the following way: The number of points was *N* = 30·2^{12} ≈ 10^{5}. The number of requested nearest neighbors was *k* = 15, 30, or 60. The dimensionality *d* was chosen to be a multiple of 5 between 15 and 200. The number of iterations of the algorithm was either *T* = 1 or *T* = 10. The supercharging step was either skipped or performed once after *T* iterations.

Most of the approximations have been computed with relative error up to 2%. The results are shown in Figs. 2 and 3 and in Table 1. In Figs. 2 and 3 (*Left*), we plot ratio_{smpl} (see Eq. **47**) as a function of the dimensionality *d* for *k* = 15 and *k* = 60, respectively. In Figs. 2 and 3 (*Right*), we plot the sample mean of prop_{algo} as a function of the dimensionality *d* for *k* = 15 and *k* = 60, respectively. Each figure contains four curves, corresponding to one iteration without supercharging, one iteration with supercharging, 10 iterations without supercharging, and 10 iterations with supercharging.

Table 1 has the following structure: The first column contains the dimensionality *d*. The next three columns contain the running time of 10 iterations of RANN without supercharging, with the requested number of nearest neighbors being *k* = 15, 30, 60, respectively. The last three columns contain the running time of the supercharging only (after 10 iterations of RANN), with the requested number of nearest neighbors being *k* = 15, 30, and 60, respectively. The running time is shown in seconds.

Several observations can be made from Table 1, Figs. 2 and 3, and more detailed experiments by the authors. In these observations, we refer to ratio_{smpl} as "ratio" and to as "proportion." We recall that, roughly speaking, the proportion measures how many of the true nearest neighbors have been found by RANN, while the ratio measures how much the average distances to the suspects differ from the average distances to the true nearest neighbors.

- As expected, for a fixed *d*, the performance of RANN improves as the number *T* of iterations increases, both in terms of ratio and proportion.
- As expected, for a fixed *d*, the performance of RANN improves if supercharging is performed, both in terms of ratio and proportion.
- For a fixed *d*, the performance of RANN improves as the number of requested nearest neighbors *k* increases, both in terms of ratio and proportion (at the expense of running time).
- For a fixed *d*, the effects of supercharging (especially on proportion) increase as *k* grows. For example, in Fig. 2 (*Right*) (*k* = 15) we observe that, for *T* = 10 and *d* = 60, supercharging increases the proportion from 22% to 32%. On the other hand, in Fig. 3 (*Right*) (*k* = 60) we observe that, for *T* = 10 and *d* = 60, supercharging increases the proportion from 43% to 74%.
- As expected, the performance of RANN slowly deteriorates in terms of proportion as *d* increases. On the other hand, there is no significant deterioration of performance in terms of ratio as *d* increases.
- For *T* = 10, the ratio is always below 1.1, with or without supercharging.
- Even for a dimension as high as *d* = 60, as few as 10 iterations with supercharging correctly determine at least 30% of the true nearest neighbors. Moreover, the error of this detection decays exponentially with the number of iterations *T*.
- The running time of RANN, with or without supercharging, grows roughly linearly with the dimensionality *d*, as expected from Eq. **23** (see Table 1).
- For fixed dimensionality *d*, the running time of RANN without supercharging grows roughly linearly with the requested number of nearest neighbors *k*, as expected from Eq. **23** (see Table 1). For example, 10 iterations of RANN in the case of *d* = 200 take about 80, 124, and 208 s for *k* = 15, 30, and 60, respectively.
- For fixed dimensionality *d*, the running time of the supercharging step grows roughly quadratically with the requested number of nearest neighbors *k*, as expected from Eq. **23** (see Table 1). For example, in the case of *d* = 200, supercharging takes about 15, 57, and 235 s for *k* = 15, 30, and 60, respectively.
- The algorithm has been tested on sets of points having the uniform distribution in the *d*-dimensional cube [0,1]^{d} or the uniform distribution on the Hamming cube {0,1}^{d}. In both cases, the performance of the algorithm was very similar to that in the Gaussian case (see ref. 10 for details).
- The algorithm has been tested on sets of points having a Gaussian distribution whose covariance matrix is not the identity matrix. These tests seem to indicate that the performance of RANN improves as the condition number of the covariance matrix grows. As an extreme example, if the points belong to a *q*-dimensional linear subspace of ℝ^{d}, the performance of the algorithm does not depend on *d* (though the running time obviously does).
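The supercharging step referred to throughout these observations can be sketched as follows: for each point, rescan the union of its current suspects and its suspects' own suspects, keeping the *k* closest. This is a schematic NumPy illustration of the local graph search under the assumption that suspect lists are stored as an (N, k) index array; it is not the authors' implementation.

```python
import numpy as np

def supercharge(x, suspects):
    """One round of supercharging (local graph search).

    x : (N, d) points; suspects : (N, k) index array of current suspect
    lists.  For each point, examines its suspects and its suspects'
    suspects, and returns the k closest of those candidates.
    """
    N, k = suspects.shape
    out = np.empty_like(suspects)
    for i in range(N):
        # Candidate pool: current suspects plus their own suspect lists.
        cand = np.unique(np.concatenate(
            [suspects[i], suspects[suspects[i]].ravel()]))
        cand = cand[cand != i]                      # a point is not its own neighbor
        d = np.linalg.norm(x[cand] - x[i], axis=1)  # distances to all candidates
        out[i] = cand[np.argsort(d)[:k]]            # keep the k closest
    return out
```

Each point examines at most k + k² candidates at O(*d*) cost apiece plus a sort, which is consistent with the *N*·*k*^{2}·(*d* + log *k*) supercharging term in the running-time estimate, and with the roughly quadratic growth in *k* observed in Table 1.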

## Acknowledgments

This work was partially supported by Division of Mathematical Sciences Grant 0802635, Office of Naval Research Grants N000140910108, N000140910340, and N00014-10-1-0570, and Air Force Office of Scientific Research Grant FA9550-09-1-0241.

## Footnotes

^{1}To whom correspondence should be addressed. E-mail: peter.jones{at}yale.edu.

Author contributions: P.W.J., A.O., and V.R. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected in 2008.

The authors declare no conflict of interest.

## References

1. Andoni A, Indyk P
2.
3.
4. Ailon N, Chazelle B
5. Johnson W, Lindenstrauss J
6. Rokhlin V, Tygert M
7. Liberty E, Woolfe F, Martinsson PG, Rokhlin V, Tygert M
8. Knuth D
9. Ailon N, Liberty E
10. Jones PW, Osipov A, Rokhlin V

## Article Classifications

- Physical Sciences
- Applied Mathematics