Colloquium Paper

Improving massive experiments with threshold blocking

Michael J. Higgins, Fredrik Sävje, and Jasjeet S. Sekhon
PNAS July 5, 2016 113 (27) 7369-7376; first published July 5, 2016; https://doi.org/10.1073/pnas.1510504113
Michael J. Higgins
aDepartment of Statistics, Kansas State University, Manhattan, KS 66506;
Fredrik Sävje
bDepartment of Economics, Uppsala University, SE-751 85 Uppsala, Sweden;
Jasjeet S. Sekhon
cDepartment of Political Science, University of California, Berkeley, CA 94720;
dDepartment of Statistics, University of California, Berkeley, CA 94720
For correspondence: sekhon@berkeley.edu
Edited by Richard M. Shiffrin, Indiana University, Bloomington, IN, and approved December 9, 2015 (received for review June 15, 2015)


Abstract

Inferences from randomized experiments can be improved by blocking: assigning treatment in fixed proportions within groups of similar units. However, the use of the method is limited by the difficulty in deriving these groups. Current blocking methods are restricted to special cases or run in exponential time; are not sensitive to clustering of data points; and are often heuristic, providing an unsatisfactory solution in many common instances. We present an algorithm that implements a widely applicable class of blocking—threshold blocking—that solves these problems. Given a minimum required group size and a distance metric, we study the blocking problem of minimizing the maximum distance between any two units within the same group. We prove this is a nondeterministic polynomial-time hard problem and derive an approximation algorithm that yields a blocking where the maximum distance is guaranteed to be, at most, four times the optimal value. This algorithm runs in O(n log n) time with O(n) space complexity. This makes it, to our knowledge, the first blocking method with an ensured level of performance that works in massive experiments. Whereas many commonly used algorithms form pairs of units, our algorithm constructs the groups flexibly for any chosen minimum size. This facilitates complex experiments with several treatment arms and clustered data. A simulation study demonstrates the efficiency and efficacy of the algorithm; tens of millions of units can be blocked using a desktop computer in a few minutes.

  • experimental design
  • blocking
  • big data
  • causal inference

Properly executed experiments with random assignment guarantee that estimated treatment effects are equal to the true causal effects of interest in expectation. However, only one assignment is realized for a particular experiment, and there could be chance differences between treatment and control groups that muddle any comparison. Indicative of such differences are imbalances in observed baseline characteristics between the treatment groups. For example, in a medical study on the effect that a drug has on life expectancy, it may occur by chance that the control group is older and sicker than the treatment group. Whenever imbalances in prognostically important covariates are observed, there is reason to suspect that the resulting estimates are inaccurate. Studies that do not attend to this issue cannot be considered to follow the gold standard of randomized experiments (1): Viewed before assignment, investigators allowed for unnecessarily high variance; viewed after assignment, they allowed the estimator to be biased conditional on the observed distribution of covariates.

Since R. A. Fisher’s canonical treatment (2), “blocking” has been the default experimental design to deal with this problem. With this design, the investigator forms groups of units, or “blocks,” that are as similar as possible. Treatments are then randomly assigned in fixed proportions within blocks and independently across them. This prevents imbalances in observed covariates, which can increase precision if these covariates are predictive of outcomes.

Unadjusted estimates for even massive experiments are often too variable to enable reliable inferences because the effects of interest may be small and distributional issues may result in surprisingly large variances. A prominent case is A/B testing of the effectiveness of online advertising (3). The effects of the advertisements are generally very small (although economically relevant due to the low costs), and consumers’ behaviors tend to follow distributions with fat tails (4).

Moreover, with the rise of massive data, researchers, policy makers, and industry leaders have become increasingly interested in making fine-grained inferences and targeting treatments to subgroups (5). The recent focus on personalized and precision medicine is a noteworthy example (6). Even with large experiments, subgroups of interest often lack data because of the vagaries of random assignment and the curse of dimensionality. Blocking enables researchers to define the subgroups of interest ex ante. This ensures that there will be sufficient data to make fine-grained inferences.

Finally, because blocking adjusts for covariates in the design of the study, it limits both the need for and the effect of adjusting the experiment ex post. Such adjustments often lead to incorrect test levels if investigators specify models based on the observed treatment assignments (7), or if they pick models based on test results—a habit that appears to be prevalent (8).

In short, blocking is an essential tool for experimental design. It enables one to follow Fisher’s advice that Nature should be asked one question at a time, which is the central motivation for random assignment in the first place (9).

Despite its provenance, usefulness, and wide applicability, there are many situations where an effective blocking design is desirable but where none is possible or feasible. In particular, current blocking algorithms have primarily focused on the special case where blocks with exactly two units are desired, the so-called “matched-pair design” (10). There exist optimal, polynomial time algorithms for this design, such as nonbipartite matching (11), but they are limited to experiments with only two treatment conditions. Although there exist heuristic algorithms that can facilitate larger block sizes, their theoretical properties are unknown, and their performance has not been fully evaluated (12). In many cases, even with relatively modest samples and considerable computational power, several years would be required to obtain results using algorithms with a proven level of optimality. For the largest of experiments, existing algorithms are too computationally demanding even for the matched-pair design.

In this paper, we introduce the first algorithm, to our knowledge, that produces guaranteed near-optimal blockings for any desired block size. Specifically, we consider the blocking problem where one wants to minimize the greatest within-block dissimilarity, as measured by an arbitrary distance metric, subject to a minimum required block size. We prove that this problem is nondeterministic polynomial-time hard (NP-hard), and we provide an approximation algorithm that, in the worst case, produces a solution that is at most four times the optimum. The algorithm uses computational resources very efficiently: It is guaranteed to terminate in linearithmic time with linear space complexity. This makes it applicable in many cases where existing algorithms are impractical, including experiments with large samples or multiarmed treatment schemes.

In addition to settings with large data, our approximation algorithm is likely to perform well in traditional, smaller experiments, not least when designs other than matched pairs are desired. Our formulation of the blocking problem, threshold blocking, differs from the past literature in that it allows for some flexibility in the block structure. This leads to blockings that respect natural clusters of units, which may improve performance.

Blocking as a Graph Partition Problem

A blocking of an experiment’s sample is a partition of its units into disjoint sets, referred to as blocks. The blocking problem is to find a blocking where units assigned to the same block are as similar as possible—either to minimize differences on prognostically important covariates or to facilitate the study of subgroups of interest. In the former case, when treatments are assigned in fixed proportions within blocks, blocking reduces imbalances between treatment groups and improves the precision of estimated effects.

Blocking problems can be viewed as graph partitioning problems (11, 13). Each experiment yields a weighted graph where vertices represent units in the sample. Edges connect each pair of units, and edge costs are the measured dissimilarity between the corresponding units (e.g., the Euclidean or Mahalanobis distance between their covariate vectors). Minimizing within-block edge costs when this graph is partitioned subject to a cardinality condition is equivalent to deriving an optimal blocking. In the matched-pair design, the objective is to minimize the sum of all within-block edge costs subject to the constraint that each block contains exactly two vertices.
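As a concrete illustration of this construction, the snippet below computes such edge costs directly from a covariate matrix. It is a minimal sketch of ours, assuming the data fit in memory; both metrics shown satisfy the triangle inequality required in the next section.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def edge_costs(X, metric="euclidean"):
        """Pairwise dissimilarities c_ij used as edge costs in the blocking graph.

        X: (n, d) array of unit covariates.
        metric: "euclidean" or "mahalanobis"; both are proper distance metrics.
        Returns an (n, n) symmetric cost matrix.
        """
        if metric == "mahalanobis":
            inv_cov = np.linalg.inv(np.cov(X.T))
            return squareform(pdist(X, metric="mahalanobis", VI=inv_cov))
        return squareform(pdist(X, metric="euclidean"))

For massive samples, the full n-by-n cost matrix is never materialized; as discussed below, the algorithm only needs the (k−1)-nearest neighbor subgraph.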

To improve blockings and facilitate the approximation algorithm, we consider a formulation of the blocking problem that differs from the past literature in three aspects. First, we facilitate designs other than matched pairs by allowing for any desired block size. Second, we consider blockings where each block is required to contain at least the desired number of units. Such threshold blockings have several advantages compared with the “fixed-sized” blockings derived by previous methods, where blocks are forced to be exactly of the desired size. Every fixed-sized blocking is also a threshold blocking; hence, for any sample and objective function, the optimal solution for the latter case is guaranteed to be at least as good as in the former (14). In particular, fixed-sized blocks might not respect natural clusterings of units, and one is sometimes forced to assign similar units to different blocks just to satisfy the cardinality condition.

Third, we consider a “bottleneck” objective function. That is, we wish to find a blocking that minimizes the maximum within-block edge cost—making the two least similar units assigned to the same block as similar as possible. The bottleneck objective has some advantages over the commonly used sum (or average) objective. Going back to at least Cochran, statisticians have observed that a few large imbalances are often more problematic than many small ones, especially when blocking is combined with ex post adjustments (15). Furthermore, parallel to monotonic imbalance bounding in observational studies (16), controlling the maximum imbalance within a block guarantees that the average imbalance cannot exceed this maximum after treatments are assigned. If an infinity norm is used to measure dissimilarity (i.e., the Chebyshev distance), this also applies to each covariate in isolation. Minimizing sums or averages does not provide such guarantees. Finally, bottleneck optimization problems often have approximate solutions that can be found efficiently (17). Although the algorithm cannot readily be extended to other objective functions, it has a local optimality property that provides good performance with respect to the average within-block edge cost.

The Bottleneck Threshold Blocking Problem

Let k denote a threshold for the minimum block size. Consider the complete graph G = (V, E) describing an experimental sample, where V denotes the set of n vertices (the experimental units) and E denotes the set of edges connecting all pairs of vertices. (Refer to Graph Theoretical Definitions for graph theoretical terminology and notation used in this paper.) For each ij ∈ E, there is an associated cost, cij, indicating the dissimilarity between i and j; lower costs mean that units are more similar. We require that these costs satisfy the triangle inequality:

∀ ij, jℓ, iℓ ∈ E,  cij + cjℓ ≥ ciℓ. [1]

This ensures that the direct route between two vertices is no longer than a detour through a third vertex. All distance metrics fulfill this criterion by definition.

Definition 1: A threshold blocking with threshold k is a partition b = {V1, …, Vm} of V where each block satisfies the size threshold:

∀ Vx ∈ b,  |Vx| ≥ k. [2]

Definition 2: The subgraph generated by a blocking b = {V1, …, Vm}, denoted G(b) = [V, E(b)], is the union of the subgraphs of G induced by the components of b; that is, an edge ij ∈ E(b) only if i and j are in the same block:

E(b) ≡ {ij ∈ E : ∃ Vx ∈ b, i, j ∈ Vx}. [3]

Let Bk denote the set of all possible threshold blockings of G with a threshold of k. The bottleneck threshold blocking problem is to find a blocking in Bk such that the maximum within-block dissimilarity is minimized. This amounts to finding an optimal blocking b* ∈ Bk such that the largest edge cost in G(b*) is as small as possible; let λ denote this minimum:

max_{ij ∈ E(b*)} cij = min_{b ∈ Bk} max_{ij ∈ E(b)} cij ≡ λ. [4]

Definition 3: An α-approximation algorithm for the bottleneck threshold blocking problem derives a blocking b ∈ Bk with a maximum within-block cost no larger than αλ:

max_{ij ∈ E(b)} cij ≤ αλ. [5]

In Proof of NP-Hardness, we show that, unless P = NP, no polynomial-time (2 − ϵ)-approximation algorithm exists for any ϵ > 0. Therefore, the problem is NP-hard, and finding an optimal solution is computationally intractable except for special cases or very small samples.
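For concreteness, the bottleneck objective in Eq. 4 can be evaluated for any candidate blocking in a few lines; the sketch below is ours and assumes the edge costs are available as a precomputed distance matrix.

    import numpy as np

    def bottleneck_cost(blocking, dist):
        """Maximum within-block dissimilarity of a blocking (the objective in Eq. 4).

        blocking: iterable of blocks, each a list of vertex indices.
        dist:     (n, n) symmetric matrix of edge costs c_ij.
        """
        worst = 0.0
        for block in blocking:
            for pos, i in enumerate(block):
                for j in block[pos + 1:]:
                    worst = max(worst, dist[i, j])
        return worst

    # Toy example: six units on a line, split into two blocks of three.
    x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 13.0])
    dist = np.abs(x[:, None] - x[None, :])
    print(bottleneck_cost([[0, 1, 2], [3, 4, 5]], dist))  # prints 3.0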

An Approximately Optimal Blocking Algorithm

We present a 4-approximation algorithm for the threshold blocking problem. Outside of the initial construction of a nearest neighbor subgraph, this algorithm has O(kn) time and space complexity. Hence, it can be used in experiments with millions of units. Although the algorithm guarantees a threshold blocking with maximum within-block cost no larger than 4λ, simulations indicate that derived blockings are much closer to the optimum in practice.

The Algorithm.

Given the graph representation of the experimental sample, G=(V,E), and a prespecified threshold k, the approximate blocking algorithm proceeds as follows:

  • 1. Construct a (k−1)-nearest neighbor subgraph of G. Denote this graph Gnn=(V,Enn).

  • 2. Find a maximal independent set of vertices, S, in the second power of the (k−1)-nearest neighbor subgraph, Gnn2. Vertices in S are referred to as the block seeds.

  • 3. For each seed i∈S, create a block comprising its closed neighborhood in Gnn, Vi=NGnn[i].

  • 4. For each yet unassigned vertex, assign it to any block that contains one of its adjacent vertices in Gnn.

When the algorithm terminates, the collection of blocks, balg={Vi}i∈S, is a valid threshold blocking of the experimental units that satisfies the optimality bound.
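A compact Python sketch of the four steps is given below. It assumes the (k−1)-nearest neighbor subgraph has already been built and is supplied as adjacency lists (one way to build it is sketched after Remark 2 below); the function and variable names are ours, and the seed selection uses the greedy subroutine described in Subroutine for the Second Step of the Algorithm.

    def threshold_block(nn, n):
        """Approximation algorithm for bottleneck threshold blocking.

        nn: adjacency lists of the (k-1)-nearest neighbor subgraph G_nn,
            nn[i] = set of vertices adjacent to vertex i.
        n:  number of vertices, labeled 0, ..., n-1.
        Returns a dict mapping each vertex to the seed of its block.
        """
        # Step 2: greedy maximal independent set in the second power of G_nn.
        seeds, covered = [], set()            # covered = seeds plus their neighbors
        for i in range(n):
            if covered.isdisjoint(nn[i] | {i}):
                seeds.append(i)
                covered |= nn[i] | {i}
        # Step 3: each seed's closed neighborhood in G_nn becomes a block.
        block_of = {}
        for s in seeds:
            for v in nn[s] | {s}:
                block_of[v] = s
        # Step 4: attach every remaining vertex to a block containing one of its
        # neighbors in G_nn (Lemma 1 guarantees such a neighbor exists).
        for i in range(n):
            if i not in block_of:
                j = next(v for v in nn[i] if v in block_of)
                block_of[i] = block_of[j]
        return block_of

Each seed has at least k − 1 neighbors in Gnn, so every block returned by the sketch contains at least k vertices, as required by Definition 1.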

Informally, the algorithm constructs the blocking by selecting suitable vertices, the seeds, from which the blocks are grown. Seeds are spaced sufficiently far apart so as not to interfere with each other’s growth, but they are dense enough so that all nonseed vertices have a seed nearby. Specifically, the second step of the algorithm prevents any vertex from being adjacent to two distinct seeds in the (k−1)-nearest neighbor subgraph, while ensuring that every vertex is at most a walk of two edges away from some seed. This ensures that the seeds’ closed neighborhoods do not overlap, and that vertices assigned to the same block are at a close geodesic distance. Fig. 1 illustrates how the algorithm constructs blocks in an example sample.

Fig. 1.

An illustration of the approximation algorithm for a sample with 2D covariate data when a minimum block size of two is desired (k=2). (A) The algorithm is provided with a set of data points and forms the graph by drawing an edge between all possible pairs of units. The edges are here omitted to ease presentation. (B) A (k−1)-nearest neighbor subgraph is constructed. (C) The second power of the nearest neighbor subgraph is derived, as shown by the edges, and a maximal independent set is found, as shown by the red vertices (the seeds). (D) All vertices adjacent to a seed in the nearest neighbor subgraph are included in the blocks formed by the seeds, as shown by the edges marked in red. (E) The two yet unassigned vertices are assigned to the blocks that contain one of their adjacent vertices in the nearest neighbor subgraph. (F) The final blocking.

Validity and Complexity.

We first prove that the algorithm is guaranteed to produce a valid threshold blocking, and then examine its time and space complexity.

Lemma 1. For any nonseed vertex, i∉S:

  • 1. There exist no two seeds both adjacent to i in Gnn.

  • 2. There exists a walk in Gnn of two or fewer edges from i to the seed of the block that i is assigned to.

Proof: The lemma follows from S being a maximal independent set in Gnn2. Refer to Proof of Lemma 1 for a complete proof.

Theorem 1. (Validity) The blocking algorithm produces a threshold blocking: balg∈Bk.

Proof: By Lemma 1, each vertex assigned in the third step is adjacent to exactly one seed; thus it will be in exactly one block. In the fourth step, vertices are assigned to exactly one block each. This ensures that the blocks are disjoint and span V; thus balg is a partition of V.

All seeds have at least k−1 adjacent vertices in Gnn. In the third step, these vertices and the seeds themselves will form the blocks, ensuring that each block contains at least k vertices. This satisfies Definition 1.

Theorem 2. (Complexity) The blocking algorithm terminates in polynomial time using O(kn) space.

Proof: Naively, the (k−1)-nearest neighbor subgraph can be constructed by sorting each vertex’s edge costs and finding its k−1 nearest neighbors. Thus, Gnn is constructed in, at most, O(n² log n) time (18). To enable constant time access to the neighbors of any vertex, store the nearest neighbor subgraph in n lists containing each vertex’s edges. There can be, at most, (k−1)n edges in Gnn. This implies an O(kn) space complexity for the edge lists.

Using the edge lists, a maximal independent set in the second power of Gnn can be found in O(kn) time without changing the space complexity. See Subroutine for the Second Step of the Algorithm for additional details. The third step is completed within O(n) time as Lemma 1 ensures that, at most, n units will be assigned to blocks in this step and the edge lists enable constant time access to the seeds’ neighbors. In the fourth step, it will never be necessary to search through all edge lists more than once, implying a complexity of O(kn).

Remark 1: After the initial construction of the (k−1)-nearest neighbor subgraph, the algorithm terminates in O(kn) time. As the nearest neighbor search problem is well studied, the naive subroutine in the proof can be improved on in most applications. In particular, most experiments will have reasonably low-dimensional metric spaces. Using specialized algorithms, the subgraph can, in that case, be constructed in O(kn log n) expected time (19) or worst-case time (20). If the covariates are not few to begin with, it is often advisable to use some dimensionality reduction technique before blocking so as to extract the most relevant information.

Run time can also be improved by using an approximate nearest neighbor search algorithm. However, approximate optimality is not guaranteed in that case.

Remark 2: There is rarely reason to increase the block size as the sample grows; thus k can be considered fixed in the typical experiment. When k is fixed and one can use a specialized procedure to derive the nearest neighbor subgraph, the algorithm has O(n log n) time and O(n) space complexity.
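As an illustration of Remarks 1 and 2, the sketch below builds the (k−1)-nearest neighbor subgraph with a k-d tree from scipy when the dissimilarity is the Euclidean distance; this is one possible specialized procedure of our choosing, and the adjacency-list format matches the blocking sketch above.

    import numpy as np
    from scipy.spatial import cKDTree

    def knn_subgraph(X, k):
        """Adjacency lists of the (k-1)-nearest neighbor subgraph G_nn.

        X: (n, d) array of covariates; Euclidean distance is assumed.
        k: the minimum block size threshold (k >= 2).
        """
        n = X.shape[0]
        tree = cKDTree(X)
        # Query k points per unit: the unit itself plus its k-1 nearest neighbors.
        # For distinct points, idx[i, 0] is i itself.
        _, idx = tree.query(X, k=k)
        nn = {i: set() for i in range(n)}
        for i in range(n):
            for j in idx[i, 1:]:
                nn[i].add(int(j))    # j is one of i's k-1 nearest neighbors ...
                nn[int(j)].add(i)    # ... and the subgraph is undirected
        return nn

    # Example usage with the blocking sketch above, for points in the plane:
    # X = np.random.default_rng(0).uniform(0, 10, size=(10_000, 2))
    # blocks = threshold_block(knn_subgraph(X, k=2), len(X))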

Approximate Optimality.

To prove the optimality bound, we will first show that the edge costs in the (k−1)-nearest neighbor subgraph are bounded. As the algorithm ensures that vertices in the same block are at a close geodesic distance in that subgraph, approximate optimality follows from the triangle inequality.

Lemma 2. No edge cost in Gnn can be greater than the maximum cost in the optimal blocking:

∀ ij ∈ Enn,  cij ≤ λ. [6]

Proof: Consider the graph

Gλ = (V, Eλ = {ij ∈ E : cij ≤ λ}). [7]

For all edges in an optimal blocking, ij ∈ E(b*), we have cij ≤ λ from optimality. It follows that E(b*) ⊆ Eλ.

Let c+ = max{cij : ij ∈ Enn} and consider

G+ = (V, E+ = {ij ∈ E : cij < c+}). [8]

The minimum degree of this graph, δ(G+), must be less than k − 1. If not, a (k−1)-nearest neighbor graph would exist as a subgraph of G+; as that subgraph would not contain an edge of cost c+, this would contradict that c+ is the maximum edge cost in Gnn.

Suppose that c+ > λ. It then follows that Eλ ⊆ E+; thus

δ[G(b*)] ≤ δ(Gλ) ≤ δ(G+) < k − 1. [9]

That is, there exists a vertex in G(b*) with fewer than k − 1 edges. It follows that there must exist a block in G(b*) with fewer than k vertices and, as Definition 1 is then violated, it cannot be a valid blocking. The contradiction proves that c+ ≤ λ, which bounds all edges in Enn.

Theorem 3. (Approximate optimality) The blocking algorithm is a 4-approximation algorithm:

max_{ij ∈ E(balg)} cij ≤ 4λ. [10]

Proof: Let balg denote the blocking produced by the algorithm. Consider any within-block edge ij ∈ E(balg). We must show that cij is bounded by 4λ.

If ij ∈ Enn, we have cij ≤ λ by Lemma 2. If ij ∉ Enn and i ∉ S, j ∈ S, then, by Lemma 1, there exists some ℓ so that iℓ, ℓj ∈ Enn. Lemma 2 applies to both these edges. By Eq. 1, the triangle inequality, it follows that

cij ≤ ciℓ + cℓj ≤ λ + λ = 2λ. [11]

If ij ∉ Enn and i, j ∉ S, let ℓ ∈ S be the seed in the block that vertices i and j are assigned to. From above, we have ciℓ, cℓj ≤ 2λ, and, by the triangle inequality,

cij ≤ ciℓ + cℓj ≤ 2λ + 2λ = 4λ. [12]

As there is exactly one seed in each block, i, j ∈ S is not possible, and we have considered all edges in E(balg).

Remark 3: In some settings, a slight reduction in the sample size is acceptable or required, e.g., for financial constraints or when blocks are constructed before units are sampled using secondary data sources. In these cases, the algorithm can easily be altered into a 2-approximation algorithm. By terminating at the end of the third step and disregarding the unassigned vertices, one ensures that all remaining vertices are, at most, a distance of λ from the seed (where λ refers to the maximum distance in the optimal blocking of the selected subsample). Applying the triangle inequality proves that all edge costs in the blocking of the subsample are bounded by 2λ. It is also possible to apply a caliper to the blocking so as to restrict the maximum possible edge cost by excluding some hard-to-block vertices.

A concern when using a bottleneck objective is that densely populated regions of the sample space will be ignored, as the blocks in these regions will not affect the maximum edge cost. This is especially worrisome when there are a few hard-to-block vertices that result in a large λ. This can lead to poor performance, as covariate balance often can be improved by ensuring good block assignments for all vertices. However, as the presented algorithm does not directly use the bottleneck objective to form the blocks, it avoids this issue. Instead, its optimality follows from the use of the nearest neighbor subgraph as the basis of blocking, and from this graph’s connection with the optimal edge cost as shown in Lemma 2.

The following theorem shows that our algorithm leads to approximate optimality not only in the complete sample but also in all subsamples. Thus, if there is a densely populated region, the algorithm ensures that the blocking is near optimal also within that region.

Theorem 4. (Local approximate optimality) Let bsub ⊆ balg be any subset of blocks from a blocking constructed by the algorithm. Define Vsub = ∪_{Vx ∈ bsub} Vx as the set of all vertices contained in the blocks of bsub. Let λsub denote the maximum edge cost in an optimal blocking of Vsub. The subset of blocks is an approximately optimal blocking of Vsub:

max_{ij ∈ E(bsub)} cij ≤ 4λsub. [13]

Proof: Theorem 4 is proven in Proof of Theorem 4.

Heuristic Improvements.

The algorithm allows for several improvements of a heuristic character. Although neither the guaranteed optimality bound nor the complexity changes, we expect these modifications to improve general performance. In particular, the algorithm has a tendency to construct blocks that are too large. Although flexibility in the block size is beneficial—the main idea behind threshold blocking—the current version tends to overuse that liberty.

The first improvement exploits an asymmetry in the nearest neighbor subgraph that is currently disregarded. The cardinality condition is met as each seed and its k−1 nearest neighbors are assigned to the same block. However, in addition to those necessary neighbors, the current version also assigns to the block vertices that have the seed among their nearest neighbors. With some minor alterations, detailed in Algorithm Using Nearest Neighbor Digraphs, the algorithm can use a (k−1)-nearest neighbor digraph to form the blocks. This digraph is such that an arc (i.e., directed edge) is drawn from i to j if j is among the (k−1) closest neighbors of i. Using the digraph, one can differentiate whether an edge indicates that a vertex is a neighbor of the seed or vice versa. Fig. S1 illustrates the difference between the undirected and directed versions of the algorithm.

Fig. S1.

An illustration of the directed version of the blocking algorithm with a comparison with the undirected version. (A) The directed algorithm creates a nearest neighbor digraph by drawing an arc from each vertex to its closest neighbor. (A′) The undirected algorithm draws the same graph but disregards the directionality of the edges. (B) The directed version finds seeds (red vertices) so that no two seeds point toward the same vertex. (B′) As the undirected version disregards the direction of the edges, it requires that the shortest path between any two seeds in the nearest neighbor graph is at least three edges regardless of whether that is needed to satisfy the block-size threshold. It, thereby, misses one possible seed in this example. (C) Blocks are formed with the seeds’ closest neighbors. (C′) Blocks are formed both with the seeds’ closest neighbors and with vertices that have the seeds as their closest neighbors. (D and D′) Remaining vertices are assigned to the block containing their closest neighbor.

There is rarely a unique maximal independent set in the second power of the nearest neighbor graph (i.e., the seeds). The current version selects one arbitrarily. The second improvement is to choose the seeds more deliberately. As each seed assigns at least k−1 vertices to its block, a straightforward way to reduce the block sizes is to maximize the number of seeds—a larger set is expected to produce better blockings. The ideal may be the maximum independent set, but deriving such a set is an NP-hard problem. Most heuristic algorithms are, however, expected to perform well.

Third, despite the above improvements, the algorithm will occasionally produce blocks that are much larger than k. Whenever a block contains 2k or more vertices, it can safely be split into two or more blocks, each containing at least k vertices. As the algorithm ensures that all edge costs satisfy the optimality bound and no edges are added by splitting, this can only lower the maximum within-block cost. In Greedy Threshold Algorithm, we describe a greedy threshold blocking algorithm for splitting blocks that runs fast and is expected to perform well. This algorithm can also be used to block the complete sample, but will not perform on par with the approximation algorithm.
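The greedy splitting procedure itself is described in Greedy Threshold Algorithm (SI) and is not reproduced here; the sketch below is a simpler stand-in of ours that illustrates the point: any block with 2k or more vertices can be cut into pieces of at least k vertices, and because no new within-block edges are created, the bound of Theorem 3 is preserved.

    import numpy as np

    def split_block(block, dist, k):
        """Split a block with 2k or more vertices into blocks of at least k vertices.

        Repeatedly peels off a group of k vertices: a local seed (the vertex with
        the largest average distance to the rest of the block) together with its
        k-1 nearest neighbors among the remaining vertices. The final remainder,
        between k and 2k-1 vertices, forms the last block. Illustrative only; not
        the greedy algorithm of the SI.
        """
        remaining = list(block)
        pieces = []
        while len(remaining) >= 2 * k:
            sub = dist[np.ix_(remaining, remaining)]
            s = int(np.argmax(sub.mean(axis=1)))
            neighbors = [j for j in np.argsort(sub[s]) if j != s][: k - 1]
            piece = [remaining[s]] + [remaining[j] for j in neighbors]
            pieces.append(piece)
            remaining = [v for v in remaining if v not in piece]
        pieces.append(remaining)
        return pieces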

A fourth improvement changes how vertices are assigned in the fourth step. With larger desired block sizes, some blocks may contain peripheral vertices that are far from their seeds. In these cases, it is often beneficial to assign the remaining vertices in the fourth step to the block containing their closest seed instead of the block containing a closest neighbor. This avoids the situation where a vertex is assigned to a distant block due to having a peripheral vertex close by. The optimality bound is maintained as Theorem 3 ensures that at least one seed exists at a distance of at most 2λ. If a vertex is not assigned to that seed, it must have been assigned to a seed that is at a closer distance.

Finally, once a blocking is derived, searching for moves or swaps of vertices between blocks can lead to improvements, as in other partitioning problems (21). It is, however, seldom feasible to let such searches continue until no additional improvements are possible (i.e., until a local optimum is found), as the flexible block structure allows for a vast number of moves and swaps.

Simulation Study

We provide the results from a small simulation study. Apart from the original algorithm, we include versions that use the improvements discussed in Heuristic Improvements. Specifically, we include a version using the nearest neighbor digraph (the first improvement) and a version using the first three improvements. A version that uses all four improvements is generally only beneficial with larger block sizes and is included when such settings are investigated in Tables S1–S3.

Table S1.

Performance of blocking algorithms when k=4 by sample size: Maximum (Max.) and average (Avg.) within-block distances relative to approximation algorithm and average block size

Table S2.

Root mean square error relative to approximation algorithm when k=4 by sample size

Table S3.

Root mean square error when k=4 by sample size, without normalization

For comparison, we also include the greedy threshold blocking algorithm discussed in Greedy Threshold Algorithm and the currently best-performing greedy fixed-sized blocking algorithm (12). When k=2, we can also include a commonly used implementation of the nonbipartite matching algorithm (11).

We investigate a simple setting where each data point is sampled independently from a uniform distribution over a two-dimensional plane,

x1, x2 ∼ U(0, 10), [14]

and similarity is measured as the Euclidean distance on this plane. Although many experiments will have data with higher dimensions, there is often no reason to include all of those dimensions when deriving blocks. Typically, one wants to reduce the dimensionality in a preprocessing step to extract the information that is most predictive of the potential outcomes (22). The investigated algorithms are, however, not restricted to low-dimensional data.

All simulations are run on a single CPU core reflecting the performance of a modern desktop computer. See Implementation and Hardware for details about the implementation of the algorithm and the hardware used to run the simulations.

Run Time and Memory.

To investigate the resource requirements, we let each algorithm block samples containing between 100 and 100 million data points generated from the model above. Each setting was replicated 250 times. As time and memory use are quite stable over the runs, these replications suffice to investigate even small differences. In these simulations, the directed version performs almost identically to the original version, and it is omitted from the graphs to ease presentation.

Fig. 2 presents the time and memory needed for the algorithms to successfully terminate. All versions of the approximation algorithm run fast and terminate within a minute for sample sizes up to a few million. With 100 million data points, the original version terminates within 11 min, whereas the version with all three refinements does so in less than 16 min—all very manageable in real applications. Memory use increases linearly at a slow rate for both versions. A modern desktop computer would have enough memory to block samples with tens of millions of units.

Fig. 2.

Run time (A and B) and memory use (C and D) of five blocking algorithms with 2D input data over a range of sample sizes. Marker symbols are actual simulation results, and the connecting lines are interpolations. Results are presented with different scales due to the large differences in performance. Results are presented for all algorithms for sample sizes up to 40,000 data points (A and C), whereas results for sample sizes up to 100 million data points are only shown for the two approximation algorithms (B and D). No simulations were successful for the greedy algorithms and nonbipartite matching for sample sizes larger than 20,000, due to excessive run time or memory use. The directed version of the approximation algorithm, as described in Heuristic Improvements, has run time and memory use almost identical to the undirected version (shown in red), and its results are not shown in the figure. See Table S4 for detailed results.

The three comparison algorithms paint another picture altogether. For the rather modest sample size of 20,000 data points, these algorithms take more than 20 min—up to 2 h—to terminate. Even more problematic is their extensive memory use. For samples larger than 50,000 data points, all three algorithms try to allocate more than 48 gigabytes of memory; under these settings, these algorithms do not terminate successfully, and no results can be shown.

Detailed results for these simulations and simulations with input data with higher dimensionality are presented in Tables S4–S6.

Table S4.

Run time and memory use for blocking algorithms by sample size: 2D input data

Table S5.

Run time and memory use for blocking algorithms by sample size: 5D input data

Table S6.

Run time and memory use for blocking algorithms by sample size: 10D input data

Minimizing Distances.

To investigate how well the algorithms minimize distances, we increase the number of replications to 5,000; this statistic is less stable, and the differences between algorithms are smaller. However, with this number of replications, no difference can be attributed to simulation error.

The first three data columns of Table 1 show the maximum within-block distance, averaged over the simulation rounds, when the desired minimum block size is two. Refer to Table S1 for results when the desired minimum block size is four. We report values normalized by the performance of the approximation algorithm to ease interpretation. The two improved versions of the approximation algorithm lead to quite substantial decreases in the maximum distance. All versions of the approximation algorithm outperform the two greedy algorithms—for the fixed-sized version, drastically so. However, only the version with all three improvements performs better than nonbipartite matching.

Table 1.

Performance of blocking algorithms by sample size: Maximum (Max.) and average (Avg.) within-block distances relative to approximation algorithm and average block size

In the fourth through sixth data columns of Table 1, the average within-block distance is presented. This measure is the objective of the greedy fixed-sized algorithm and nonbipartite matching, so it is not surprising that these algorithms show better performance. Nonbipartite matching is, here, the best performing algorithm, and only the approximation algorithm with all three improvements outperforms the fixed-sized greedy algorithm.

The seventh through ninth data columns in Table 1 present the average size of the blocks produced by the different algorithms. The improvements discussed in Heuristic Improvements are shown to be effective in taming the original algorithm’s tendency to construct blocks that are too large. However, smaller blocks do not automatically lead to better performance. This is evident from the greedy threshold algorithm, which produces smaller blocks than the approximation algorithms but has worse performance. The two fixed-sized blocking algorithms produce blocks of constant size by construction.

Reducing Uncertainty.

To investigate how the blockings affect an estimator’s performance, one must specify a data-generating process for the outcome. The results are highly sensitive to the details of this process. Even blockings that are optimal with respect to within-block distances need not lead to the lowest variance. An extensive investigation is beyond the scope of this paper, and we provide only indicative results.

We consider a simple setting with two treatment conditions when the desired minimum block size is two. Refer to Table S2 for results when the minimum block size is four. The outcome (y) is the product of the covariates with additive, normally distributed noise,

y = x1x2 + ε,  ε ∼ N(0, 1). [15]

Note that the treatment does not enter into the model and thus has no effect—the potential outcomes are equal. The covariates are highly predictive in this model—more so than one can expect in a real application. This enables the methods to make the most out of the covariate information and helps us differentiate between the algorithms’ performances. For all blocking methods, we will use a difference-in-means estimator that is weighted by the size of each block to estimate treatment effects. Refer to Estimation Methods or ref. 23 for additional details on estimation with threshold blocking. We ran 5,000 replications in this setting, and results are presented relative to the approximation algorithm. Tables S3 and S7 present the results without normalization.
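To make the estimation step concrete, the sketch below reproduces this setup: it draws the sample of Eq. 14, generates outcomes from Eq. 15, randomly splits each block between the two conditions, and computes the block-size-weighted difference-in-means estimator. The implementation details (sample size, seed, the even split within blocks, the placeholder blocking) are ours and only illustrate the procedure referenced in Estimation Methods and ref. 23.

    import numpy as np

    def weighted_dim(blocks, y, rng):
        """Block-size-weighted difference-in-means estimate with two arms.

        blocks: list of blocks, each a list of unit indices.
        y:      vector of observed outcomes.
        Each block is randomly split as evenly as possible between the two arms.
        """
        n = sum(len(b) for b in blocks)
        estimate = 0.0
        for block in blocks:
            perm = rng.permutation(block)
            m = len(block) // 2 + rng.integers(0, 2) * (len(block) % 2)
            treated, control = perm[:m], perm[m:]
            estimate += (len(block) / n) * (y[treated].mean() - y[control].mean())
        return estimate

    rng = np.random.default_rng(1)                  # arbitrary seed
    n = 1000                                        # illustrative sample size
    X = rng.uniform(0.0, 10.0, size=(n, 2))         # Eq. 14
    y = X[:, 0] * X[:, 1] + rng.standard_normal(n)  # Eq. 15; no treatment effect
    # `blocks` would come from a blocking algorithm; consecutive pairs are used
    # here purely as a placeholder.
    blocks = [[i, i + 1] for i in range(0, n, 2)]
    print(weighted_dim(blocks, y, rng))             # close to zero in expectation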

Table S7.

Root mean square error when k=2 by sample size, without normalization

Table 2 presents results for the root of the average squared difference between estimates and the true treatment effect (RMSE) for each method’s estimator. All blocking methods with proven optimality level perform well. For the smaller sample sizes, the approximation algorithm with all three improvements performs best, and nonbipartite matching is slightly better in the larger samples. All blocking methods seem, however, to converge as the sample size grows.

Table 2.

Root mean square error relative to approximation algorithm by sample size

In addition to the six blocking methods, Table 2 includes the results of two methods that do not use blocking. In both cases, the RMSE is markedly higher than in any of the methods using blocking. When controlling for imbalances using ordinary least squares regression (OLS), the RMSE is at least twice as large as that of any blocking method with proven optimality. When the estimate is completely unadjusted for imbalances, the RMSE is up to 20 times higher than for the approximation algorithm. However, this difference certainly overstates the benefits of blocking that one can expect in real applications, as the covariates are unusually predictive in this simulation.

Discussion

Our approximation algorithm enables large and complex experiments to make use of blocking. No feasible algorithm with proven optimality properties has been available for massive experiments. Although many of the commonly used blocking algorithms run in polynomial time, none run in quasilinear time as the approximation algorithm does. Polynomial time is not a sufficient condition for tractability with very large data (24). One can see this in our simulations.

The threshold blocking algorithm is expected to perform well in most cases, given its approximate optimality. However, nonbipartite matching, when it is feasible, is likely the best choice in experiments with matched-pair designs because it is exactly optimal. Our simulation results seem to point toward this conclusion as well. The matched-pair design is, however, limited to the case of only two treatment conditions, and the design complicates the estimation of SEs because block sizes larger than two are sometimes needed, for example, to estimate conditional variances. Our approximation algorithm is therefore an important arrow in the quiver of experimental design.

Whenever blocking reduces imbalances in prognostically important covariates, blocking will improve the expected precision of estimates. In some settings, even when the covariates contain no information about the outcomes, blocking cannot increase the variance of the treatment effect estimator compared with when no blocking is done (25). However, theoretical results depend on the randomization model, estimand, and estimator used. There are some rare instances where blocking may decrease precision.

As an alternative to blocking, some advocate rerandomization when a given randomization results in poor balance in observed covariates (26, 27). Rerandomization restricts the randomization scheme, as assignments with poor balance are ruled out. If the rule for which randomizations are acceptable is precise and set a priori, randomization inference is well defined. One worries, however, that researchers will not use well-specified rules that they will later recall to restrict randomization distributions. Especially if the initial randomization results in good balance, it is doubtful that investigators will adjust their test levels as they had planned.

Far more common than blocking or rerandomization are ex post methods of adjusting experimental data such as poststratification or using a model-based estimator that incorporates covariate information. Such methods can work well. For example, poststratification is nearly as efficient as blocking: The difference in their variances is on the order of 1/n2, with a constant depending on treatment proportion (28). However, poststratification can increase variance if the number of strata is large and the strata are poorly chosen. Regression adjustment can provide significant gains in precision (29, 30), and model-based hypothesis tests can be asymptotically valid even when the adjustment model is misspecified (31). However, regression adjustment, like poststratification, may increase the finite sample variance, and will do so, on average, for any sample size, if the covariates are not informative (32).

A key argument in favor of blocking as opposed to ex post adjustment is that one is increasing the transparency of the analysis by building covariate adjustment into the experimental design. The results cited regarding poststratification and model adjustment assume that the investigator did not pick the strata or model as a function of the realized treatment assignment. One further assumes that the investigator does not run a number of adjustment models and then only report the one with the desired results. Human nature being what it is, this assumption is probably optimistic. A major benefit of randomized experiments, aside from the randomization, is that the design stage is separated from the analysis stage by construction (33). Blocking allows one to use design-based estimators that adjust for covariate information (34). The less that there is to do at the analysis stage, the less likely it is that the investigator will fish for particular results, unconsciously or not.

When researchers select adjustment models based on observed p-values, it is called p-hacking, and the habit appears to be prevalent (8). Because of concerns about p-hacking, there has been a move toward creating preanalysis plans for experimental studies in both medicine and the social sciences. Such plans force researchers to lay out in advance how they will analyze the experiment and what subgroups and outcomes are of primary interest. Unfortunately, evidence from medicine, where the practice is best established, shows that preanalysis plans are often ignored and allow significant leeway in selection of covariates, subgroups, outcomes, and adjustment methods, and readers and reviewers are rarely informed of departures (35). Blocking allows one to encode into the design of a study the covariates the investigator, a priori, thinks are important. After randomization, one may still adjust for these variables, as small imbalances may remain. Blocking then acts as an effective signal that the investigator intended to do such adjustments before seeing initial results. Moreover, some covariates may not be measured at the time of randomization, and they could be adjusted ex post, although concerns about p-hacking may arise.

Finally, blocking is motivated by partial knowledge about how the covariates relate to the outcomes. The performance of any blocking algorithm depends on how well the chosen similarity measure captures this relationship. As this choice is subject-specific, general recommendations are hard to come by. However, as noted in our remarks, if one has many covariates, some dimension reduction to those that most likely relate to the outcomes is often advantageous. If more complete knowledge exists, such as a good estimate of the potential outcomes under control, one would gain more precision by directly blocking on that estimate.

There are no free lunches in statistics, but blocking comes close. It has few downsides and risks relative to complete randomization, other than computational challenges, and any experimenter should be motivated to block their sample (23). In this paper, we have enabled the technique for experiments where it previously was infeasible.

Graph Theoretical Definitions

Let G=(V,E) be an arbitrary graph.

Complete graph. G is complete if ij∈E for any two vertices i,j∈V.

Spanning. A graph G′=(V,E′) is a spanning subgraph of G if they contain the same set of vertices and E′⊆E.

Induced. A subgraph G′ = G[V′] = (V′, E′) is induced on G by V′ ⊆ V if G′ contains all edges ij ∈ E that connect vertices in V′ and no other edges:

E′ ≡ {ij ∈ E : i, j ∈ V′}.

Adjacent. Vertices i and j are adjacent in G if ij∈E.

Degree. The degree of vertex i in G is its number of adjacent vertices:

deg(i) ≡ |{j ∈ V : ij ∈ E}|.

Minimum degree. The minimum degree, δ(G), of G is the minimum degree among the vertices in G:

δ(G) ≡ min_{i ∈ V} deg(i).

Neighborhood. A neighborhood of vertex i, NG(i), in G is the set of vertices adjacent to i:

NG(i) ≡ {j ∈ V : ij ∈ E}.

A closed neighborhood also contains i: NG[i] ≡ NG(i) ∪ i.

Independent set. A set of vertices I ⊆ V is independent in G if no vertices in the set are adjacent:

∄ i, j ∈ I such that ij ∈ E.

Maximal independent set. An independent set of vertices I in G is maximal if, for any additional vertex i ∈ V, the set i ∪ I is not independent:

∀ i ∈ V∖I, ∃ j ∈ I such that ij ∈ E.

The maximum independent set in G is the maximal independent set with the largest cardinality among all independent sets in G.

Walk. A walk from i = k0 to j = km of length m in G is an (m+1)-tuple of vertices, with an edge between each adjacent pair, connecting i and j:

(k0, k1, …, km) : ∀ 1 ≤ ℓ ≤ m, kℓ−1kℓ ∈ E.

Power. The dth power of G is a graph Gd=(V,Ed) where an edge ij∈Ed if there exists a walk from i to j in G of d or fewer edges.

Partition. A partition of V is a set of subsets p={V1,…,Vm} satisfying

  • 1. (Nonempty) ∀Vx∈p,Ø≠Vx⊆V,

  • 2. (Disjoint) ∀Vx,Vy∈p,(Vx≠Vy)⇒(Vx∩ Vy=Ø),

  • 3. (Spanning) ∪Vx∈pVx=V.

Nearest neighbor subgraph. The k-nearest neighbor subgraph of G is a subgraph Gnn = (V, Enn) where an edge ij ∈ Enn only if j is one of the k nearest vertices to i or i is one of the k nearest vertices to j:

Enn = {ij ∈ E : (i, j) ∈ Ednn ∨ (j, i) ∈ Ednn}.

Nearest neighbor digraph. The k-nearest neighbor digraph of G is a directed subgraph Gdnn = (V, Ednn) where an arc (i, j) ∈ Ednn only if j is one of the k nearest vertices to i. Rigorously, for a vertex i, let i(ℓ) denote the vertex j that corresponds to the ℓth smallest value of {cij : ij ∈ E} (where ties are broken arbitrarily but deterministically),

ci i(1) ≤ ci i(2) ≤ …,

then

Ednn = {(i, j) : ij ∈ E ∧ j ∈ {i(ℓ)}ℓ=1,…,k}.

Proof of NP-Hardness

We prove NP-hardness by showing that a partition problem considered by Kirkpatrick and Hell (36) can be reduced to the bottleneck threshold blocking problem. For an arbitrary graph G=(V,E), the PART[{Kt:t≥k}] problem asks whether there exists a set of subgraphs G={G1=(V1,E1),⋯,Gm} of G such that {V1,…,Vm} is a partition of V and each subgraph Gx∈G is isomorphic to a complete graph with k or more vertices, Kt≥k. This problem is NP-complete for any k≥3 (36).

Theorem S1. Let ϵ>0. The bottleneck threshold blocking problem with minimum block size k≥3 does not allow for a polynomial time (2−ϵ)-approximation algorithm unless P=NP.

Proof: Consider an instance of the PART[{Kt : t ≥ k}] problem on a graph G = (V, E). Create a weighted complete graph, H = (V, E′), that shares the vertex set with G. For all edges ij ∈ E′, let the corresponding costs be

cij = 1 if ij ∈ E, and cij = 2 if ij ∉ E. [S1]

These costs satisfy the triangle inequality. Note that, for any cost ratio higher than 2 between the two types of edges, the triangle inequality will be violated for some input graph.

We consider solving the bottleneck threshold blocking problem on H. We show that the minimum maximum within-block cost λ=1 if and only if PART[{Kt:t≥k}] is true. Because λ∈{1,2}, it follows that λ=2 if and only if PART[{Kt:t≥k}] is false. Hence, any (2−ϵ)-approximation algorithm will produce a blocking with maximum within-block distance of 1 if and only if PART[{Kt:t≥k}] is true, and, hence, can be used to solve PART[{Kt:t≥k}]. Thus, such an algorithm terminates in polynomial time only if P=NP.

Suppose λ = 1. Consider an optimal blocking b* = {V1, …, Vm} of H and the set of subgraphs induced on the input graph by the components of the blocking:

G = {G[V1], G[V2], …, G[Vm]}. [S2]

All ij ∈ E′(b*) have cost cij = 1 and so must exist in G. Thus, all blocks induced on G must be isomorphic to a complete graph. Furthermore, as H and G share the same vertex set and b* partitions H, G partitions G. It follows that G is a valid {Kt : t ≥ k} partition and PART[{Kt : t ≥ k}] is true.

Suppose PART[{Kt : t ≥ k}] is true. Let G = {G1 = (V1, E1), …, Gm = (Vm, Em)} denote a {Kt : t ≥ k} partition. Consider a blocking constructed from the vertex sets of G: p = {V1, …, Vm}. As all subgraphs in G are complete, the within-block edges are exactly E1 ∪ E2 ∪ ⋯ ∪ Em. As these edges exist in G, the corresponding edge costs in H are 1. Thus, the maximum edge cost in the blocking p is 1, and so λ = 1.

Proof of Lemma 1

Lemma 1 states that:

  • 1. For any i∉S, there exist no two seeds both adjacent to i in Gnn,

∀ j, ℓ ∈ S,  ij ∉ Enn ∨ iℓ ∉ Enn. [S3]

  • 2. For any i∉S, let j∈S denote the seed of the block that i is assigned to. There exists a walk of two or fewer edges from i to j in Gnn,

ij ∈ Enn ∨ ∃ ℓ such that iℓ, ℓj ∈ Enn. [S4]

If the first statement is false, then there exists a walk of two edges between j and ℓ, going through i. This implies that j and ℓ are adjacent in Gnn2, but this contradicts the definition of S as an independent set of Gnn2.

The second statement follows from how vertices are assigned to blocks. Vertices assigned in the third step of the algorithm are, by construction, adjacent to their block seeds in Gnn. Vertices assigned in the fourth step are adjacent in Gnn to a vertex in their block that is assigned in the third step. This vertex (i.e., ℓ) is, in turn, adjacent to the block seed, forming a walk of two edges from i to j. Every vertex unassigned after the third step must be adjacent to at least one vertex in the closed neighborhood of a block seed. Otherwise, there is no walk of two edges from that unassigned vertex to a block seed; hence, that unassigned vertex is independent of all block seeds in Gnn2, contradicting the maximality of S.

Subroutine for the Second Step of the Algorithm

The obvious way to derive the seeds is to construct the second power of the (k−1)-nearest neighbor subgraph and then find a maximal independent set in this graph using conventional algorithms. The second power will, however, not always be a sparse graph, even when k is fixed, and this procedure would therefore increase the time and space complexity. Using the subroutine presented in this section gives a complexity of O(kn) independently of how Gnn is structured.

As discussed in the proof of Theorem 2, we store the (k−1)-nearest neighbor subgraph using edge lists for all vertices. This allows for access to a vertex’s edges with constant time complexity. Consider the following procedure that takes these edge lists as input:

  • 1. Initialize S and A to the empty set.

  • 2. Let i iterate over all vertices in V:

  • (a) If the closed neighborhood of i contains any vertex in A, NGnn[i]∩A≠Ø, continue to the next iteration.

  • (b) Else, set S←S ∪ i and A←A∪NGnn[i].

When this routine terminates, the set S will be a maximal independent set in Gnn2. This can be shown with a proof by contradiction. Note that, at any iteration of the loop, A contains all vertices in S and all of their adjacent vertices.

Suppose that S is not independent. Two vertices i, j ∈ S that are adjacent in the second power must then exist. That is, either ij ∈ Enn or, for some ℓ, we have iℓ, ℓj ∈ Enn. As vertices are added to S sequentially, a state must have existed such that (with arbitrary labeling)

i ∈ S, j ∉ S, and NGnn[j] ∩ A = Ø. [S5]

If not, j would not have been added to S when its iteration came. When i was added to S, it and all of its neighbors were added to A. This implies that if ij ∈ Enn, then j is in A, and if ℓ exists so that iℓ, ℓj ∈ Enn, then ℓ ∈ A. Consequently, such a state is not possible, and S must be independent.

Suppose that S is not maximal. There must then exist an i ∈ V∖S that is not adjacent to any vertex in S in Gnn2 when the algorithm terminates. This implies that NGnn[i] ∩ A = Ø. As vertices are only ever added to A, this must also have been true throughout the run of the algorithm. However, if NGnn[i] ∩ A = Ø was always true, i would have been added to S in its iteration. The algorithm can therefore not terminate in such a state, and S must be maximal.

To show complexity, note that, in the worst case, one has to check, for every vertex, whether any member of its closed neighborhood is in A. As there can be, at most, 2(k−1)n entries in the edge lists, at most O(kn) such checks are needed. By storing set membership in a random access data structure, set queries are done in O(1) time, which gives a complexity of O(kn) for the complete subroutine.
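
To make the steps concrete, the following is a minimal R sketch of the subroutine (an illustration, not the authors' implementation). It assumes the (k−1)-nearest neighbor subgraph is supplied as a list nbr of integer vectors, where nbr[[i]] holds the vertices adjacent to vertex i; the logical vector in_A provides the O(1) membership queries discussed above.

    # Minimal sketch of the seed-finding subroutine; `nbr[[i]]` is assumed to
    # hold the vertices adjacent to i in the (k-1)-nearest neighbor subgraph.
    find_seeds <- function(nbr) {
      n <- length(nbr)
      in_A <- logical(n)            # membership in A, queried in O(1) time
      seeds <- integer(0)           # the set S
      for (i in seq_len(n)) {
        closed_nbh <- c(i, nbr[[i]])
        if (any(in_A[closed_nbh])) next   # step 2a: N[i] intersects A, skip i
        seeds <- c(seeds, i)              # step 2b: add i to S
        in_A[closed_nbh] <- TRUE          #          and add N[i] to A
      }
      seeds
    }

Replacing seq_len(n) with another ordering of the vertices implements the reordering mentioned below.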

A straightforward way to incorporate the second heuristic improvement discussed in Heuristic Improvements is to change the order in which this subroutine iterates over the vertices.

Proof of Theorem 4

Theorem 4 states that any subset of blocks from a blocking constructed by the algorithm, bsub⊆balg, will be approximately optimal with respect to the blocking problem that contains only the vertices in the blocks of bsub. Formally, define Vsub as the union of the blocks in bsub, i.e., the set of all vertices in those blocks. Let λsub denote the maximum edge cost in an optimal blocking of Vsub. Theorem 4 states that

\max_{ij \in E(b_{sub})} c_{ij} \;\le\; 4\lambda_{sub}. \quad [S6]

Let Gsub denote the complete graph on Vsub. Recall that Gnn=(V,Enn) is the (k−1)-nearest neighbor subgraph of G. Let Gind=(Vsub,Eind)=Gnn[Vsub] denote the graph induced by Vsub on Gnn. That is, Eind contains all edges in Enn between vertices that are in Vsub. Finally, let Gsub,nn=(Vsub,Esub,nn) denote the (k−1)-nearest neighbor subgraph of Gsub.

Observe that Eind⊂Esub,nn: if i,j∈Vsub and j is one of the (k−1) nearest neighbors of i in G, then it must also be one of the (k−1) nearest neighbors of i in Gsub⊂G. From Lemma 2, we have that cij≤λsub for all ij∈Esub,nn, and thus cij≤λsub for all ij∈Eind.

As implied by Lemma 1, there is a walk in Gnn of four or fewer within-block edges between any two vertices contained in the same block in balg. As Gind retains all within-block edges in Gnn of the blocks in bsub, there is a walk of four or fewer such edges between any two vertices in the same block in Gind as well. Theorem 3 can, therefore, be applied using Eind in place of Enn, which bounds all within-block edge costs in bsub by 4λsub.
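
To spell out the final step, consider two vertices i and j in the same block of bsub, and let i = v0, v1, …, vm = j be such a walk with m ≤ 4 and every edge in Eind. The triangle inequality and the edge-cost bound above then give

c_{ij} \;\le\; \sum_{t=1}^{m} c_{v_{t-1} v_t} \;\le\; m \max_{\ell h \in E_{ind}} c_{\ell h} \;\le\; 4\lambda_{sub}.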

Algorithm Using Nearest Neighbor Digraphs

The approximation algorithm presented in the paper tends to construct blocks that are too large. A slightly modified version using a k-nearest neighbor digraph will often produce better blockings. In particular, to avoid collisions, one only needs to ensure that no seed claims a vertex that is either a seed itself or that some other seed also wants to claim for its block. The undirected version additionally requires that no nonseed vertex would claim a seed if that nonseed vertex were itself made a seed, an unnecessary requirement that the digraph version avoids.

Redefine NG[i] to be the directed version of a closed neighborhood,

N_G[i] \equiv \{\, j \in V : (i,j) \in E \,\} \cup \{i\}. \quad [S7]

Consider the following algorithm that takes the graph, G, describing the experimental sample as input:

  • 1. Construct a (k−1)-nearest neighbor digraph of G so that an arc (i,j) exists if j is among the (k−1) nearest neighbors of i. Denote that graph Gdnn=(V,Ednn).

  • 2. Find a set of vertices, S, so that:

  • (a) There exist no i,j∈S so that (i,j)∈Ednn.

  • (b) There exist no i,j∈S and ℓ∈V∖S so that (i,ℓ)∈Ednn and (j,ℓ)∈Ednn.

  • (c) Adding any i∈V∖S to S would violate either (a) or (b).

  • 3. For each i∈S, form a block with that vertex and all of its adjacent vertices: Vi=NGdnn[i].

  • 4. Assign vertices that are as yet unassigned to a block that contains one of their nearest assigned neighbors. That is, for an unassigned vertex i, assign it to any Vx such that ∃j∈Vx:(i,j)∈Ednn.

An illustration of this algorithm, with a comparison with the undirected version, is given in Fig. S1.

The resulting blocking is approximately optimal for the same reasons as for the original algorithm. The second step ensures that no two seeds have outward-pointing arcs to the same vertex. This makes the blocking disjoint and satisfies the size requirement. The second step also ensures that all vertices are, at most, two arcs (of any directionality) away from their seeds, and, following the same logic as in the proof of Theorem 3, all vertices are at a distance of, at most, 2λ from their seeds. By the triangle inequality, this proves approximate optimality.

The (k−1)-nearest neighbor digraph will have exactly (k−1)n arcs and can thus be stored in O(kn). With only trivial changes, the steps of this algorithm can be done in the same way as in the original version, thus preserving complexity. In particular, the subroutine presented in Subroutine for the Second Step of the Algorithm can still be used to complete the second step when NG[i] is redefined to the directed version as above.
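
As an illustration of these steps, the following R function is a simplified sketch under assumed inputs, not the reference implementation. It takes a list out_nbr, where out_nbr[[i]] holds the (k−1) nearest neighbors of vertex i (the heads of i's arcs in Gdnn), and returns a block label for every vertex.

    # Sketch of the digraph-based blocking. Seeds are selected with the same
    # subroutine logic as above, but using the directed closed neighborhood
    # N[i] = {i} union out_nbr[[i]], as in Eq. S7.
    block_digraph <- function(out_nbr) {
      n <- length(out_nbr)
      claimed <- logical(n)              # seeds and vertices claimed by a seed
      block <- rep(NA_integer_, n)       # block label (the index of the seed)

      # Second and third steps: select seeds; each seed claims its out-neighbors.
      for (i in seq_len(n)) {
        closed_nbh <- c(i, out_nbr[[i]])
        if (any(claimed[closed_nbh])) next
        block[closed_nbh] <- i
        claimed[closed_nbh] <- TRUE
      }

      # Fourth step: an unassigned vertex joins a block containing one of the
      # assigned vertices it points to (such a vertex always exists).
      for (i in which(is.na(block))) {
        targets <- out_nbr[[i]]
        assigned <- targets[!is.na(block[targets])]
        block[i] <- block[assigned[1]]
      }
      block
    }

With the directed closed neighborhood, the greedy seed-selection loop enforces conditions (a) to (c) of the second step: a vertex is skipped exactly when adding it as a seed would create a conflict with an existing seed or with a vertex an existing seed has claimed.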

Greedy Threshold Algorithm

The third improvement discussed in Heuristic Improvements is to split blocks that contain 2k or more vertices into smaller blocks. Any algorithm can be used to do this split, as the approximation algorithm ensures that all edges satisfy the optimality bound. One approach would be to use the approximation algorithm once more. However, as the large blocks are a consequence of the structure of the nearest neighbor subgraph, the algorithm will often return the block unchanged.

The greedy algorithm presented in this section seems to perform well in many cases where splitting is desired. The algorithm's input is an arbitrary valid threshold blocking, b, and it returns a blocking in which no within-block edge cost is greater than in the original blocking and no block contains more than 2k−1 vertices.

Fig. S2 provides pseudocode that describes the algorithm. Informally, it searches among the existing blocks to find a splittable block, i.e., one that contains 2k or more vertices. In such blocks, it finds the two vertices farthest apart and constructs two new blocks based on them. Each of the two vertices picks enough vertices from the original block to fulfill the size requirement, and the remaining vertices are assigned to the vertex that is closest. This is repeated until no block contains 2k or more vertices.

Fig. S2. Greedy threshold blocking algorithm. NNV′[i] denotes the union of i and i's k−1 nearest neighbors in the graph induced on G by vertices V′.

As the blocking consisting of a single block, b={{1,⋯,n}}, is itself a valid threshold blocking, this algorithm can also be used to block the whole sample.
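
A compact R sketch of this splitting procedure follows; it is an illustration of the idea in Fig. S2 under assumed inputs, not the authors' code. Blocks are given as a list of integer vectors of unit indices, and dists is assumed to be a full n-by-n distance matrix.

    # Greedy splitting of blocks with 2k or more vertices (illustrative sketch).
    greedy_split <- function(blocks, dists, k) {
      repeat {
        big <- which(vapply(blocks, length, integer(1)) >= 2 * k)
        if (length(big) == 0) return(blocks)
        b <- blocks[[big[1]]]

        # Find the two vertices in the block that are farthest apart.
        d <- dists[b, b]
        diag(d) <- -Inf
        far <- which(d == max(d), arr.ind = TRUE)[1, ]
        c1 <- b[far[1]]; c2 <- b[far[2]]
        rest <- setdiff(b, c(c1, c2))

        # Each of the two vertices picks its k - 1 nearest vertices in the block.
        p1 <- rest[order(dists[c1, rest])[seq_len(k - 1)]]
        rest <- setdiff(rest, p1)
        p2 <- rest[order(dists[c2, rest])[seq_len(k - 1)]]
        rest <- setdiff(rest, p2)

        # Remaining vertices join whichever of the two vertices is closer.
        to1 <- rest[dists[c1, rest] <= dists[c2, rest]]
        to2 <- setdiff(rest, to1)

        blocks[[big[1]]] <- c(c1, p1, to1)
        blocks <- c(blocks, list(c(c2, p2, to2)))
      }
    }

Both new blocks contain at least k units by construction, so the returned blocking remains a valid threshold blocking.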

Implementation and Hardware

The approximation algorithm is implemented in the R and C++ languages using the CHOLMOD library for operations on sparse matrices (37) and the ANN library, written by David M. Mount and Sunil Arya, for nearest neighbor searching. The source code is publicly available at an online code repository and can otherwise be obtained on request.

The simulations were run on the SAVIO computational cluster at the University of California, Berkeley, using Intel Xeon E5-2670 processors with each core clocked at 2.5 GHz. Each round of the simulations was allocated a single core, largely reflecting the performance of an ordinary desktop computer, and was limited to a maximum of 48 GB of random access memory.

Estimation Methods

After each algorithm derived its blocking, treatment was assigned independently across blocks using block-balanced complete randomization (23). For a block Vx in an experiment with t treatment conditions, ⌊|Vx|/t⌋ units are randomly assigned to each of the treatments. Then |Vx| (mod t) treatment conditions are picked at random and randomly assigned to the units still without treatments. When t divides the size of every block, e.g., when using fixed-sized blockings, this randomization scheme is equivalent to ordinary complete randomization within the blocks.
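
As a concrete illustration (a minimal sketch, not the authors' code), the following R function applies this scheme to a single block; blocks would be randomized independently, e.g., with lapply.

    # Block-balanced complete randomization within one block (illustrative).
    # `block_units` is a vector of unit indices and `t` the number of
    # treatment conditions; returns a condition label for each unit.
    assign_block <- function(block_units, t) {
      m <- length(block_units)
      base   <- rep(seq_len(t), each = m %/% t)    # floor(|V_x| / t) units per condition
      extra  <- sample(seq_len(t), m %% t)         # |V_x| mod t conditions drawn at random
      labels <- c(base, extra)
      setNames(labels[sample.int(m)], block_units) # random assignment of labels to units
    }

When t divides every block size, extra is empty and the scheme reduces to complete randomization within the block.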

For methods that use blocking, the block-size weighted difference-in-means estimator was used (23). This estimator first estimates the treatment effect separately within each block and then derives an estimate for the complete sample by taking an average of the block estimates weighted by block size. When a fixed-sized blocking method is used, this estimator is equivalent to the ordinary difference-in-means estimator.

Let T and C collect all units in the two treatment conditions for which the contrast is of interest. Let β^Vx be the estimated treatment effect within block Vx using the ordinary difference-in-means estimator,

\hat{\beta}_{V_x} = \frac{\sum_{i \in V_x \cap T} y_i}{|V_x \cap T|} - \frac{\sum_{i \in V_x \cap C} y_i}{|V_x \cap C|}. \quad [S8]

The estimated treatment effect for the complete sample, β^, is then given by averaging over all blocks, weighted by their size,

\hat{\beta} = \sum_{V_x \in b} \frac{|V_x|}{n} \, \hat{\beta}_{V_x}. \quad [S9]

The ordinary least squares estimator investigated in addition to the blocking methods adjusts for imbalances in the covariates linearly. That is, the estimator of the contrast between treatments T and C (presuming exactly two treatment conditions) is given by the β^ that solves the following optimization problem:

\operatorname*{arg\,min}_{\hat{\alpha}, \hat{\beta}, \hat{\gamma}_1, \hat{\gamma}_2} \; \sum_{i=1}^{n} \bigl( y_i - \hat{\alpha} - \hat{\beta}\, \mathbf{1}[i \in T] - \hat{\gamma}_1 x_{i1} - \hat{\gamma}_2 x_{i2} \bigr)^2. \quad [S10]
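
A minimal R sketch of the weighted estimator in Eqs. S8 and S9 follows (an illustration under assumed variable names, not the authors' implementation): y is the outcome vector, w the assigned condition, and block the block label of each unit; treat and ctrl name the two conditions being contrasted.

    # Block-size weighted difference-in-means estimator (Eqs. S8 and S9).
    blocked_dim <- function(y, w, block, treat, ctrl) {
      n <- length(y)
      est_one <- function(idx) {                     # Eq. S8 within one block
        mean(y[idx][w[idx] == treat]) - mean(y[idx][w[idx] == ctrl])
      }
      idx_by_block <- split(seq_len(n), block)
      sum(vapply(idx_by_block,                       # Eq. S9: weight by block size
                 function(idx) (length(idx) / n) * est_one(idx),
                 numeric(1)))
    }

The linearly adjusted estimator in Eq. S10 corresponds to the coefficient on the treatment indicator in an ordinary least squares fit such as lm(y ~ treated + x1 + x2), where treated, x1, and x2 are hypothetical column names for the indicator 1[i∈T] and the two covariates.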

Acknowledgments

We thank Peter Aronow, Walter R. Mebane, Jr., Marc Ratkovic, and Yotam Shem-Tov for helpful comments. This research is partially supported by Office of Naval Research Grant N00014-15-1-2367.

Footnotes

  • 1. To whom correspondence should be addressed. Email: sekhon@berkeley.edu.
  • Author contributions: M.J.H., F.S., and J.S.S. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

  • The authors declare no conflict of interest.

  • This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Drawing Causal Inference from Big Data,” held March 26−27, 2015, at the National Academies of Sciences in Washington, DC. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/Big-data.

  • This article is a PNAS Direct Submission.

  • This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1510504113/-/DCSupplemental.

References

  1. Rubin DB (2008) Comment: The design and analysis of gold standard randomized experiments. J Am Stat Assoc 103(484):1350–1353.
  2. Fisher RA (1926) The arrangement of field experiments. J Minist Agric G B 33:503–513.
  3. Lewis R, Rao J (2015) The unfavorable economics of measuring the returns to advertising. Q J Econ 130(4):1941–1973.
  4. Fithian W, Wager S (2015) Semiparametric exponential families for heavy-tailed data. Biometrika 102(2):486–493.
  5. Athey S, Imbens G (2016) Machine learning methods for estimating heterogeneous causal effects. Proc Natl Acad Sci USA 113:7353–7360.
  6. Ashley EA (2015) The precision medicine initiative: A new national effort. JAMA 313(21):2119–2120.
  7. Permutt T (1990) Testing for imbalance of covariates in controlled experiments. Stat Med 9(12):1455–1462.
  8. Simmons JP, Nelson LD, Simonsohn U (2011) False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci 22(11):1359–1366.
  9. Speed TP (1992) Breakthroughs in Statistics, Springer Series in Statistics, eds Kotz S, Johnson NL (Springer, New York), pp 71–81.
  10. Imai K, King G, Nall C (2009) The essential role of pair matching in cluster-randomized experiments, with application to the Mexican universal health insurance evaluation. Stat Sci 24(1):29–53.
  11. Greevy R, Lu B, Silber JH, Rosenbaum P (2004) Optimal multivariate matching before randomization. Biostatistics 5(2):263–275.
  12. Moore RT (2012) Multivariate continuous blocking to improve political science experiments. Polit Anal 20(4):460–479.
  13. Rosenbaum PR (1989) Optimal matching for observational studies. J Am Stat Assoc 84(408):1024–1032.
  14. Sävje F (2015) The performance and efficiency of threshold blocking. arXiv:1506.02824.
  15. Cochran WG (1965) The planning of observational studies of human populations. J R Stat Soc Ser A 128(2):234–266.
  16. Iacus SM, King G, Porro G (2011) Multivariate matching methods that are monotonic imbalance bounding. J Am Stat Assoc 106(493):345–361.
  17. Hochbaum DS, Shmoys DB (1986) A unified approach to approximation algorithms for bottleneck problems. J Assoc Comput Mach 33(3):533–550.
  18. Knuth DE (1998) Sorting and Searching, The Art of Computer Programming (Addison Wesley Longman, Redwood City, CA), 2nd Ed, Vol 3.
  19. Friedman JH, Bentley JL, Finkel RA (1977) An algorithm for finding best matches in logarithmic expected time. ACM Trans Math Softw 3(3):209–226.
  20. Vaidya PM (1989) An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete Comput Geom 4(1):101–115.
  21. Kernighan B, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J 49(2):291–307.
  22. Imbens GW, Rubin DB (2015) Causal Inference for Statistics, Social, and Biomedical Sciences (Cambridge Univ Press, New York).
  23. Higgins M, Sävje F, Sekhon JS (2015) Blocking estimators and inference under the Neyman-Rubin model. arXiv:1510.01103.
  24. National Research Council (2013) Frontiers in Massive Data Analysis (Natl Acad Press, Washington, DC).
  25. Imai K (2008) Variance identification and efficiency analysis in randomized experiments under the matched-pair design. Stat Med 27(24):4857–4873.
  26. Hayes RJ, Moulton LH (2009) Cluster Randomised Trials (CRC Press, London).
  27. Morgan KL, Rubin DB (2012) Rerandomization to improve covariate balance in experiments. Ann Stat 40(2):1263–1282.
  28. Miratrix LW, Sekhon JS, Yu B (2013) Adjusting treatment effect estimates by post-stratification in randomized experiments. J R Stat Soc Series B Stat Methodol 75(2):369–396.
  29. Bloniarz A, Liu H, Zhang C-H, Sekhon JS, Yu B (2016) Lasso adjustments of treatment effect estimates in randomized experiments. Proc Natl Acad Sci USA 113:7383–7390.
  30. Rosenblum M, van der Laan MJ (2010) Simple, efficient estimators of treatment effects in randomized trials using generalized linear models to leverage baseline variables. Int J Biostat 6(1):13.
  31. Rosenblum M, van der Laan MJ (2009) Using regression models to analyze randomized trials: Asymptotically valid hypothesis tests despite incorrectly specified models. Biometrics 65(3):937–945.
  32. Lin W (2013) Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. Ann Appl Stat 7(1):295–318.
  33. Rubin DB (2008) For objective causal inference, design trumps analysis. Ann Appl Stat 2(3):808–840.
  34. Aronow PM, Middleton JA (2013) A class of unbiased estimators of the average treatment effect in randomized experiments. J Causal Inference 1(1):135–154.
  35. Humphreys M, de la Sierra RS, van der Windt P (2013) Fishing, commitment, and communication: A proposal for comprehensive nonbinding research registration. Polit Anal 21(1):1–20.
  36. Kirkpatrick DG, Hell P (1978) On the completeness of a generalized matching problem. Proceedings of the Tenth Annual ACM Symposium on Theory of Computing (Assoc Comput Machinery, New York), pp 240–245.
  37. Chen Y, Davis TA, Hager WW, Rajamanickam S (2008) Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Trans Math Software 35(3):22.