Information decomposition in complex systems via machine learning

Significance A defining characteristic of complex systems is an abundance of variation at one scale of observation that contains, hidden within, information about organization at another scale. To see the forest through the trees is a challenge faced whether studying society or a sandpile, climate, or a brain. We present a fully general and practical methodology, rigorously grounded in information theory, that surfaces important information out of a sea of variation in a comprehensible manner. At its core is the concept of lossy compression: Some information in a measurement is preserved, and the rest is discarded. We use machine learning to lossily compress tens to hundreds of measurements simultaneously, providing a route to insight about complex systems through information decomposition.

One of the fundamental steps toward understanding a complex system is identifying variation at the scale of the system's components that is most relevant to behavior on a macroscopic scale.Mutual information provides a natural means of linking variation across scales of a system due to its independence of functional relationship between observables.However, characterizing the manner in which information is distributed across a set of observables is computationally challenging and generally infeasible beyond a handful of measurements.Here we propose a practical and general methodology that uses machine learning to decompose the information contained in a set of measurements by jointly optimizing a lossy compression of each measurement.Guided by the distributed information bottleneck as a learning objective, the information decomposition identifies the variation in the measurements of the system state most relevant to specified macroscale behavior.We focus our analysis on two paradigmatic complex systems: a Boolean circuit and an amorphous material undergoing plastic deformation.In both examples, the large amount of entropy of the system state is decomposed, bit by bit, in terms of what is most related to macroscale behavior.The identification of meaningful variation in data, with the full generality brought by information theory, is made practical for studying the connection between micro-and macroscale structure in complex systems.
information theory | machine learning for science | complex systems | amorphous plasticity A complex system is a system of interacting components where some sense of order present at the scale of the system is not apparent, or even conceivable, from the observations of single components (1,2).A broad categorization, it includes many systems of relevance to our daily lives, from the economy to the internet and from the human brain to artificial neural networks (3,4).Before attempting a reductionist description of a complex system, one must first identify variation in the system that is most relevant to emergent order at larger scales.The notion of relevance can be formalized with information theory, wherein mutual information serves as a general measure of statistical dependence to connect variation across different scales of system behavior (5,6).Information theory and complexity science have a rich history; information theory commonly forms the foundation of definitions of what it means to be complex (7)(8)(9)(10)(11).
Machine learning is well-suited for the analysis of complex systems, grounded in its natural capacity to identify patterns in high dimensional data (12).However, distilling insight from a successfully trained model is often infeasible due to a characteristic lack of interpretability of machine learning models (13,14).Restricting to simpler classes of models, for example linear combinations of observables, recovers a degree of interpretability at the expense of functional expres-sivity (15).For the study of complex systems, such a trade-off is unacceptable if the complexity of the system is no longer faithfully represented.In this work, we do not attempt to explain the relationship between microscale and macroscale, and are instead interested in identifying the information contained in microscale observables that is most predictive of macroscale behavior-independent of functional relationship.
We employ a recent method from interpretable machine learning that identifies the most relevant information in a set of measurements (16).Based on the distributed information bottleneck (17,18), a variant of the information bottleneck (IB) (19), the method lossily compresses a set of measurements while preserving information about a relevance quantity.Optimization serves to decompose the information present in the measurements, providing a general-purpose method to identify the important variation in composite measurements of complex systems.
Identifying important variation is a powerful means of analysis of complex systems, as we demonstrate on two paradigmatic examples.First we study a Boolean circuit, whose fully-specified joint distribution and intuitive interactions between variables facilitate understanding of the information decomposition found by the distributed IB.Boolean circuits are networks of binary variables that interact through logic functions, serving as the building blocks of computation (20) and as elementary models of gene control networks (21,22).Second, we decompose the information contained in the fine-

Significance Statement
A defining characteristic of complex systems is an abundance of variation at one scale of observation that contains, hidden within, information about organization at another scale.To see the forest through the trees is a challenge faced whether studying society or a sandpile, climate or a brain.We present a fully general and practical methodology, rigorously grounded in information theory, that surfaces important information out of a sea of variation in a comprehensible manner.At its core is the concept of lossy compression: some information in a measurement is preserved and the rest is discarded.We use machine learning to lossily compress tens to hundreds of measurements simultaneously, providing a new route to insight about complex systems through information decomposition.
grained positional configuration of an amorphous material subjected to global deformation.Amorphous materials are condensed matter systems composed of simple elements (e.g., atoms or grains) whose positional disorder gives rise to a host of complex macroscale phenomena, such as collective rearrangement events spanning a wide range of magnitudes (23,24) and nontrivial phase transitions (25)(26)(27)(28).Although the state space that describes all of the degrees of freedom is large, as is generally true of complex systems, the proposed method is able to identify the most important bits of variation by leveraging machine learning to navigate the high-dimensional space of possible lossy compression schemes.

Methods
Mutual information is a measure of statistical dependence between two random variables X and Y that is independent of the functional transformation that relates X and Y (in contrast to linear correlation, for example, which measures the degree to which two variables are linearly related).Mutual information is defined as the entropy reduction in one variable after observing the value of the other variable (29), with H(X) = E x∼p(x) [−log p(x)] Shannon's entropy (30).
The information bottleneck (IB) ( 19) is an optimization objective to extract information contained in a variable X that is most relevant to another variable Y .Under the IB, X is lossily compressed to an auxiliary random variable U = f (X) that simultaneously maximizes I(U ; Y ) and minimizes I(U ; X).Consider the case where X is a composite measurement-e.g., multiple degrees of freedom of a complex system-more appropriately denoted as a random vector X = (X1, ..., XN ).By optimizing the IB, a lossy compression U of X can be found that extracts information relevant to Y , but without any indication from which components the information originated.If, instead, an information bottleneck is installed on each of the components, each Xi undergoes lossy compression to a corresponding Ui, and then the elements of the vector U = (U1, ..., UN ) contain the relevant information extracted from each component.Known as the distributed IB (17,31), configuring information bottlenecks in this way enables a detailed analysis of the structure of information with respect to the components (16).Minimization of the distributed IB Lagrangian, [2] extracts the entropy in X that is most descriptive of Y .By sweeping over the magnitude of the bottleneck strength β, a continuous spectrum of approximations to the relationship between X and Y is found, each utilizing a different amount of information from X.The product of optimization is thus a sequence of distributed compression schemes, U (β), more appropriately parameterized by the total utilized information N i=1 I(Ui; Xi).The central focus of this work is to connect the optimized U to the structure of information contained in X for insight about the complex system under study.
In place of Eqn. 2, variational bounds on mutual information have been developed that are amenable to data and machine learning (18,32).The lossy compression schemes are parameterized by neural networks that encode data points to probability distributions in a continuous latent space.Stochasticity in the mapping between Xi and Ui allows the transmitted information to be smoothly varied and consequently facilitates optimization.The amount of transmitted information is upper bounded by the expectation of the Kullback-Leibler divergence (29) between the encoded distributions and an arbitrary prior distribution, identical to the process of information restriction in a variational autoencoder (32,33).While training utilizes the aforementioned upper bound, for evaluation over the course of an optimization we desire more accurate estimates of the information I(Ui; Xi) extracted from each component.We estimated upper and lower bounds derived in Ref. (34) to obtain precise estimates of each mutual information term, where the interval between bounds was on the order of 0.01 bits.Although mutual information is generally difficult to estimate from data (35), compressing the partial measurements Xi separately segregates the information such that the amount of mutual information is small enough to allow precise estimates.Characterization of the mutual information estimation may be found in the SI.

Results
Boolean circuit.A randomly generated Boolean circuit with ten binary inputs X = (X1, ..., X10) and a binary output Y is shown in Fig. 1a.Assuming a uniform distribution over inputs, the truth table specifies the joint distribution p(x1, ..., x10, y), and the interactions between inputs are prescribed by a wiring of logical AND, OR, and XOR gates.An information bottleneck was distributed to every input Xi to monitor from where the predictive information originated via compressed variables Ui (Fig. 1b).We trained a multilayer perceptron (MLP) to learn the relationship between the lossy compressions U and Y .Over the course of a training run, the coefficient of the information bottleneck strength β was swept to obtain a spectrum of compression schemes and predictive models.The distributed information plane (Fig. 1c) (16) displays the predictive power as a function of the total information about the inputs I(Ui; Xi).The predictive performance ranged from zero predictive information without any information about the inputs (Fig. 1c, lower left) to all entropy H(Y ) accounted for by utilizing all ten bits of input information (Fig. 1c, upper right).For every point on the spectrum there was an allocation of information over the inputs; the distributed IB objective identified the information across all inputs that was most predictive.The most predictive information about Y was found to reside in X3-the input that routes through the fewest gates to Y -and then in the pair X3, X10, and so on.Inputs that perform identical roles in the circuit were compressed nearly identically, as seen in the pairs X1 and X2, X4 and X5, and X8 and X9.
Machine learning facilitated the traversal of the space of lossy compression schemes of Xi, decomposing the information held within the circuit inputs about the output.Contained within the space of compression schemes are 2 10 schemes that transmit complete information about discrete subsets of the inputs.To be concrete, there are ten subsets of a single input corresponding to compression schemes that transmit one bit of information about the input gates, 45 pairs of inputs that transmit two bits, and so on, with each subset sharing mutual information with Y based on the role of the specific inputs inside the circuit.Fig. 1d displays the information contained in every discrete subset of inputs (black points) along with the continuous trajectory found by optimization of the distributed IB (gray curve).The distributed IB, maximizing predictive information while minimizing information taken from the inputs, closely traced the upper boundary of discrete information allocations and identified a majority of the most informative subsets of inputs.To decompose the information in the circuit's inputs required only a single sweep with the distributed IB, not an exhaustive search through all subsets of inputs.We note that the product of the distributed IB is not an ordering of single variable mutual information terms I(Xi; Y ), which would be straightforward to calculate, but instead the ordering of information selected from all of X that is maximally informative about Y .Analysis of additional Boolean circuits with the distributed IB as well as comparisons to feature importance derived from logistic regression and Shapley values (15,36,37) may be found in the SI.Decomposing structural information in a physical system.Linking structure and dynamics in amorphous materialscomplex systems consisting of particles crowded together in a disordered configuration-has been a longstanding challenge in physics (38)(39)(40).In particular, the deformation of amorphous materials generally proceeds in fits and starts, punctuated by sudden, localized rearrangement events reminiscent of earthquakes along a tectonic fault (24,41,42).Where, in the disordered configuration of the particles, is the information about whether a sudden rearrangement is about to occur?Searching for signatures of collective behavior in the multitude of microscopic degrees of freedom is an endeavor emblematic of the study of complex systems more generally and one well-suited for machine learning and information theory.We accept that the functional relationship between the microand macroscale variation is potentially incomprehensible, and are instead interested in the information at the microscale that is maximally predictive of behavior at the macroscale.While prior work has analyzed the information content of hand-crafted structural descriptors individually (43)(44)(45), the distributed IB searches through the space of information from many structural measurements in combination.
Two-dimensional simulated glasses, prepared by either rapid or gradual quenching and composed of small (type A) and large (type B) particles that interact with a Lennard-Jones potential, were subjected to global shear deformation (42).For each event, the location that precipitated the rearrangement was identified and paired with negative samples from elsewhere in the system to create a binary classification dataset.
We first characterized the microscale structure with a scheme of measurements that has been associated with plastic rearrangement in a variety of amorphous systems: the densities of radial bands around the center of a region (46,47).By training a support vector machine (SVM) to predict rearrangement based on the radial density measurements, a linear combination of the values is learned.In the literature, that combination is commonly referred to as softness, and has proven to be a useful local order parameter (48)(49)(50)(51).
We approached the same prediction task from an information theoretic perspective, seeking the specific bits of variation in the density measurements that are most predictive of collective rearrangement.Each radial density measurement underwent lossy compression by its own neural network before all compressions were concatenated and used as input to an MLP to predict rearrangement.By sweeping β, a single optimization recovered a sequence of distributed compression schemes, each allocating a limited amount of information across the 100 density measurements to be most predictive of imminent rearrangement (Fig. 2).
The trajectories in the distributed information plane (Fig. 2a,c), for both gradually and rapidly quenched glasses, reflect the growth of predictive information and prediction accuracy given maximally predictive information about the radial densities.With only one bit of information from the density measurements, 71.8% predictive accuracy was achieved for the gradually quenched glass and 69.5% was achieved for the rapidly quenched glass; with twenty bits, the accuracy jumped to 91.3% and 85.4%, respectively.Beyond twenty bits of density information, the predictive accuracy became comparable to that of the support vector machine, which can utilize all of the continuous-valued density measurements for prediction with a linear relationship.
For every point along the trajectory, information was identified from the density measurements that, together, formed the combination of bits that were most predictive of rearrangement.The allocation of information across the one hundred radial density measurements would be unintelligible if displayed in the distributed information plane as in Fig. 1c; instead we employ heatmaps where each row corresponds to a different density measurement (Fig. 2b,d).The majority of the information was selected from smaller radii (close to the center of the region), which can be expected given the localized nature of rearrangement events (38,39).Less intuitive is the information decomposition as it relates to the radial distribution functions g AA (r) and g AB (r), the system-averaged radial densities of type A and B particles in regions with a type A particle at the center.For both glasses, the most predictive bits originated in the low density radial bands nearest the center.As more information was incorporated into the prediction (moving left to right in the heatmaps), the additional bits came from radial bands that corresponded to particular features of g AA (r) and g AB (r).Outside of the first low density region, the selected information often came from the high density radii of type A particles and the low density radii of type B particles; this trend held true for both glasses.While the information decomposition highlighted similar features in both glasses, the more pronounced structure of selected information out to larger radii for the gradually quenched glass is indicative of its higher structural regularity, which is also seen in the pronounced features of its radial distribution functions g AA (r) and g AB (r).For comparison, the weights of the trained SVM and estimated Shapley values may be found in the SI.
The amount of information extracted from each density measurement was predominantly a single bit or less.In contrast to the Boolean variables in Fig. 1c, where I(Ui; Xi) completely characterized the compression scheme, a continuousvalued density can be compressed to a bit in any number of ways.What was the specific variation extracted from each density measurement?Through inspection of the learned compression schemes, the extracted information can be further decomposed by the degree of distinctions between measurement values that were conveyed to the predictive model (Fig. 3a) (16).Serving to visualize a compression scheme, a distinguishability matrix is a distance matrix between val-ues xα and x β where a value of 0 indicates the states are indistinguishable, a value of 1 means that they are fully distinguishable, and an intermediate value means they are partially distinguishable.We found that the single most important bit of information for the gradually quenched glass (Fig. 3b) was a composition of partial bits from multiple density measurements, mostly from the first low-density shell of each type of particle.For both measurements, the compression scheme acted as a threshold on the range of possible density values: values less than a cutoff ρ ′ were indistinguishable from each other for the purposes of prediction and were partially distinguishable from density values above the cutoff.By examining the distribution of density values in these radial shells, we see that the cutoff values leverage the separability of the density distributions when conditioned on rearrangement.
With more information utilized for prediction, some of the compression schemes differed from simple thresholds (shown for the rapidly quenched glass in Fig. 3c).For the predictive model operating with a total of twenty bits of density information, two density measurements contributed more than a bit each.The learned compression of the first high-density shell of type A particles essentially counted the number of particles in the shell, with distinguishability between densities as if there were several thresholds over the range of the values that act to roughly discretize the density measurement.
Decomposing the information in a composite X depends upon its basis of measurements (52).In the study of complex systems, there can be multiple 'natural' schemes of measuring a system state.Measurements of the densities of radial bands lead to an essentially linear relationship between structure and rearrangement (48); what if we had not inherited such a fortuitous measurement scheme?Another natural basis of measurements is the position of all of the particles (Fig. 4a).To be clear, instead of representing a system's local configuration in terms of one hundred measurements of radial density, we could have directly used the N measurements of the position and type of the N particles in a local region.In the preceding analysis, each radial density was considered a distinct information source and a unique compression channel was learned for each.In the per-particle measurement basis, there is no straightforward means to divide the measurements into separate sources because positions exist along a continuum; accordingly, a single compression channel was used for each type, compressing in parallel every particle of type A and then similarly for type B. To subsequently predict rearrangement based on the set of compressed per-particle measurements, which lack a canonical ordering, we used a permutation-invariant transformer architecture.(53).
The per-particle measurement scheme imposed no positional structure on the selection of configurational information.Nevertheless, we found that the information cost per particle as a function of the position in the neighborhood had a radial structure (Fig. 4b).The information per particle was highest in the low density radial bands near the center of the region (Fig. 4c), and inspection of the compression scheme indicated that negligible azimuthal information was transmitted (Fig. 4d).The information decomposition allowed for similar insights to be derived as in the radial density measurement scheme, even though the nature of the predictive model in the two cases was substantially different.Additionally, because the distributed IB operates entirely on the input side of an arbitrary predictive model, the information analysis was agnostic to whether the model was a simple fully connected network or a more complicated transformer architecture.

Discussion
A universal challenge faced when studying complex systems, fundamental to what makes a system complex, is the abundance of entropy from the perspective of the microscale that obscures relevant information about macroscale behavior.The generality of mutual information as a measure of statistical relatedness, and the capacity of deep learning to handle highdimensional data, allow the distributed IB to be as readily utilized to identify structural defects relevant to a given material property as it is to reveal gene variation relevant to a given affliction.Tens, hundreds, and potentially thousands of measurements of a complex system are handled simultaneously, rendering practical analyses that would have previously been infeasible through exhaustive search or severely limited by constraints on functional relationships between variables.
Information theory has long held appeal for the analysis of complex systems owing to the generality of mutual information (1,29).However, the estimation of mutual information from data is fraught with difficulties (34,35,54), and the rapid growth in ways information can be distributed across a number of variables (55) have hindered information theoretic analyses of data from complex systems.By distributing information bottlenecks across multiple partial measurements of a complex system, entropy is partitioned to a degree that makes precise estimation of mutual information possible while simultaneously revealing the most important combinations of bits for insight about the system.Machine learning navigates the space of lossy compression schemes for each variable and allows the identification of meaningful variation without consideration of the black box functional relationship found by the predictive model.
Instead of compressing partial measurements in parallel, the information bottleneck (19) extracts the relevant information from one random variable in its entirety about another, and is foundational to many works in representation learning (56,57).In the physical sciences, the IB has been used to extract relevant degrees of freedom with a theoretical equivalence to coarse-graining in the renormalization group (58,59), and to identify useful reaction coordinates in biomolecular reactions (60).However, the IB generally has limited interpretability because the singular bottleneck occurs after processing the complete input, allowing the compression scheme to involve arbitrarily complex relationships between components of the input without penalty.Additionally, when the relationship between X and Y is deterministic (or nearly so), the entire spectrum of extracted information is trivially a copy of Y with added noise (52,61,62).The distribution of information bottlenecks across observables is critical to an interpretable information decomposition and to recovering an illuminating spectrum of extracted information.
A growing body of literature focuses on a fundamentally different route to decompose the information contained in multiple random variables {Xi} about a relevant random variable Y ; that alternative route is partial information decomposition (PID) (55,63).Although there is no consensus on how to achieve PID in practice, its goal is to account for the mutual information between {Xi} and Y in terms of subsets of {Xi}, by analogy to set theory (64).PID allocates information to the input variables in their entirety, whereas the distributed IB selects partial entropy from the input variables in the form of lossy compression schemes, with one scheme per variable.While PID has been proposed as an information theoretic route to study complex systems (65) and quantify complexity (66), the super-exponential growth of PID terms renders the methodology somewhat impractical unless "coarse-grained" metrics are used (10,67).There are 5 × 10 22 PID terms for a Boolean circuit with 8 inputs (55) and the number of terms for the 10-input circuit from Fig. 1 is not known (63).By contrast, the distributed IB offers a pragmatic route to the decomposition of information in a complex system: it is amenable to machine learning and data, and can readily process one hundred (continuous) input variables as in the amorphous plasticity experiments.Although the distributed IB navigates a different space of information terms than those induced by PID, it may prove fruitful to employ the former as a means to focus on the most significant terms of the latter.Data, Materials, and Software Availability.The glass dataset is attached to this paper and has also been deposited to Figshare (68).The code base, including code to generate truth tables from Boolean circuits, has been posted to Github and may be found through the following link: distributedinformation-bottleneck.github.io.

Boolean circuitry: Comparative analyses and extended examples
To further develop intuition about the manner of information decomposition achieved by the distributed IB, we analyze several additional Boolean circuits in this section.As a reminder, the distributed IB ingests a dataset of input-output observations and yields a decomposition of the information contained in the inputs about the output.The decomposition offers a degree of interpretability about the relationships between the inputs so far as they determine the output.The ground truth circuitry in these examples is used to create the data but then does not factor into the information decomposition; it is displayed only as a point of reference.
First, we randomly generated Boolean circuits with three to six input gates (Fig. S1).To serve as reference analyses, we analyzed data from the circuits with a linear and a nonlinear method that each provide a sense of the importance of the inputs Xi in determining the output Y .We performed logistic regression and computed the Shapley values in relation to the mutual information between {Xi} and Y (15, 36, 37) (Fig. S1, left column).We then optimized the distributed IB following the training protocol of Fig. 1 (Fig. S1, middle column).We again compared the distributed IB decomposition to the exhaustive set of mutual information terms, where there is one for each possible subset of inputs (Fig. S1, right column).By contrast to Fig. 1 of the main text, the circuits are small enough to allow visualization of the specific input combination (shown as pie charts) represented by each point in the plot.
In logistic regression, a linear combination of the inputs is fit in log-probability space to be maximally predictive of the label Y , and this linearity grants interpretability to the model weights.However, the linearity also severely restricts the relationships that can be modeled.In the six circuits in Fig. S1, the nonlinearity of XOR gates diminished predictive power of the logistic models: the amount of information captured by the model, I(f (X); Y ), was less than 0.2 bits for four of the circuits.In the best case (the circuit in panel (e), with only one out of four gates being XOR), the model captured 0.75 bits of information about Y and had an accuracy of 94%.Despite the inability to represent the nonlinearity of XOR, there was some degree of success in shedding light on the inner workings of the circuit: the model coefficients revealed one or a few of the most informative inputs in circuits (b), (c), (d), and (e).
Shapley values (15) have become a popular approach to explaining machine learning models because they convey a sense of importance of inputs (or features) with regard to an output while being agnostic to the model used.Originating in game theory as a way to assign credit to a set of players in a game, the most common formulation in explainable ML (36) grants local interpretability (15) to a trained model; i.e., it provides case-by-case explanations of model outputs for individual examples from the data.By contrast, the distributed IB provides global interpretability (16), meaning the insights pertain to the entire data distribution (and the entire system) at once.Instead of assigning credit for a single model prediction, Shapley values can be computed with regard to the information gain of a model in order to provide global interpretability (37).We use the truth table for each circuit as a perfect model, though we could also have trained a neural network and estimated Shapley values following Covert et al. (37).The Shapley value for each input is a weighted sum of all possible mutual information terms (the same terms displayed in the right column of Fig. S1), serving to summarize the contribution of information for each input in the context of every combination of other inputs.A desirable property of Shapley values is that they are additive: for the deterministic relationships represented by these circuits, the Shapley values for all inputs sum to the entropy H(Y ) of the output.
The Shapley values relative to information gain conveyed more about the underlying circuitry than the logistic model coefficients, which is sensible given that the "model" used for the Shapley values is the full truth table.We found that the values conveyed the same high-level feature importance as the distributed IB, successfully accounting for higher-order interaction effects between the gates because the information in all subsets of inputs is included in the calculation.
In contrast to logistic regression and Shapley values, which produce a single value of importance per input, the distributed IB produces an entire spectrum of importance values in the form of information allocations across the inputs.The spectrum represents a Pareto frontier, where every point in the spectrum contains the most information about the output Y for the least total information about the inputs Xi.The allocations order the information contained in the inputs from most to least relevant about the output Y .By reading the distributed information plane from left to right (middle column of Fig. S1), we infer the importance of the inputs by the order in which they appear in the information allocations.The rank order of the importance of the inputs matches that of the Shapley values in every circuit.What else can be learned from the distributed IB by way of the spectrum of information allocations?
Consider the three-input circuits of Fig. S1 (a) and (b).Given only the Shapley values for the three inputs in panel (a) we infer that the inputs are equally responsible for determining the output, but little else is learned.With the information decomposition from the distributed IB, we learn far more about the circuit.The identical trajectories of the inputs in the distributed information plane suggests the inputs perform equivalent roles.The slow growth of the predictive information I(U ; Y ) suggests a highly entangled interaction between the inputs whereby substantial information about all three is required before any information is gleaned about Y .By comparison, the growth of predictive information for the circuit in panel (b), a b Supplemental Material, Figure S2.The space of compression schemes navigated by the distributed and standard IB.(a) For a single XOR gate, the two inputs are compressed independently according to the distributed IB, in terms of relevance to the output Y .We randomly sample 2500 compression schemes (gray points) and additionally optimize the variational formulation of the distributed IB from the main text (blue, dashed).(b) With the same XOR gate as in a, the full input X is compressed to a single compression variable U .We randomly sample 15000 compression schemes (gray points) and additionally optimize the variational IB (blue, dashed).The space of distributed compression schemes in a is contained within the space of compression schemes of b.Because the components of X are independent in this example, the horizontal axes of a and b are equivalent.
which increases immediately with information about X3, tells us that information about Y is available with information about only the third input.
Consider again the three curves displaying the information allocation to the inputs in the distributed information plane of Fig. S1b.One (corresponding to X3) is concave while the other two (X1 and X2) are convex and mirror each other; the curves tell about the growth of information of the inputs in combination.Similar behavior can be seen in Fig. S1c with X4, X1 and X3; inspection of the circuits shows there is an AND-XOR subcircuit common to the circuits in both panels (b) and (c).By chance (because the circuits were randomly sampled) the replication can be taken one step further.The entire information allocation of the circuit in Fig. S1c is found in the circuit in Fig. S1e, and again the circuits can be observed to share a subcircuit.The information allocations and the growth of predictive information tell much more about the interactions between inputs than can be conveyed in a single list of importance values.
In the right column of Fig. S1, we display the information about Y contained in all possible subsets of inputs, allowing a check on the quality of information allocations recovered by the distributed IB.As in the main text, we again found that the distributed IB was able to identify the most informative subsets without requiring exhaustive search.Notably, for a circuit of only XOR gates (Fig. S1a), the distributed IB outperformed the discrete allocations by leveraging partial information allocations.With only discrete information allocations across the inputs, there is a discontinuous transition from zero information about Y with knowledge of any two input gates, to complete information about Y when knowing all three.In stark contrast, by exploring the larger space of soft compression schemes (in which partial information can be transmitted), machine learning found a smooth interpolation between zero and complete information about Y .
Comparison of the space of compression schemes relevant to the distributed and standard IB.In Fig. S2 we analyzed a single XOR logic gate with both the distributed IB and the standard IB.For the distributed IB (Fig. S2a), each of the inputs X1 and X2 was compressed to its own variable, U1 and U2, defined by the conditional distributions p(u1|x1) and p(u2|x2).We sampled from the space of distributed compression schemes by randomly sampling conditional distributions over a two-symbol alphabet for each Ui.2500 distributed compression schemes are plotted as gray dots in the distributed information plane in Fig. S2a, along with the trajectory navigated by the distributed IB.The convex shape of the distributed IB trajectory closely traced the boundary of distributed compression schemes.
By contrast to the distributed compression schemes, the standard IB compresses the entire input X to a single variable U and optimizes the IB Lagrangian (19) L IB = βI(U ; X) − I(U ; Y ).[3] We sampled from the space of compression schemes by randomly sampling conditional distributions over a four-symbol alphabet for U (Fig. S2b).In total, 15000 compression schemes were created by sampling the sixteen values of the relevant conditional distributions from a uniform distribution and raising to the sixth power before normalizing.We found that sampling probability values uniformly (i.e., without exponentiation) rarely sampled compression schemes with I(U ; X) ≳ 1 bit.With the same variational bounds we used for the distributed IB (32), we optimized the standard IB and found that it successfully traced the boundary of relevant compression schemes.The space of distributed compression schemes is more restrictive than the space of compression schemes relevant to the standard IB.Because the components of X are independent for this example, the two horizontal axes of Fig. S2a,b are equivalent, allowing the point scatter of compression schemes to be directly compared.While the distributed schemes in b a Supplemental Material, Figure S3.Comparative analyses of radial density measurement scheme.(a) For the simulated glass prepared with a gradual quench, we reproduce the information allocation from the main text and compare to the weights of a support vector machine (SVM) and to the estimated Shapley values for the model's information gain.The absolute value of the SVM weights are shown, with red (blue) to indicate that the weight is positive (negative).The error bars on the Shapley values are estimated during sampling according to SAGE (37).The radial densities for type A particles are shown on the left and type B on the right, though we note that both sets of densities were used in combination as input to the models.The radial distribution functions g AA (r) and g AB (r), showing system-averaged radial densities, are the black curves and utilize the right vertical axes.(b) The same as a, for the rapidly quenched glass.
Fig. S2a are included in the space of schemes in Fig. S2b, only the latter includes schemes that utilize knowledge of the full state of X.As an example, one scheme that optimizes the standard IB maps X ∈ {01, 10} and X ∈ {00, 11} to two separate clusters, effectively extracting Y without revealing anything about the structure of X-a shortcoming of the standard IB when analyzing deterministic relationships (61).

Glassy rearrangement: Comparative analyses of radial density measurements
For the simple Boolean circuits of Fig. S1, all three analyses-logistic regression, Shapley values, and the distributed IBconveyed a sense of importance of Xi in determining Y .The unique capabilities of the distributed IB become more apparent under more challenging scenarios, such as relating rearrangement in a simulated glass to local radial density measurements.In Fig. S3, we reproduce from Fig. 2 of the main text the distributed IB information decomposition and compare it to the weights of a support vector machine (SVM) as well as the Shapley values estimated for an MLP * (following a method named SAGE (37) † ).There is general agreement about the most important radial bands as told by the order of information allocation by the distributed IB, and the magnitude of weights derived from the (linear) SVM and the (nonlinear) Shapley values.The distributed IB reveals far more, however: all of the most informative subsets of radial bands and the specific bits of information from each radial density that are relevant to rearrangement (Fig. 3 of the main text).To find the most informative subsets of radial bands, the authors of ( 48) trained millions of SVMs, each on different subsets of the radial densities-quickly making exhaustive evaluation impractical beyond a handful of measurements.
Regarding the specific bits of relevant information (visualized by the distinguishability matrices of Fig. 3 in the main text), we are aware of no method other than the distributed IB that can do so while modeling nonlinear relationships with the full expressivity of deep learning.Finally, we note that the per-particle measurement basis (Fig. 4 of the main text) similarly has no apparent analogue in terms of Shapley values.With the distributed IB we are able to inspect the compression channel and study the information allocated to hypothetical particles, even without a well-defined set of features as would be required for analysis by methods that assess feature importance.

Mutual information bounds
Bounding mutual information given high-dimensional data is notoriously difficult (35,54).Fortunately, there are factors in our favor to facilitate optimization with machine learning and, during evaluation, the recovery of tight bounds on the information transmitted by the compression channels Ui.
During training, to optimize the distributed information bottleneck objective requires a lower bound on I(U ; Y ) and an upper bound on I(Ui; Xi).For I(U ; Y ) = H(Y ) − H(Y |U ), we use the cross entropy loss of the predictions as a lower bound on H(Y |U ) and ignore H(Y ) because it is constant (32).Regarding I(Ui; Xi), the (distributed) variational information bottleneck objective (18,32) upper bounds I(Ui; Xi) with the expectation of the Kullback-Leibler (KL) divergence between the encoded distributions p(ui|xi) and an arbitrary prior distribution r(ui) in latent space, I(Ui; Xi) ≤ E x i ∼p(x i ) [D KL (p(ui|xi)||r(ui))].[4] Normal distributions are used for both the encoded distribution, p(ui|xi) = N (µ = fµ(xi), σ = fσ(xi)), and the prior, r(ui) = N (0, 1) so that the KL divergence has a simple analytic form.
For evaluation over the course of a training run, the KL divergence is computed for each channel Ui in the process of computing the loss, and could be used for a qualitative sense of information allocation to features (16).However, the KL divergence is a rather poor estimate of the mutual information, and we seek to know the amount of transmitted information as precisely as possible.Because the encoded distributions p(ui|xi) have a known form, we can use the noise contrastive estimation (InfoNCE) lower bound and "leave one out" upper bound from Ref. (34) with a large number of samples to obtain tight bounds on the amount of mutual information in the learned compression schemes (for evaluation).
The lower and upper bounds on I(Ui; Xi) are based on likelihood ratios at points sampled from the dataset xi ∼ p(xi) and from the corresponding conditional distributions, ui ∼ p(ui|xi).To be specific, the mutual information for each channel U = f (X) (dropping channel indices for simplicity) is lower bounded by 1 K K j p(ui|xj) [5] and upper bounded by . [6] The expectation values in both equations are taken over samples {ui, xi} K i=1 of size K extracted repeatedly from the joint distribution p(u, x) = p(x)p(u|x).We estimated with as large an evaluation batch size K as feasible given memory and time considerations, and then averaged over multiple batches to reduce the variance of the bound.Evaluation with a batch size of 1024, averaged over 8 draws, yielded bounds on the mutual information that was on the order of 0.01 bits for the Boolean circuit and glass data.The size of the validation dataset for the glass and the size of the truth table of the Boolean circuit were both on the order of one thousand points.Hence, the benefit of averaging comes from repeated sampling of the latent representations.
We show in Fig. S4 the performance of the mutual information bounds for compression schemes that encode up to several bits of information.X is a discrete random variable that is uniformly distributed over its support and has one to six bits of entropy; for each X a fixed dataset of size 1024 was sampled for mutual information estimation according to the following method of compression.Each outcome x was encoded to a normal distribution with unit variance in 32-dimensional space, p(u|x) = N (µ, 1).The encoded distributions were placed along orthogonal axes a distance d from the origin; in the limits of d = 0 and d ≫ 1 the information transmitted by the compression scheme is 0 and H(X), respectively.
A Monte Carlo estimate of the mutual information sampled 2×10 5 points from p(u, x) to compute E p(u,x) [log p(u|x)/p(u)].The "leave one out" upper and InfoNCE lower bounds were computed with different evaluation batch sizes K, and averaged over 4096 sampled batches.The standard deviation of the bounds is displayed as the shaded region around each trace, and is left out of the plots for the residual (the difference between the bound and the Monte Carlo estimate) for all but the evaluation batch size of 1024.
When the information contained in the compression is less than about two bits-as was the case for the majority of the experiments of the main text-the bounds are tight in expectation for even the smallest evaluation batch size.The variance is reducible by averaging over multiple batches.As the transmitted information grows, the benefit of increasing the evaluation batch size grows more pronounced, though bounds with a range of less than 0.1 bits can still be achieved for up to six bits of transmitted information.
Information transmitted per particle.For the per-particle measurement scheme on the glass data, a single compression channel U was used for all particles.The information conveyed by the channel I(U ; X) may be estimated as above, with X being the particle position and type.Note that we are particularly interested in the information cost for specific particle positions and for each particle type.The outer summation of the bounds (Eqn.5 and 6) serves to average over the measurement outcomes xi in a random sample; we use the summand corresponding to {xi, ui} as the information contribution for the specific outcome xi.To generate the information heatmaps of Fig. 4b in the main text, we randomly sampled 512 neighborhoods from the dataset, corresponding to an evaluation batch size K = 512 neighborhoods × 50 particles / neighborhood = 25, 600 particles (data points), and averaged over 100 such batches.A probe particle with specified particle type and position (one for each point in the grid) was inserted into the batch, and then the corresponding summand for the lower and upper information bounds served to quantify the information transmitted per particle.To be specific, , [7] with the expectation taken over u ∼ p(u|x) and samples {xi} K i=1 ∼ K i p(x).The upper bound differed only by inclusion of the distribution p(u|x) corresponding to the probe point in the denominator's sum.

Implementation specifics
All experiments were implemented in TensorFlow and run on a single computer with a 12 GB GeForce RTX 3060 GPU.Computing mutual information bounds repeatedly throughout an optimization run contributed the most to running time.Including the information estimation, the Boolean circuit optimization took about half an hour, and the glass experiments took several hours.
Boolean circuit.Each input may take only one of two values (0 or 1), allowing the encoders to be extremely simple.Trainable scalars (⃗ µi, log ⃗ σ 2 i ) were used to encode p(ui|xi) = N ((2xi − 1) × ⃗ µi, ⃗ σ 2 i ).The decoder was a multilayer perceptron (MLP) consisting of three fully connected layers with 256 Leaky ReLU units (α = 0.3) each.We increased the value of β logarithmically from 5 × 10 −4 to 5 in 5 × 10 4 steps, with a batch size of 512 input-output pairs sampled randomly from the entire 1024-element truth table.We found the information decomposition to be insensitive to annealing duration as long as it was around 5 × 10 4 steps or longer.The Adam optimizer was used with a learning rate of 10 −3 .Amorphous plasticity.The simulated glass data comes from Ref. (42): 10,000 particles in a two-dimensional cell with Lees-Edwards boundary conditions interact via a Lennard-Jones potential, slightly modified to be twice differentiable (69).Simple shear was applied with energy minimization after each step of applied strain.The critical mode was identified as the eigenvectorexisting in the 2N -dimensional configuration space of all the particles' positions-of the Hessian whose eigenvalue crossed zero at the onset of global shear stress decrease.The particle that was identified as the locus of the rearrangement event had the largest contribution to the critical mode (42).
We used data from the gradual quench ("GQ") and rapid quench (high temperature liquid, "HTL") protocols.Following Ref. (48), we considered only neighborhoods with type A particles (the smaller particles) at the center.We used all of the events in the dataset: 7,255 for the gradually quenched and 10,178 for the rapidly quenched glasses.For each rearrangement event with a type A particle as the locus, we selected at random another region from the same system state with a type A particle at the center to serve as a negative example.90% of all rearrangement events with type A particles as the locus were used for the training set and the remaining 10% were used as the validation set; the regions and specific training and validation splits used in this work can be found on the project webpage.
Radial density measurement scheme.For the radial density measurements (Figs.2,3 of the main text), the local neighborhood of each sample was processed using 50 radial density structure functions for each particle type, evenly spaced over the interval r = [0.5, 4].Specifically, for particle i at the center and the set of neighboring particles SA of type A, ), [8] where Rij is the distance between particles i and j.The same expression was used to compute GB, the structure functions for the type B particles in the local neighborhood.The width parameter δ was equal to 50% of each radius interval.
After computing the 100 values summarizing each local neighborhood, the training and validation sets were normalized with the mean and standard deviation of each structure function across the training set.The best validation results from a logarithmic scan over twenty values from 10 −3 to 10 1 for the C parameter were used for the value of the SVM accuracy in Fig. 2 of the main text.
For the distributed IB, each of the 100 scalar values for the structure functions were input to their own MLP consisting of 2 layers of 128 units with tanh activation.The embedding dimension of each Ui was 32.Then the 100 embeddings were concatenated for input to the predictive model, which was an MLP consisting of 3 layers of 256 units with tanh activation.The output was a single logit to classify whether the particle at the center is the locus of imminent rearrangement.We increased β in equally spaced logarithmic steps from 10 −6 to 1 over 250 epochs (an epoch is one pass through the training data); these hyperparameters were selected based on peak validation performance.The batch size was 256.The Adam optimizer was used with a learning rate of 10 −4 .
Per-particle measurement scheme.For the per-particle measurements, the nearest 50 particles to the center of each region were compressed by the same encoder, an MLP with two layers of 128 Leaky ReLU activation (α=0.1), to a 32-dimensional latent space.The only information available to the encoder was the particle's position and type, though the values were preprocessed before input to the encoder to help with optimization: for each particle position ⃗ r = (x, y), we concatenated x 2 , y 2 , r = |⃗ r|, log r, log x 2 , log y 2 , and ⃗ r/r.All were positionally encoded (i.e., before being passed to the MLP, inputs were mapped to x ← (x, sin ω1x, sin ω2x, ...)) with frequencies ω k = 2 k , where k ∈ {1, 2, 3, 4, 5} (16,70).
After compression, the 50 representations (one for each particle) were input to a set transformer (53), a permutation-invariant architecture that is free to learn how to relate different particles via self-attention.We used 6 multi-head attention (MHA) blocks with 12 heads each, and a key dimension of 128.Following Ref. (53), each MHA block adds the output of multi-head attention to a skip connection of the block's input, and applies layer normalization to the sum.This intermediate output is passed through an MLP (a single layer with 128 ReLU units, in our case) and added to itself (another skip connection) before a second round of layer normalization.After the MHA blocks, the 50 particle representations were mean-pooled and passed through a final fully connected layer of 256 units with Leaky ReLU activation (α=0.1)before outputting a logit for prediction.
Training proceeded for 25,000 training steps, and the learning rate was ramped linearly from zero to 10 −4 over the first 10% of training.Over the duration of training, β increased logarithmically from 3 × 10 −8 to 3 × 10 −3 .The batch size was 64.

Citation Diversity Statement
Science is a human endeavour and consequently vulnerable to many forms of bias; the responsible scientist identifies and mitigates such bias wherever possible.Meta-analyses of research in multiple fields have measured significant bias in research works are cited, to the detriment of scholars in minority groups (71)(72)(73)(74)(75).We use this space to amplify studies, perspectives, and tools that we found influential during the execution of this research (76)(77)(78)(79).

Glass dataset
glass_data.tar.gzThe train and validation splits of the glass data, consisting of local neighborhoods immediately preceding rearrangement events, derived from Ref. (42), have been deposited on Figshare as a directory of numpy arrays (https://doi.org/10.6084/m9.figshare.24585150.v1)(68).The neighborhoods were subsequently "measured" as radial densities (Figs. 2 and 3) or as per-particle descriptors (Fig. 4); the code to process the neighborhoods is in the project Github repository.Also included in the dataset are the computed radial distribution functions, g AA (r) and g AB (r), for each quench.

Fig. 1 .
Fig. 1.Decomposing the information contained in the inputs of a Boolean circuit.(a) Ten binary inputs X = (X1, ..., X10) are connected via AND, OR, and XOR gates to a binary output Y .(b) A lossy compression Ui is learned for each Xi and then all Ui are combined as input to a machine learning model trained to predict Y .(c) The distributed information plane displays the predictive information about the output (left vertical axis, black) as a function of the total information utilized about the input.For each value of total information into the model there is an allocation of information to the input gates indicating their relevance to the output Y (right vertical axis, colors corresponding to input gates in panel (a)).The subset of inputs identified as containing the most relevant information (I(Ui; Xi) ≥ 0.1 bits) are indicated at the top of the plot.Dashed lines are used for the information allocations when there is significant overlap.(d) The mutual information between all subsets of input channels and Y are displayed on the distributed information plane as black circles.The optimization of the distributed IB (gray curve) identified subsets of inputs that contain the most predictive information (open circles).

Fig. 2 .
Fig. 2. Decomposing structural information about imminent rearrangement in a sheared glass.(a)Inset: Given a local neighborhood in a sheared glass, densities of radial shells for the small (type A) and large (type B) particles were used to predict whether the neighborhood is the locus of an imminent rearrangement event.Main: For a gradually quenched glass, the information that is predictive of rearrangement (black) increased as the most predictive density information was identified and incorporated into the machine learning model.The accuracy (blue) was comparable to a support vector machine (SVM) (dashed line) after around twenty bits.(b) Sharing the horizontal axis with panel (a), the amount of information extracted about each of the radial density measurements of small (top) and large (bottom) particles reveals the radii with the most predictive information at each level of approximation.The system's average density values for each particle type with type A at the center, also known as the radial distribution functions g AA (r) and g AB (r), are shown on the right.(c,d) The same as panels (a, b) but for glass that was prepared via a rapid quench rather than a gradual quench.

Fig. 3 .Fig. 4 .
Fig. 3. Selected bits of information as distinctions among raw measurement values.(a) Lossy compression is achieved by mapping the raw values of X to probability distributions in latent space.The statistical similarity of the conditional distributions, visualized as a distance matrix for all pairs of feature values, determines how distinguishable the raw feature values are to the predictive model.(b) The single most predictive bit of information about rearrangement in the gradually quenched glass came predominantly from two density measurements.The distinguishability matrices indicate that the compression scheme applied a simple threshold to these measurements: density values less than a cutoff value ρ ′ were indistinguishable from each other, as were values above the cutoff.The histograms of density values conditioned on rearrangement (right) show that the learned cutoff value separates the probability masses.(c) The twenty most predictive bits of radial density information in the rapidly quenched glass were selected from many radial bands.The two that contribute more than a bit of information each correspond to the density of type A particles near the center; one compression scheme effectively counted the number of particles in the high density shell.The distinguishability matrices of the next five most informative radial bands are shown below.