Topological data analysis of zebrafish patterns

Significance While pattern formation has been studied extensively using experiments and mathematical models, methods for quantifying self-organization are limited to manual inspection or global measures in many applications. Our work introduces a methodology for automatically quantifying patterns that arise due to agent interactions. We combine topological data analysis and machine learning to provide a collection of summary statistics describing patterns on both microscopic and macroscopic scales. We apply our methodology to study zebrafish patterns across thousands of model simulations, allowing us to make quantitative predictions about the types of pattern variability present in wild-type and mutant zebrafish. Our work helps address the widespread challenge of quantifying agent-based patterns and opens up possibilities for large-scale analysis of biological data and mathematical models.


Introduction
Patterns are widespread in nature and often form due to the self-organization of independent agents.Whether exploring such collective dynamics in cancer [1], wound healing [2], hair growth [3,4], or skin pattern formation [5,6], researchers focus on uncovering unknown cell behavior and signaling using a combination of experimental and modeling techniques.This process is complicated by the fact that biological patterns are inherently variable, making it challenging to quantify the distinguishing features of different mutants and judge model accuracy.In some applications, such as zebrafish skin patterns (Fig. 1A-D), global information about patterns both in vivo and in silico is largely based on visual inspection, and this naturally leads to more subjectivity and limits the scale of the analyses.Moreover, the focus is often on the characteristic features of different mutants, making it unclear how much variability normally arises in mutant patterns, and how this variability compares to wild-type.To help address these challenges, here we develop a new methodology, based on topological data analysis and machine learning, for quantifying self-organized patterns with an automated, agent-based approach, and we apply our methods to study variability in zebrafish skin patterns.
Characterized by black and gold stripes, the zebrafish (Danio rerio) is a model organism in the field of skin pattern formation [5,7,8].Remarkably, zebrafish stripes form due to the interactions of tens of thousands of different colored cells, which reliably self-organize on the growing skin despite their stochastic environment [9][10][11].In addition to their namesake stripes, zebrafish feature a wealth of other patterns (e.g., spots and labyrinth curves [12]) that form due to genetic mutations that restrict cell birth or alter cell behavior (often in unknown ways).While wild-type stripes (Fig. 1A) are considered robust, mutants that lack certain cell types (Fig. 1B-D) feature more variable spotty patterns [12].For example, the nacre phenotype [10,12,13] has an enlarged central orange region with scattered blue splotches (Fig. 1B).In comparison, both the pfeffer [10][11][12]14] and shady [12,15] mutants are characterized by dark spots, roughly aligned in stripes.These patterns differ in their finer details: pfeffer has messy spots and peppered black cells across its skin, while shady has sharp boundaries between light and dark regions [12].Although these descriptions apply in general, patterns vary due to the stochastic nature of pigment cell interactions.
Mathematical descriptions of zebrafish patterns capture stochastic cellular interactions at different levels of detail.While partial differential equations (e.g., [9,16,17]) offer a broad perspective on the evolution of cell densities, cellular automaton [18,19] and agent-based models [20][21][22] provide a more detailed view of individual cell behavior.For example, the agent-based model [21] specified cell interactions using stochastic rules to simulate zebrafish patterning in silico (Fig. 1F-I).Ideally, models should reproduce pattern formation as it is observed in vivo, and this raises the question: how can we systematically quantify and compare pattern features, particularly in the presence of biologicallyinduced variability?Moreover, researchers seek to identify the cell interactions that are altered in mutant patterns, but this process is limited by the large number of parameters in agent-based models and the need for visual inspection to analyze simulation results.Reliable, automatic quantification of patterns (for both in vivo and in silico data) is therefore fundamental to measuring how well models perform and increasing their predictive potential.
Many black-box machine learning algorithms have been developed for pattern classification, but interpretable methods provide results with a more transparent relationship to biological data.In this vein, Lee et al. [23] showed how to use ImageJ [24] to quantify some traits of giraffe spots; while their process can be automated, it relies on data in the form of contiguous blocks of bits in an image and only captures macroscopic pattern features, losing the underlying discrete, cell-based nature of the biological data.Taking a different approach, Miyazawa et al. [25] assigned a "pattern simplicity score" (associated with the circularity of black-white boundary contours) to images of salmon patterns, and they quantified overall color tone by calculating the ratio of light to dark areas on fish images.These two global measures, which were applied to trout in [26], are broadly applicable but are not intended to capture detailed features.The methodology that we introduce in this paper, in contrast, utilizes the cell-based nature of skin patterns to quantify both macroscopic pattern attributes and microscopic features on the cellular level.
As shown in [28], topological data analysis (TDA) has emerged as a valuable tool for characterizing collective behavior and self-organization.Tools from TDA, specifically persistent homology, allow one to assign shape descriptors to noisy or large data across a range of spatial scales.In the case of collective behavior, this translates to measuring topological summaries (e.g., connected components and loops) of the resulting patterns from the cellular level to the global level.In [28], TDA was applied to study the velocity and positions of agents in simulations of a flocking model.By tracking global persistent homology features over time, Topaz et al. [28] were able to identify agent clusters and detect the presence of global dynamics that would be challenging to notice visually.While such prior work [25,26,28] has demonstrated how to quantify various overall features of patterns, characterizing the distinguishing traits of the different zebrafish patterns in Fig. 1 at the level of pigment cells requires a more detailed perspective.
Inspired by the utility of TDA for quantifying collective behavior, here we show how to reinterpret topological summaries as detailed measurements of pattern features.By combining TDA with interpretable machine learning techniques and working closely with the biological literature on zebrafish, we are able to automatically detect and quantify patterns given agent (e.g., cell) coordinate data.Our main contribution is an automated, interpretable framework for counting stripes and spots, detecting broken stripes, measuring stripe widths, quantifying stripe straightness, calculating spot size and roundness, measuring spot placement, and estimating the onset of stripe formation from pattern data.To illustrate our techniques, we apply our methods to thousands of in silico images of zebrafish patterns generated using the agent-based model from [21].Because zebrafish display a wide range of patterns, we expect that our methodology can be applied to other problems in biological self-organization as well as to in vivo data.Our approach opens up a range of possibilities for large-scale analysis of experimental images to better understand the cellular mechanisms underlying pattern formation.

Background and Methods
Here we give a brief overview of zebrafish biology and the model [21], as well as an introduction to the TDA and machine learning concepts that we use in our methodology (see the Supplementary Materials for additional background).xanthophores strictly colocalise with dense S-iridophores (B! ); the arrow points to a region of X1V that is enlarged in B" .S-iridophores spread over the entire flank in nac;pfe (C! ).Melanophores associated with the blood vessels are visible along myosepta in shd;pfe(D,D!).
epinephrine (Sigma) for melanosome contraction, and photographed under a Leica M205 XA stereomicroscope with a Leica DCX300 X] camera using the software LAS ^4.1 to allow multifocus images.An illumination was chosen to display iridophores optimally while Yanthophore visibility is poor.Photographs were processed in Adobe Photoshop.

RESULTS
Melanophore stripe formation is affected in the iridophore mutants rse and shd Un _ebrafish, melanophores are restricted to the stripes whereas iridophores and Yanthophores are present in both stripes and interstripes (Xig.1A; Xig.2A).Two types of iridophores can be distinguished in the trunk.Superficial S-iridophores form a dense _one in the interstripe (`dense S-iridophoresa) and spread as a thin net-like layer (`blue S-iridophoresa) over the melanophores of the stripe.]anthophores are located on top of S-iridophores.Liridophores form a homogeneous silvery sheet below melanophores (birata et al., 2003).The colour of the iridophores may appear silvery, golden, brownish or blue, depending on the illumination (Xig.1A; Xig. 2).
Mutations in the genes rse and shd result in similar recessive adult phenotypes of varying strengths, displaying (1) a reduction epinephrine (Sigma) for melanosome contraction, and photographed under a Leica M205 XA stereomicroscope with a Leica DCX300 X] camera using the software LAS ^4.1 to allow multifocus images.An illumination was chosen to display iridophores optimally while Yanthophore visibility is poor.Photographs were processed in Adobe Photoshop.

RESULTS
Melanophore stripe formation is affected in the iridophore mutants rse and shd Un _ebrafish, melanophores are restricted to the stripes whereas iridophores and Yanthophores are present in both stripes and interstripes (Xig.1A; Xig.2A).Two types of iridophores can be distinguished in the trunk.Superficial S-iridophores form a dense _one in the interstripe (`dense S-iridophoresa) and spread as a thin net-like layer (`blue S-iridophoresa) over the melanophores of the stripe.]anthophores are located on top of S-iridophores.Liridophores form a homogeneous silvery sheet below melanophores (birata et al., 2003).The colour of the iridophores may appear silvery, golden, brownish or blue, depending on the illumination (Xig.1A; Xig. 2).

Biological Background
Zebrafish stripe patterns consist of three main types of pigment cells: black melanophores, yellow/orange xanthophores, and silver/blue iridophores [12] (Fig. 1A).Xanthophores and iridophores are spread across the skin in two different forms (dense in light interstripes and loose in dark stripes), while black cells reside only in stripes [11,[29][30][31].As these cells undergo differentiation, division, death, migration, and form changes, they self-organize into four to five stripes and four interstripes sequentially over a few months [5].Cells regulate each others' behavior through communication at short range (between neighboring cells) and at long range (between cells in stripes and interstripes) (e.g., [9,16,[32][33][34]); see Fig. 1E.Importantly, this regulation is inherently noisy.For example, cells may interact by reaching extensions toward their neighbors [27,35,36]; whether or not cellular communication occurs then depends on if these extensions successfully find another cell.Previously, models [20,21] have used estimates of wild-type stripe width [27,37] and descriptions of developmental timelines (e.g., approximate times at which new stripes appear) [5,38,39] to judge model performance or fit parameters.Less data is available for zebrafish mutants, and, to our knowledge, global information is in the form of qualitative descriptions of the characteristic features of their patterns.Local measurements, in turn, include cell speeds [40,41] and distances between adjacent cells [33,40,42].Notably, we are not aware of measurements of pattern variability or stripe straightness.

Model and Generation of In Silico Pattern Data
The model [21] treats pigment cells as individual agents (point masses) and tracks their positions (namely (x, y)coordinates) in space as they interact on growing 2-D domains.The behavior of five different types of cell agents is accounted for: we let M i (t) be the (x, y)-coordinate of the ith melanophore (M ) at time t; similarly, X d i (t), X i (t), I d i (t), and I i (t) denote the locations of the ith dense xanthophore (X d ), loose xanthophore (X l ), dense iridophore (I d ), and loose iridophore (I l ), respectively; see Fig. 1K.Space is continuous, and cell movement, which includes repulsion and attraction, is modeled by coupled ordinary differential equations.Cell birth, death, and transitions in type, in turn, take the form of stochastic, discrete-time rules.These rules, which are strongly motivated by the biological literature (e.g., [12,16,34,40]), depend on the number of cells in disk and annulus neighborhoods centered at the cell or location of interest (see Fig. 1J).The authors [21] use these neighborhoods to model the cells that a given cell (or precursor) could communicate with (e.g., through direct contact [43], diffusing substances [34], or dendrite extensions [27,35,36] as in Fig. 1E).As an example cell interaction rule, interstripe cells are known to promote M differentiation at longrange [16,32], and these dynamics are modeled as follows: where z is a randomly selected location to be evaluated for possible cell birth; N d X , N d I , and N M are the numbers of X d , I d , and M cells on the domain, respectively; and Ω z long is an annulus centered at z that models long-range cellular communication (see Fig. 1J).According to (1), a new M cell appears at position z when the ratio of interstripes cells to M cells at long-range is greater than one.(Note that the interaction rules in [21] are given in terms of numbers, rather than proportions, of cells.We have adjusted the model [21] so that these rules depend on the ratios or densities of cells in different regions, as this framework works better for our large-scale study; see the Supplementary Materials for more details.) The agent-based model [21] can be used to simulate the full timeline of adult pattern formation from when it begins when the fish is roughly 21 days post fertilization (dpf).Because the model [21] is stochastic, simulating it repeatedly leads to different in silico patterns and, importantly, for our methods, cell-coordinate data.We thus generate an extensive data set by simulating the development of thousands of zebrafish patterns.We simulate wild-type development from 21 dpf until 66 dpf, at which point zebrafish are expected to have three complete interstripes, two complete stripes, and some partially-formed stripes near the boundaries.We simulate nacre and pfeffer pattern formation until 76 dpf and shady development until 96 dpf by turning cell birth off for the appropriate cell types as described in [21].(We note that experimentalists often use stages [38] rather than dpf to measure time; in the model [21], 66 dpf, 76 dpf, and 39-44 dpf correspond to the Juvenile, Juvenile+, and Squamation onset Posterior stages, respectively.)With one exception, we perform all of our analyses on the final simulated patterns at 66 dpf (for wild-type), 76 dpf (for nacre and pfeffer), and 96 dpf (for shady).Following the approach in [21], we enforce periodic boundary conditions in the horizontal direction and wall-like boundary conditions at the top and bottom of the domain (see Fig. 3A).To help avoid quantifying partially-formed stripes or spots, we remove the cells in the top and bottom 10% of the domain in post-processing.
To generate our first data set, we simulate wild-type, nacre, pfeffer, and shady patterns under the baseline conditions and parameters described in [21].We then adjust the model to account for more realistic biological stochasticity in cell interactions.In particular, rather than using deterministic length scales in the cell interaction rules, each day we select these length scales randomly per cell and interaction from a normal distribution centered at the default parameter value.In our last data set, we focus on the inner radius of Ω long in (1) and explore the role of this parameter while keeping all other parameters at their default values.

Topological Data Analysis and Machine Learning
Our approach to quantifying patterns relies on topological data analysis and machine learning.Topological data analysis (TDA) is an emerging branch of mathematics and statistics that aims to extract quantifiable shape invariants from complex and often large data [44][45][46][47][48].One of the main tools in TDA is known as persistent homology, which we review now briefly.Given a data set of N discrete points {x i } N i=1 that lie in some metric space (D, d D ), we place a ball of radius r at each x i to obtain the set b r (x i ) = {y ∈ D : d D (x i , y) ≤ r}.We then take the union of these balls over all i ∈ [1, N ], namely i∈[1,N ] b r (x i ).This process yields a new manifold with shape generated by the original data, and persistent homology tracks how the shape of this manifold changes as r increases (see Fig. 2).
For our work it suffices to view the dimension 0 and dimension 1 persistent homology groups as vector spaces whose dimensions correspond the number of connected components and loops, respectively, of the evolving manifold (see the Supplementary Materials and [44][45][46][47][48] for more details and definitions).The number of generators of the ith homology group is called the ith betti number, denoted β i .If a topological feature (e.g., connected component or loop) appears at some radius r b and disappears at some radius r d > r b , then we say this feature is born at r = r b and dies at r = r d , and its persistence is given by r d − r b .For example, because a figure eight has one connected component and two loops, this shape has β 0 = 1 and β 1 = 2. Now consider a noisy data set sampled from a figure eight, as we show in Fig. 2A.To compute the persistent homology of this data we take the union of balls of radius r centered around each data point for an increasing sequence of r values.Two loops appear in the data at r = r 2 and disappear before r = r 3 in Fig. 2B, so this data set has two dimension 1 homology generators that are both born at r b = r 2 and die at r d = r 3 (with persistence given by r 3 − r 2 ).Similarly, this data set is connected for r ≥ r 2 , so it has one dimension 0 homology generator for r ≥ r 2 with infinite persistence and several dimension 0 homology generators for r < r 2 .Thus, persistent homology reveals that the noisy data in Fig. 2A is topologically similar to a figure-eight shape (β 0 = 1, β 1 = 2) for r 2 ≤ r < r 3 .
In addition to using TDA, we apply methods from interpretable machine learning to quantify patterns.Machine learning algorithms seek to automatically learn information from a given data set for classification or prediction purposes [49,50].The machine learning approach we use involves clustering data into different classes based on a similarity measure.Specifically, we apply single-linkage clustering to subsets of agents (e.g., pigment cells) to identify clusters corresponding to spot or stripe patterns.Single-linkage clustering is an agglomerative hierarchical clustering method: each data point begins in its own cluster and points (or clusters of points) are merged sequentially based on which two clusters are closest to each other [49,50].We continue this process until there are n clusters, where n is either one or some predetermined number of desirable clusters.We use single-linkage clustering over other clustering algorithms (e.g., average linkage or k-means) to capture elongated, undulating, and non-spherical clusters that are characteristic of some zebrafish mutants (see Fig. 1).
As a side note, dimension 0 persistent homology is analogous to single-linkage clustering, so there is a natural connection between TDA-and clustering-based methods for pattern quantification [44].Using clustering and topological methods in tandem yields both multi-dimensional, coordinate-free summaries (from TDA) and essential information about the locations of different agents (from clustering).

Results: Our Methodology for Quantifying Patterns
We now use TDA and machine learning to develop our main result: an interpretable, agent-based methodology for automatically quantifying patterns that arise due to self-organization.We summarize our methods in Table S1 and illustrate how they can be applied to zebrafish in Fig. 3. Direct methods for measuring local pattern features are further presented in the Supplementary Materials.
Tailored to a specific application (zebrafish), our work opens up a new way of thinking about TDA tools and utilizing them to obtain detailed measurements of patterns.We expect that a similar approach can be used to study other patterns with data in the form of agent coordinates or images (with functional persistence).To help encourage further applications of TDA to self-organized patterns, we thus present our methods using general language in the next section, while also using zebrafish to highlight the kinds of application-specific considerations one must address when applying TDA to new data.In particular, one application-specific step involves determining what agent type(s) to use as input for topological feature computations.For example, multiple types of cells are present in the same pattern features on zebrafish (e.g., in Fig. 1A, both I d and X d appear in interstripes).Applying TDA to the locations of every agent type in a pattern is expensive.It may be sufficient to study only one or two agent types, but selecting which types to use requires application-specific considerations.
Fig. 3: Illustration of our topological techniques applied to zebrafish patterns.(A) Boundary conditions are periodic in the x-direction, so stripes and interstripes are viewed as loops from a topological perspective.(B) We count interstripes and measure stripe width using persistent homology.We show manifold expansions of the locations of X d cells by considering balls of growing radius r centered at the location X d i of each cell.When r = r 1 , the radius of the balls is about half the maximum distance between neighboring X d cells ∆ xx .At this point, three interstripes have formed, but the number of loops is larger than the true number of interstripes due to gaps between cells, highlighted by red arrows (β 0 = 3 and β 1 > 3).As r increases to r 2 , the noisy loops die off, leaving only three loops (β 0 = 3 and β 1 = 3).The long persistence of three loops corresponds to the true presence of three interstripes.As r increases further to r 4 , the manifold collapses to a single connected component (β 0 = 1 and β 1 = 1).The difference between the ball radius at which this collapse occurs (r 4 ) and the ball radius at which three loops appear (r 1 ) approximates half the maximum width of black stripes.(C) By combining TDA with clustering methods, we automatically detect interstripe boundaries and measure their curviness; we show the percent increase in arc length distance (ALD) of these boundaries (traced out in red) relative to perfectly straight stripes here.(D) We describe spotted phenotypes by combining persistent homology, clustering methods, and principal component analysis.We use β 0 to quantify the number of spots.As an example, we show the spot size and spot roundness for two nacre spots.

Counting Spots and Stripes
We compute the dimension 0 and dimension 1 persistent homology groups using the coordinate data of agents (e.g., pigment cell locations generated by the model [21]) to quantify pattern types, assuming periodic boundary conditions in the x-direction.With these boundary conditions, spots can be viewed as connected components without loops, whereas stripes wrap around the domain and are thus connected components with a single loop (see Fig. 3A-B).Consequently, β 0 and β 1 approximate the number of spots and stripes in a pattern, respectively. 1or zebrafish, we estimate the number of stripes and interstripes in wild-type patterns by computing β 1 for X l and X d cells, respectively.We apply TDA to these cells because they uniformly cover the fish skin, but in different forms in stripes and interstripes. 2We estimate the number of spots in nacre and pfeffer patterns by computing β 0 using the locations of blue I l cells. 3For pfeffer, individual M cells appear randomly on the domain, so using these cells to count the number of spots would introduce spurious connected components (in the form of individual black cells).In comparison, M are much more clustered in shady; thus, we calculate the number of dark shady spots by computing β 0 for M .
In general, we calculate betti numbers by applying persistent homology to the agents' coordinate data and using a persistence threshold to count the number of homological generators whose persistence is greater than the set threshold (T p ). Empirical estimates of cell-cell spacing motivate our choice of T p for zebrafish.Specifically, we use T 0 p = 100 µm and T 0 p = 90 µm as the dimension 0 persistence thresholds for iridophores and melanophores, respectively.We chose these thresholds conservatively, as average xanthophore-xanthophore neighboring distances are 30-60 µm and average melanophore-melanophore distances are roughly 50-60 µm in wild-type [21,33,40,42].(We are not aware of empirical measurements of iridophore spacing.)For dimension 1 homology, we use a universal persistence threshold of T 1 p = 200 µm.Moreover, to ensure that we correctly differentiate between complete and broken stripes or interstripes, we specify that a persistence generator only counts toward β 1 if its birth radius r b is below a certain threshold (T 1 b ).For X l and X d , we use T 1 b = 100 µm and T 1 b = 80 µm, respectively.These thresholds were motivated by empirical cell-cell distance measures [33,40,42] and tuned based on parameter fitting experiments with stripe and interstripe breaks.
Simultaneously, we can use persistent homology to identify stripe breaks when the number of expected stripes is known (see Fig. 4A for examples of stripe and interstripe breaks).Namely, we flag a stripe break when β 1 is less than the expected number of stripes.Here we additionally consider β 1 of the M cells, with T 1 b = 90 µm and T 1 p = 200 µm.We compute β 1 for both X l and M because the former appear at low density in dark stripes; computing β 1 for both cell types allows us to be more confident in our results.As we discussed in Background and Methods, we expect that our simulated zebrafish patterns have two fully formed stripes and three fully formed interstripes at the time of our analysis, so we flag a stripe break when β 1 < 2 for both X l and M cells.Similarly, we flag an interstripe break when β 1 < 3 for X d cells.

Measuring Stripe Width
Beyond quantifying the number of stripes or spots, we leverage TDA to approximate stripe and interstripe widths.In particular, we estimate (inter)stripe widths using the persistence (r d − r b ) of the significant dimension 1 persistence points.We define significant dimension 1 persistence points as those with persistence greater than or equal to T 1 p and birth radius r b less than or equal to T 1 b .For example, the persistence of a stripe loop is the difference between the radius value (r d ) at which two adjacent stripes combine to form a single loop and the radius value (r b ) at which the stripe feature initially formed (here we ignore the features that persist to infinity).This difference (r d − r b ) corresponds to half of the maximum distance between two adjacent stripes, capturing the maximum width of the enclosed interstripe (see Fig. 3).
In wild-type zebrafish, twice the persistence of the yellow X l loops yields an approximation for the maximum interstripe width across the fish.Similarly, twice the persistence of the orange X d loops approximates an upper bound on stripe width.We note that r d alone could be used as an alternative estimate for maximum (inter)stripe width, but we use r d − r b to account for the narrow boundary region between stripes and interstripes on zebrafish.To obtain a lower bound on stripe width, one could calculate the persistence of the significant dimension 0 persistence points, as this measurement is based on half of the minimum distance between two adjacent interstripes.

Measuring Spot Size
We measure spot size by applying single-linkage hierarchical clustering to the agents of interest with the number of desired clusters (e.g., number of spots) set to the β 0 values we obtained from our topological analyses.Then, we count the number of cells per cluster to approximate the size of each spot.We define "spot size" as the median number of agents per spot across all of the spots.

Quantifying Stripe Straightness
To measure "stripe curviness" we compute the arc length distance (ALD) of the boundary of each single-linkage cluster that corresponds to a stripe.We define our stripe curviness measure to be the average percent increase of this ALD from the ALD of straight stripes: For example, to measure the curviness of wild-type zebrafish stripes, we apply single-linkage clustering to the locations of X d cells.For the number of desirable clusters n, we use the number of expected interstripes minus the number of stripe breaks that we identified with persistent homology (see Fig. 3C).We then calculate the ALD for the resulting clusters and compute stripe curviness using (2).

Quantifying Spot Roundness
To estimate spot uniformity, we use the clusters identified via single-linkage hierarchical clustering (with the number of desired clusters set to the β 0 values).We then apply principal component analysis (PCA) to each cluster.The eigenvalue decomposition in PCA provides information about how varied the data is in each dimension.Since our data is twodimensional, we use PCA to evaluate the spread of each cluster in the xand y-directions.If a spot has significantly more variance in one direction, this indicates that it is irregularly shaped or elongated.Specifically, we define our roundness measure as: We assume that a PCA eigenvalue ratio close to one implies round spots, while a PCA eigenvalue ratio 1 indicates irregular, non-uniform spots (see Fig. 3D for examples).

Determining Spot Alignment and Center Width
We quantify spot alignment by first applying single-linkage hierarchical clustering to agent locations (with the number of desired clusters set to the β 0 values).We then calculate the pairwise l ∞ distances between the cluster centroids and complete a nearest-neighbor search with the l ∞ metric. 4This allows us to extract the distance from each spot to its closest neighboring spot.We define the spot-spacing variance as the standard deviation of these nearest-neighbor l ∞ distances.A large spacing variance corresponds to non-uniform spot placement, while a low spacing variance predicts well-aligned spots.
Motivated by the nacre and shady patterns, which feature expanded light central regions (see Fig. 1G and Fig. 1I), we also use the cluster centroids to approximate the center width, defined as twice the distance from the midpoint of the domain to its first spot.In particular, we estimate the center width as twice the minimum distance from the cluster centroids to the midpoint of the domain, minus the median spot diameter.Here we define spot diameter as twice the greatest Euclidean distance from the spot's centroid to cells belonging to the spot.For zebrafish, the center radius corresponds to the width of the central interstripe X0; see Fig. 1F-G.

Capturing Pattern Formation Events
Thus far, we have focused on quantifying pattern features at a snapshot in time.However, for self-organization that occurs during organism development, it is also useful to estimate the time at which specific features emerge.For example, in wild-type zebrafish, the second and third interstripes X1V and X1D (see Fig. 1F) develop around 39-44 dpf (based on approximations [21] of images in [12,38]).This information on target time dynamics serves as an additional quantitative measurement that can be used to evaluate models.Here we present a method for quantifying the time at which stripes X1V and X1D form; future work could extend these methods to capture the time dynamics of spot formation and other features.
Given data in the form of agent locations at consecutive time points, we first assume new stripes form somewhere between day d 0 and d 1 .If there is no prior knowledge about the expected time of stripe development, one can set d 0 and d 1 to the first and last days of pattern development, respectively.For zebrafish, because the model [21] was parameterized so that interstripes X1D and X1V form around 39-44 dpf, we conservatively set d 0 = 32 dpf and d 1 = 62 dpf.Within the specified time interval, we then analyze the patterns sequentially beginning at d 0 , assuming there is initially a single stripe on the domain.At each time step, we find the upper and lower bounds of the stripes by computing the maximum and minimum, respectively, of the y-coordinates of the agents of interest (e.g., for zebrafish, we use X d cells).Finally, we estimate the initial formation of new stripes as the first day at which the upper or lower bounds of the stripes increase by more than some threshold from the previous day.For zebrafish, the threshold we use is 200 µm. 5esults: A Quantitative Study of Zebrafish Patterns We now study zebrafish pattern variability and robustness by analyzing thousands of in silico wild-type and mutant patterns generated using the agent-based model [21].Quantitatively evaluating data of this scale is possible because of our automated framework.As a baseline test, we begin by illustrating our techniques on simulations of wild-type zebrafish stripes.Because our analysis there is consistent with previous characterizations collected visually and local pattern measurements, we then use our methods to extract quantifiable features from mutant patterns and measure pattern variability in the presence of increased stochasticity in cell interactions.We conclude by showing how our methods can be used to detect the impact of changing a given model parameter without the need for visual inspection.
We view our results in the next sections as presenting a broader, more objective picture of the behavior of the agentbased model [21].Additionally, because this model is closely based in the biological literature, our results serve to predict the kind of pattern variability we expect to see in vivo based on the model [21].As large-scale collections of experimental images become available, our predictions can be tested by applying our techniques to in vivo images of zebrafish as well.

Illustrating our Techniques on Wild-type Zebrafish
We focus on stripes first because they provide a means of testing our methodology, as wild-type patterns have the most experimental data (collected both in silico and in vivo) available for comparison.Here we use our methodology to evaluate 1, 000 wild-type zebrafish patterns generated with the model [21] under the default parameter regime.Previously, model performance [21] was judged by manually counting the number of stochastic simulations that display breaks (or interruptions) in interstripes and requiring matches in pattern features (e.g., number of interstripes present) at major developmental timepoints.In particular, by inspecting 100 in silico patterns, the authors [21] reported a success rate of 89% according to the former goal, meaning that 89 out of 100 simulations had no interruptions in interstripes.(Note that breaks in black stripes are occasionally seen on real fish, so these interruptions were not quantified in [21].)Our methodology allows us to analyze much larger data sets and remove any human error from the process; we demonstrate how topological methods can be used to detect stripe breaks automatically in Fig. 4A.Across 1, 000 wild-type simulations, we find that 87.5% have no breaks in interstripes (flagged by a decrease in β 1 for X d cells).This agrees well with the success rate in [21] that was computed using visual inspection.
As an additional evaluation measure, we manually viewed 200 model outputs and found that the betti numbers capture interstripe breaks with 100% accuracy and only one false positive.In a similar vein, the model [21] was parameterized so that interstripes X1D and X1V (see Fig. 1F) form between 39-44 dpf, but until now this property was judged by visual inspection.Using our automated methods, we show the distribution of times at which these interstripes develop in Fig. 4B and find good agreement with the target pattern milestones in [21].
Fig. 4C-D show the distributions of interstripe width and stripe curviness across 1, 000 wild-type simulations.The maximum interstripe width, measured by the persistence of the significant dimension 1 persistence points of X l , represents the maximum separation between adjacent stripes.We find that this quantity has a mean of about 814 µm and a standard deviation of approximately 49 µm, which is similar to the average distance between cells [40,42], suggesting that variance in stripe width may be caused by the addition or loss of one cell.Similarly, in Fig. 4D, we show measurements of wild-type stripe curviness (2), a dimensionless quantity that could be compared to empirical data in the future.More generally, Fig. 4B-D provide a baseline measurement of the model output [21] that we use to compare to further studies.

Quantifying "Characteristic" in Noisy Mutant Patterns
The nacre, pfeffer, and shady mutants lack specific cell types, leading to altered patterns, which are highly variable and can be broadly described as spotty (see Fig. 1B-D).Here we use our methods to analyze 1, 000 in silico patterns generated with the model [21] under the default parameter regime for each mutant.Our results, shown in Fig. 5, serve as quantitative descriptors of what constitutes "characteristic" for each mutant (according to the model) and demonstrate our methods' abilities to extract quantifiable differences between spot patterns.Among the three mutants, we find that pfeffer has the most spots and that these spots are the most round and the most evenly spaced (Fig. 5A, C-D).In comparison, nacre and shady have a similar number of spots, but the spots on shady are smaller and rounder than those of nacre.(As we noted in Background and Methods, we remove a small region at the top and bottom of the domain prior to our analysis to avoid quantifying partial spots.)Moreover, the width of the central X0 interstripe in pfeffer is closest to wildtype interstripe width (see Fig. 4C), while both nacre and shady feature expanded central interstripes, echoing empirical observations [12].Interestingly, we find that the variance in the number of spots for all three mutants is small (a standard deviation of about two spots).With the exception of nacre, which displays the greatest variability in four of the five measurements we present in Fig. 5, the variance in spot spacing and the width of the central interstripe X0 is also small (on the order of the distance between neighboring cells [40]).In the future, it would be interesting to compare these quantities to large-scale in vivo data and determine what cell interactions in the model [21] are responsible for selecting them robustly.

Measuring Pattern Variability
Some cellular interactions on the zebrafish skin are thought to be regulated by direct contact, dendrites, or longer projections [27,35,36] (see Fig. 1E).To account for this, the model [21] assigns disk (short-range communication) and annulus (long-range communication) interaction neighborhoods to each cellular agent (see Fig. 1J).Cell birth, death, and form transitions are then governed by rules (e.g., (1)) that depend on the proportion of cells within these neighborhoods.The size of the neighborhoods dictates which cells are able to interact and therefore plays a critical role in patterning.While   1G).We show KDE curves with a Gaussian kernel overlaid on the histograms of data; the mean plus/minus the standard deviation is shown in each plot for the data.
the interaction neighborhoods have deterministic sizes (based on empirical measurements [27,36,40]) in [21], a more realistic model should account for stochastic variations in cell size and projection length.Randomly varying the length scales involved in the interaction neighborhoods serves as a means of including more realistic cellular communication (which could also include diffusion of signaling factors [34] in the future) in agent-based models of zebrafish.As a first step toward including more realistic stochasticity, we therefore replace the deterministic length scales in the model [21] with stochastic length-scale parameters and measure their effect on pattern variability.This models the presence of randomness in cell interactions due to variations in cell size and projection length.Interaction neighborhoods appear in 17 places in the rules that govern M birth, M death, iridophore form changes, and xanthophore form changes in the model [21].For each cell interaction, we randomly select the size of the associated interaction neighborhood per cell per day from a normal distribution with the mean set to the default parameter value.We vary the standard deviation from 1-50% of the mean and for each standard deviation (we consider σ ∈ {0.01, 0.05, 0.1, 0.2, 0.3, 0.5}, where σ times the default length scale is the standard deviation of the normal distribution), we run 1, 000 simulations each for wild-type, nacre, pfeffer, and shady. 6Our goal in this study is two-fold: first, we aim to make quantitative predictions comparing variability in wild-type and mutant patterns, and second, we seek to identify the range of patterns these fish may display in the presence of stochastic cellular communication.
In order to quantitatively explore how additional stochasticity impacts patterning, we first need to define what it means for a pattern to look the same as (or different from) what we would expect characteristically.For wild-type, this is immediate: we characterize wild-type patterns in terms of stripe and interstripe breaks.For nacre, pfeffer, and shady, however, the process is more challenging because these mutant patterns are messier.For example, from looking at the   images of nacre in Fig. 1B and Fig. 1G, it is not clear at what point in silico patterns consisting of elongated, orange globs should be considered good or bad matches for nacre.This is where our baseline analysis of nacre, pfeffer, and shady plays a role.We use our earlier analysis of simulations in the default parameter regime to identify patterns that fall outside of what constitutes "characteristic" for each of these mutants (in terms of number and size of spots).For each mutant, we set our thresholds for small and large spots to be the minimum and maximum values, respectively, of the cluster-size measures that we found in our baseline experiments with that mutant.Analogously, for each mutant, we set the threshold for what constitutes few (many) spots to be the minimum (maximum) number of spots we found in our baseline simulations with that mutant.In Fig. 6A-D, we show how prevalent various patterns are across our stochastic simulations for different levels of noise in cell-interaction length scales (see Tables S2-S5 in the Supplementary Materials for additional measurements).
As an agglomerate summary across all 6, 000 simulations that we generated for different σ values, Fig. 6E-H provides examples of the different patterns categorized by our methods for wild-type and each mutant.Our results in Fig. 6A-D suggest that wild-type and mutant patterns behave differently in the presence of noise.In particular, all three mutants have characteristic spots in less than 50% of the model outputs when σ ≥ 0.2, while wild-type patterns retain characteristic unbroken stripes and interstripes more robustly.If we take a closer look at individual pattern features in Fig. 6I-J, we note that low levels of noise (σ ≤ 0.1) serve to straighten stripes and that stripe width is mostly unaffected by the inclusion of noise in cell size and projection length.As stochasticity increases, wild-type patterns display a gradual decay in quality over the range of σ values that we consider.With increasing noise, we find more breaks in interstripes, wider interstripes, curvier stripes, and marginally slower pattern formation (see Table S2).Wild-type stripes do not appear to completely deviate from characteristic until σ = 0.5, at which point broken stripes become the norm.
In comparison, the mutant patterns are almost unaffected by noise for σ ≤ 0.1, but then undergo a sharp change in pattern features as σ increases.When nacre and pfeffer stray from characteristic, we mostly observe small spots or scattered cells (Fig. 6B-C).Noisy length scales in shady, in turn, generally produce patterns with few or no dark spots (Fig. 6D).Related, Frohnhöfer et al. [12] observed that strong forms of the shady mutant have no spots.As we show in Fig. 6B-D and Tables S3-S5, spots on all three mutants retain their characteristic roundness across a range of σ values, only deviating substantially from the measures in Fig. 5 when σ = 0.5.
To roughly approximate the amount of noise present in cellular length scales in vivo, we estimate the standard deviation reported for the distance between neighboring xanthophores [33] and the length of their filopodia extensions [31].Based on graphs in [33], we estimate that the distance between the centers of neighboring X d cells (at 40 dpf) is 27 µm with a standard deviation of 4.6 µm; in our notation, this means that σ = 27/4.6,so the standard deviation is about 17% of the mean.Similarly, using graphs in the SI of [31], we estimate that the longest xanthophore extensions (measured from the cell center) have a standard deviation in length that corresponds to 12% and 20% of the mean filopodia lengths before and after iridophores arrive on the skin, respectively (in particular, we find that the filopodia length before iridophores arrive is approximately 58 µm ± 6.7 µm, and the filopodia length after iridophores arrive is approximately 25 µm ± 5 µm).These measurements suggest that focusing on the patterns that emerge when σ is between roughly 0.1 and 0.2 in our simulations may have particular biological relevance.We caution that this approximation is based on variance in short-range length scales only, and cells may also communicate through long-range projections [27,36] (as well as diffusion of signaling molecules [34]); moreover, in comparing these measurements to our simulations, we are inherently assuming that the empirical data has a normal distribution.
Motivated by our estimates of standard deviation in vivo, we explore what our analysis predicts when σ ∈ [0.1, 0.2].As we note in Table S2, we find that wild-type stripe width, stripe curviness, and the time of formation of interstripes X1V and X1D are robust in this range of σ.Our methods allow us to estimate that 84.8% and 78.1% of the wildtype patterns for σ = 0.1 and σ = 0.2, respectively, feature characteristic unbroken interstripes (recall, 87.5% of our simulations in the baseline experiments with σ = 0 have unbroken interstripes).Echoing empirical observations [12] that mutant patterns are more variable than wild-type, we find that the model [21] supports a distribution of mutant patterns for σ ∈ [0.1, 0.2].In particular, we predict that the representative images of nacre, pfeffer, and shady in Fig. 1B-D and Fig. 1F-H are characteristic of these mutants in the sense that roughly half of the associated fish may resemble them, while we expect that the remaining fish resemble versions of these images with fewer and smaller spots.Crucially, we predict that the mutants do not commonly display larger spots than those in Fig. 1F-H.In the future, analyzing extensive collections of empirical images will allow one to test our predictions and the model [21].

A Means of Linking Altered Cell Behavior to Mutant Patterns
Thus far, we have focused on exploring wild-type patterns and the nacre, pfeffer, and shady mutants.Based on transplantation experiments [10][11][12], these mutant patterns seem to arise because a cell type is missing, rather than due to altered cell interactions.Zebrafish also feature a second type of mutant pattern that forms because cell behavior is altered (often in unknown ways) despite all cell types being present.Examples of this second type of mutant include leopard and obelix, which feature spots and widened stripes, respectively [10].Mutations that alter cell behavior provide modelers with an opportunity to help link genes to cellular function.(We note that many zebrafish genes have an orthologue in the human genome [51].)One can adjust cell behavior in a model to search for patterns that match various mutants; in this way, a modeling approach can help establish links between cell behaviors and the genes that control them through the phenotype.Agent-based models (e.g., [20][21][22]) often have a large number of parameters, however, and this makes it challenging to comprehensively screen for the cellular interactions that may be related to various mutants by adjusting parameters and visually inspecting the resulting simulations.In a similar vein, modelers seek to present a broad picture of the impact of varying different parameters, but this process is again often limited by the time-consuming nature of visual inspection.We expect that our methods can be used to help address these challenges, and we provide one example to illustrate this process next.
As an example study, we vary a single parameter in the model [21] across a range of values and apply our methods to the resulting patterns.In particular, we focus on the cellular interaction radius represented by Ω long in (1).As shown in Fig. 1J and discussed in Background and Methods, long-range interactions depend on the proportion of cells in an annulus region Ω long in the model [21].(1) describes M birth as occurring at randomly selected locations z when the number of I d and X d cells in Ω z long is sufficiently larger than the number of M in this annulus.This models empirical observations that M differentiate from precursors or stem cells [39,52,53] and that X d and I d in neighboring interstripes support black cell birth, while other M inhibit it [16,32] at long-range.In [21], the inner radius of the annulus Ω long is 210 µm (motivated by in vivo measurements of cellular extensions in [27,36]) and the width of the annulus is 40 µm.
Here we vary the inner radius parameter from 10 to 400 µm in increments of 25 µm and run 100 simulations under each parameter regime for wild-type, pfeffer, and shady7 .This allows us to comprehensively explore the impact of long-range signaling on M differentiation.
If Ω long in (1) is too small (e.g., when its inner radius is below 30-80 µm, the average distance between cells [33,40]), it is likely that there are no or very few cells in this annulus region, so that the signal from X d and I d to promote M cell birth is effectively turned off.Intuitively, this should lead to an M shortage in the resulting patterns.Conversely, we expect that increasing the inner radius of Ω long will widen black stripes.To test these hypotheses and determine the role of this parameter in wild-type, we use our methodology to measure stripe width, interstripe width, and stripe curviness across a range of Ω long values.For pfeffer and shady, we compute spot size, number, and roundness as a function of Ω long in (1).We present our results in Fig. 7 using kernel density estimation plots to visualize the 2-D probability density function of pattern features and parameter values.As we show in Fig. 7A-B and 7E, there is a strong positive correlation between the spatial scale of long-range signaling in M birth and stripe width, interstripe width, and spot size.To check that the quantities we detected automatically agree with results by visual inspection, we show a few sample simulations in Fig. 7G for different Ω long values.As expected, we find that the width of black stripes in wild-type increases as the spatial scale of long-range signals promoting M birth increases.Conversely, when the scale of M -birth signals in (1) is very local, the resulting patterns vaguely resemble the nacre mutant, which features no melanophores [10,12].This highlights the importance of large-scale simulations and automated methods, as they allow comprehensive model explorations and provide a more complete picture of the roles of different parameters.
To further explore our results, we ran a linear regression analysis on the pattern quantities that we present in Fig. 7.For wild-type, we find that a linear model in stripe width yields a coefficient of determination R 2 = 0.912, meaning that the linear model captures 91.2% of the observed stripe width variance.For shady, a linear model in spot size has a corresponding R 2 = 0.901, while a linear model in spot size for pfeffer has a lower goodness-of-fit (R 2 = 0.722) because the spot size increases more rapidly.Regression models of this type can be used to predict pattern quantities without needing any reference data.In particular, these simple regression models have the potential to allow one to predict pattern features as a function of cellular interaction signals without needing to perform any model simulations.
The results of our case study exploring the impact of a single parameter (related to long-range signals in melanophore differentiation) show promise.In particular, they suggest that our methods can be applied not only for pattern quantification but also for model sensitivity analysis and large-scale parameter screening to detect possible ways that cell interactions may be altered in mutations.We note that our goal here is to highlight another way our methods can be applied, and we leave a more thorough investigation of zebrafish mutations and the altered cell interactions involved for future work.For example, the obelix mutant [10] features widened stripes due to unknown altered cell interactions; by systematically varying parameters in the model [21] and automatically detecting their impact on stripe width, one could identify a set of altered cell behaviors that may be responsible for this phenotype, and these predictions could then be evaluated experimentally.

Discussion and Conclusions
Our goal was to provide methods for quantifying agent-based patterns across a range of scales.Leveraging topological data analysis and machine learning, we developed a new methodology that captures information spanning local features of interacting cells up to macroscopic spots and stripes.Because it describes shape features across a sequence of spatial scales, persistent homology is a critical tool in our methods.We showed that combining this topological tool with clustering methods yields a collection of summary statistics that can be automatically extracted from patterns using agent coordinates.By reducing the role of visual inspection in describing patterns, our interpretable methodology provides a new means of analyzing large data sets and studying how stochasticity in agent interactions affects pattern variability.To illustrate the promise of our methods, we applied our methodology to an extensive data set of in silico zebrafish skin patterns that we generated using the agent-based model [21].Our methods allowed us to make quantitative predictions about the types and amounts of variability that may arise in wild-type and mutant zebrafish patterns due to stochasticity in cellular communication.We used our methods to distinguish and characterize similar mutant patterns, and we showed how to track pattern features across spatial scales to study the role of different cellular interactions in pattern formation.Many of our results, which provide a broader view of the agent-based model [21], can be experimentally tested in the future.In particular, after extracting cell coordinates from zebrafish images, one could compute summary statistics for the empirical data and compare these measurements to our simulations.Our methods could also be applied to other models of zebrafish patterning, including partial differential equations (e.g., [9,[16][17][18], stochastic cellular automaton perspectives [18,19], and agent-based models (e.g., [20,22]).In the future, one could use our methods to optimize model parameters or conduct large screens for cell interactions that may be altered in mutations.Although we focused primarily on analyzing zebrafish patterns at a fixed point in development, future work could track pattern features across developmental timelines.
Our approach to quantifying zebrafish patterns begins to address major challenges associated with quantifying agentbased dynamics in an objective and automated way, but there are also limitations to our methods.First, we make underlying assumptions about the patterns that we are studying.As an example, when we use topological methods to quantify spots or stripes, we assume that the input patterns have certain features (e.g., we assume a wild-type input has stripe patterns).It may be useful for future studies to automatically classify each input pattern as spots or stripes prior to applying the appropriate pattern quantification methods.Moreover, we focused primarily on spots and stripes, but methods for characterizing other patterns (e.g., labyrinth patterns on the choker mutant [12]) could be developed in the future.Lastly, we note that we built our methodology to take data in the form of agent coordinates.Empirical images and simulations from partial differential equation models, however, are continuous functions defined over 2-D domains.In the former case, one option would be to extract cell locations from image data, and, in the latter, one could apply our methods to cell densities after discretizing space and applying a density threshold.Fortunately, functional persistent homology could avoid both of these extra steps as it takes function data as its input.In the future, one could apply our approach to continuous pattern data by replacing the TDA tools that we used with functional persistence throughout our methodology.
Although we focused on analyzing pattern variability in zebrafish, we expect that a similar approach can be used to quantify agent-based dynamics in other biological settings.Methods that provide summary statistics for pattern features across a range of length scales open up many possibilities for quantitatively comparing large data sets of in silico and in vivo pattern data in the future.By working closely with the needs of each application, we expect that our topological perspective can be extended to analyze agent-based dynamics in wound healing, animal flocks, and other forms of collective behavior.
The goal of persistent homology is to associate homology groups to data.For the scope of our paper, we can think of homology groups as vector-space representations of a topological object with dimension corresponding to the number of "holes" that object has.(For more details on algebraic topology, see [63].)Specifically, the 0th and 1st dimension homology groups of an object are vector spaces whose dimensions are the number of connected components and loops, respectively, that the object has.For example, a figure-eight object has a single connected component and two loops (see Fig. 2 in the main manuscript and Fig. S2).In other words, the 0th homology group of a figure eight has one generator, and its first homology group has two generators.
Next consider a data set S = {x i } i∈I , where S is any collection of points living in a given metric space (D, d).For example, S could be a set containing the coordinates of all of the pigment cells in a zebrafish skin pattern.If we were to extract the homology groups of S directly, we would get uninformative representations.In particular, the 0th homology group of S has dimension equal to the number of data points in S and all of the higher homology groups are trivial (e.g., if S contained the coordinates of pigment cells, the 0th homology group of S would simply be the number of cells in S).Instead, in persistent homology, one builds simplicial-complex representations of S to infer the homology of the manifold from which the data were sampled.In our example, this allows us to study the qualities of the stripes and spots that are made up of the pigment cell coordinates in S.
A k-simplex of k + 1 affinely independent points is the convex polygon whose vertices are precisely the k + 1 affinely independent points [46,47].In other words, a 0-simplex σ 0 (x 1 ) is a single point x 1 ; a 1-simplex σ 1 (x 1 , x 2 ) is an edge connecting x 1 and x 2 ; a 2-simplex σ 2 (x 1 , x 2 , x 3 ) is a filled triangle whose vertices are x 1 , x 2 , and x 3 ; and so on.To begin building simplicial-complex representations of S, we place a ball of radius r centered at each x i ∈ S to obtain {B r (x i ) = {y ∈ D : d(x i , z) ≤ r}} i∈I .The simplicial complex representation of S with respect to r is the union of all k-simplices σ k (x j 0 , . . ., x j k ) such that B r (x j l ) ∩ B r (x j j ) = ∅ for all l, j = 0, 1, . . ., k.This is known as the Vietoris-Rips complex of S with respect to r.We can now compute the homology groups of this simplicial-complex representation of S. See Fig. S1 for examples of different simplicies.(D-E) Given 3 points (namely, x 1 , x 2 , and x 3 ), we show their Vietoris-Rips simplicial complex with respect to two separate radius parameters.In panel (D), the radius parameter r is small enough so that B r (x i ) ∩ B r (x j ) = ∅ for all i, j = 1, 2, 3, i = j.Consequently, the Vietoris-Rips complex with respect to the r value in (D) simply consists of three 0-simplices: σ 0 (x 1 ), σ 0 (x 2 ), and σ 0 (x 3 ).As the radius parameter r grows in panel (E), we find pairwise non-empty intersections B r (x 1 ) ∩ B r (x 2 ) = ∅, B r (x 1 ) ∩ B r (x 3 ) = ∅, and B r (x 2 ) ∩ B r (x 3 ) = ∅, so the resulting Vietoris-Rips simplicial complex is the union of three 0-simplices σ 0 (x 1 ), σ 0 (x 2 ), and σ 0 (x 3 ); three 1-simplices σ 1 (x 1 , x 2 ), σ 1 (x 1 , x 3 ), and σ 1 (x 2 , x 3 ); and one 2-simplex σ 2 (x 1 , x 2 , x 3 ).
Inevitably, the homology groups of the simplicial-complex representation of S will be sensitive to the choice of scaling parameter r.To overcome this challenge, persistent homology instead tracks how the homology groups change across an increasing sequence of scaling parameters {r j } j∈J .The scaling parameter at which a homological generator appears is called the birth radius (r b ) of a topological feature.Similarly, the value of r at which a homological generator disappears is called the death radius (r d ) of a topological feature.The persistence of a topological feature is its lifetime, namely r d − r b .
In TDA there are two primary ways of visualizing the persistent homology of a data set.The first method, called a barcode diagram, is a collection of bars.Each bar in a barcode diagram represents a topological generator; the left endpoint of a bar corresponds to the birth radius of that generator, and the right endpoint of a bar corresponds to its death radius.Topological features that persist for across a longer range of radii values are identified by having longer bar representations in the barcode diagram.The second method, called a persistence diagram, is a collection of points ; and, the two triangular points in pink in the top left corner of the diagram correspond to the two persistent dimension 1 features (namely, the two loops in our figure-eight data).We calculated persistent homology using Ripser [54]. in R 2 ≥0 .Each point in a persistence diagram corresponds to a topological feature, and the xand y-coordinates of the points are the birth and death radii, respectively, of those features.In a barcode diagram, topological features that persist across a broader range of parameter values r lie farthest away from the diagonal line y = x.These persistent homology visualizations provide a low-dimensional, descriptive representation of the original data S and are therefore useful for a range of data-driven tasks.For example, in Fig. S2F-G

Measuring Local Pattern Features
While TDA and machine learning offer exciting insights into biological patterns, raw calculations of agent-agent distances and agent-density counts are still important measurements.We therefore use direct calculations in tandem with the aforementioned methods for tracking pattern variability.In particular, we calculate the mean and variance of nearestneighbor distances between agents directly using location data.The coefficient of variation (CV) for the distance between  Fig. S3: Measurements of local pattern features.We show distribution plots for (A) melanophore coefficient of variation [11,42], (B) mean M -M spacing, (C) mean X d -X d spacing, (D) mean X l -X l spacing, and (E) mean M -X d spacing.
We base these distributions for wild-type, nacre, pfeffer, and shady patterns on 1, 000 model simulations (for each pattern type) under the default parameter regime in [21].We indicate the mean plus/minus the standard deviation for these measurements in each figure.
For zebrafish, low values of the M CV are associated with better-formed patterns [11,42].We compute A-B agent density counts by counting the number of agents of type A per local neighborhood of the query agent type B, where local neighborhoods are defined as disks of radius R centered at the query agent (we use R = 250 µm for X l cells and R = 200 µm for I l cells).We then use the 20th percentile of agent counts across all local neighborhoods as our summary statistic (other summary statistics could also be used).It is helpful to note that the A-A agent density is just the traditional measurement of the density of agent A, but we find that extending this definition to allow for counting one agent type against another is helpful in some cases.For example, in the pfeffer mutant pattern (see Fig. 1C and 1G in the main manuscript), blue I l cells are present only when there are sufficient black M nearby.To calculate the I l -M density, we count the number of M cells within a local neighborhood of each I l cell and take a summary measure across all neighborhoods.
We present the M CV and nearest-neighbor cell-cell spacing measurements from 1, 000 model simulations under the default parameter regime for wild-type, nacre, pfeffer, and shady in Fig. S3, and we show X l -M and I l -M cell-density counts in Fig. S4.We observe that the shady mutant has the largest variability in M CV (Fig. S3A).Moreover, the average M -M nearest-neighbor spacing is smallest for shady and greatest for pfeffer (Fig. S3B).Interestingly, the distributions of the mean X d -X d nearest-neighbor spacing are completely disjoint for wild-type, nacre, and shady.The wild-type X d -X d nearest-neighbor spacing is the smallest and the shady X d -X d nearest-neighbor spacing is the largest, with all three types having low variance (Fig. S3C).In contrast, the wild-type M -X d nearest-neighbor spacing is significantly greater than the shady M -X d nearest-neighbor spacing (Fig. S3E).Finally, we note that the X l -M (Fig. S4A) and I l -M (Fig. S4B) density counts are greater for wild-type than they are for the mutants.

Summary of Model Rules Adjusted in our Study of Pattern Variability
In Fig. 6 in the main text, we quantify the pattern variability induced by replacing the original deterministic length scales used in the model [21] with stochastic length scales.In particular, the model rules in [21] depend on five interaction neighborhoods: B z 75 , B z ∆xm , B z 90 , and B z 90/2 denote the disks of radii 75 µm, 90 µm, ∆ xm = 82 µm, and 90/2 = 45 µm, respectively, centered at the position z of the cell (or precursor) of interest, and Ω z long is the annulus of inner radius 210 µm and width 40 µm centered at position z.With the exception of B z ∆xm , which is used in the rules that prevent cell overcrowding due to continual birth, we replace these deterministic neighborhoods with stochastic versions to generate our results in Fig. 6  where we consider σ ∈ {0.01, 0.05, 0.1, 0.2, 0.3, 0.5}, so that the standard deviation of the noise is a fraction of its mean. 8We write B z 90/2,σ instead of B z 45,σ to stress that, when we randomly select r ∼ N (90, σ • 90) µm, we use that same r value to define both the randomly-chosen radius of the disk B z 90,σ and the randomly-chosen diameter of the the disk B z 90/2,σ .This means that the radius of B z 90/2,σ is always half the radius of B z 90/2,σ for the cell at position z (we made this choice because the size of these disks depends on a single parameter in the model [21]).In all other cases, following the approach in [21], the length scales involved in the disk and annulus neighborhoods are chosen independently.
We include these stochastic cell-interaction neighborhoods in the model rules for M birth, M death, and cell-form transitions.Because it makes the model more amenable to testing a wide range of length scales, we also reframe the model rules in [21] so that the rules for M birth, M death, and xanthophore form transitions depend on the ratios of cell counts in the interaction neighborhoods rather than absolute cell counts, and the rules for iridophore form transitions depend on density counts.While ideally all rules would be modified to be in terms of density counts, this is only feasible for the iridophore form transition rules that have only one interaction neighborhood per inequality.
We present the adjusted model rules that we use to generate our results in Fig. 6 in the main text in (S6)-(S11) below.We refer to [21] for their biological motivation; for each rule, we select each stochastic length scale (namely, (S2)-(S5)) randomly per cell per day (i.e., the time step for cell interactions used in [21]).In all other cases, the parameters have the default values given in [21].Specifically, we adapt the model rule for M birth at position z from [21] to the following: , Ω z long } is the indicator function for the region R (in particular, 1 R (x) = 1 if x is in the region R and 0 otherwise).Our adapted rule for melanophore death due to local competition [16,21], in turn, is given by: > µ =⇒ death of melanophore at M i due to local competition with X d . (S7) The model [21] also includes a rule that M cells may die due to the absence of long-range signals that are necessary for their survival [16] when blue I l cells are not present nearby.We adapt this rule as follows: ≥ ξ and 1 B M i 90/2,σ (I l j ) < ν =⇒ death of cell at M j with probability p death per day.
In the same way, we adapt the model rule for xanthophore form transitions from [21] by replacing the deterministic interaction neighborhoods in the original rules with their stochastic equivalents and reframing the rules in terms of ratios: Lastly, we adapt the rules proposed in [21] for iridophore form transitions (between dense and loose) by rescaling by the area of the interaction neighborhoods so that these rules are now in terms of density counts as follows: Table S2: Quantifying variability and breakdown for wild-type zebrafish patterns as a function of additional stochasticity in cell interactions (also see Fig. 6 in the main text).We base these measurements on 1, 000 simulations (for each value of σ) of our adapted version of the model [21], adjusted as described in Summary of Model Rules Adjusted in our Study of Pattern Variability.We choose cell-interaction length-scale parameters randomly each day per cell and interaction from a normal distribution (centered at the default length scale) with increasing standard deviation given by σ times the default length scale.Table S5: Quantifying pattern variability and breakdown for the shady mutant as a function of additional stochasticity in cell interactions (also see Fig. 6 in the main text).We base these measurements on 1, 000 simulations (for each value of σ) of our adapted version of the model [21], adjusted as described in Summary of Model Rules Adjusted in our Study of Pattern Variability.We choose cell-interaction length-scale parameters each day randomly per cell and interaction from a normal distribution (centered at the default length scale) with increasing standard deviation given by σ times the default length scale.

Fig. 3 .
Fig. 3. Adult phenotypes of xanthophore, melanophore and double mutants.pfe ( A, A! ), nac (B-B"), nac;pfe(C,C!) and shd;pfe(D,D !) mutant fish.The boxed areas in A-D are enlarged in A ! -D ! .Arrow in A ! marks X1V in the pfe mutant.In the nac mutant, xanthophores strictly colocalise with dense S-iridophores (B! ); the arrow points to a region of X1V that is enlarged in B" .S-iridophores spread over the entire flank in nac;pfe (C! ).Melanophores associated with the blood vessels are visible along myosepta in shd;pfe(D,D!).

Fig. 2 .
Fig. 2. Details of adult phenotypes in iridophore mutants.Images of the anterior trunk of fixed animals photographed under white light.L-iridophores (black arrows) and dense S-iridophores (white arrows) reflect maximally at different angles of light and therefore shine up in different anteroposterior regions.Insets in A-C demarcate the magnified views in A ! -C ! that were taken under UV light to avoid reflections from iridophores and to highlight xanthophores.Under this illumination, Liridophores are visible as bundles of greyish vertically oriented fibres (black arrows).( A , A ! ) Wild-type fish.( B , B ! ) rsemutant fish.An illumination was chosen to avoid reflection of L-iridophores, whereas S-iridophores are visible (white arrow).L-iridophores, which are largely restricted to melanophore regions in wild type (A,A !, black arrows), expand into interstripe regions in rse mutants (B,B ! ).The light appearance in stripe 1D

Fig. 2 .
Fig. 2. Details of adult phenotypes in iridophore mutants.Images of the anterior trunk of fixed animals photographed under white light.L-iridophores (black arrows) and dense S-iridophores (white arrows) reflect maximally at different angles of light and therefore shine up in different anteroposterior regions.Insets in A-C demarcate the magnified views in A ! -C ! that were taken under UV light to avoid reflections from iridophores and to highlight xanthophores.Under this illumination, Liridophores are visible as bundles of greyish vertically oriented fibres (black arrows).( A , A ! ) Wild-type fish.( B , B ! ) rsemutant fish.An illumination was chosen to avoid reflection of L-iridophores, whereas S-iridophores are visible (white arrow).L-iridophores, which are largely restricted to melanophore regions in wild type (A,A !, black arrows), expand into Fig.1: Self-organization during development: diverse skin patterns form on zebrafish due to the interactions of pigment cells.(A) Wild-type zebrafish feature dark stripes and light interstripes[5,12], while mutant patterns that form because a particular cell type is missing have altered, more variable patterns.(B) Nacre (encoding mitfa)[10,13] has an enlarged central orange region flanked by blue patches.(C) Pfeffer (encoding csf1rA)[10,11,14] is characterized by messy spots arranged horizontally[12].(D) The shady mutant (encoding ltk)[12,15] often features smooth black spots roughly arranged in stripes.(E) During pattern formation, pigment cells extend long legs (measuring up to half a stripe width in distance) toward interstripe cells for communication[27].(F-I) The agent-based model[21] replicates zebrafish patterns in silico.(The central light interstripe is often labeled X0, and the next two interstripes are called X1V and X1D[12].)(J) Rules for agent behavior in the model[21] depend on the cells in short-range disks and a long-range annulus.(K) Summary of the main pigment cells involved in patterning: interstripes consist of orange dense xanthophores and silver dense iridophores, and dark stripes contain yellow loose xanthophores, blue loose iridophores, and black melanophores.Images (A-D) reproduced from Frohnhöfer et al.[12] and image (E) reproduced from Hamada et al.[27] with adaption; images (A-E) licensed under CC-BY 3.0 (http://creativecommons.org/licenses/by/3.0)and published by The Company of Biologists Ltd.Images (F-J) reproduced with minor adaption from Volkening and Sandstede[21] and licensed under CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/).

r = r 3 r = r 2 r = r 1 r = r 1 Fig. 2 :
Fig. 2: Illustration of persistent homology applied to coordinate data.(A) Noisy data sampled from a figure-eight shape and (B) corresponding manifold expansions.

Fig. 4 :
Fig.4: Baseline quantification of wild-type patterns.All measurements are based on 1, 000 simulations of the model[21] under the default parameter regime.(A) We use persistent homology to detect the presence of breaks in stripes and interstripes.(Following the example in[21], we do not count breaks in the dark stripes along the top and bottom boundaries of the domain.)Red scale bar is 500 µm.(B) Distribution of times at which interstripes X1D and X1V (see Fig.1F) begin to form.(C) Distribution of maximum interstripe width.(D) Distribution of stripe curviness; also see Fig.3C.In (B-D), we display histograms of in silico data and kernel density estimator (KDE) curves with a Gaussian kernel in black; the mean plus/minus the standard deviation is shown in each plot for the data.

Fig. 5 :
Fig. 5: Baseline study of mutant patterns to extract quantifiable features.All measurements are based on 1, 000 simulations of the model [21] (for each mutant) under the default parameter regime.Histograms show distributions for (A) the number of spots, (B) spot size, (C) spot roundness, (D) variance in spot spacing, and (E) X0 interstripe width (see Fig.1G).We show KDE curves with a Gaussian kernel overlaid on the histograms of data; the mean plus/minus the standard deviation is shown in each plot for the data.
noise in spatial !scale of cell interactions (fraction of mean) 0.1 0.

Fig. 6 :
Fig.6: Quantitative study of how stochasticity in cell interactions affects wild-type and mutant zebrafish patterns.For each value of σ ∈ {0.01, 0.05, 0.1, 0.2, 0.3, 0.5}, where σ times the default length scale is the standard deviation of the noise that we include in the size of cellular interaction neighborhoods, we analyze 1, 000 simulations for wild-type and each mutant.Summary of the patterns that emerge under stochasticity, as detected using our methods for (A, E) wild-type, (B, F) nacre, (C, G) pfeffer, and (D, H) shady.In (A-D), we highlight the range of σ values that retain at least 50% characteristic patterns under noise in grey.(We define "characteristic" for wild-type as patterns having 3 unbroken interstripes and 2 unbroken stripes, and we define "characteristic" for mutants as patterns with spot size and spot number that fall within the baseline distributions in Fig.5A-B.)(I) Mean maximum stripe/interstripe width and (J) mean stripe curviness for wild-type for different noise strengths.(K) Spot spacing variance and (L) spot roundness for mutants under different noise strengths.In (I-L), the bars indicate standard deviation and the shaded regions give the characteristic values (the mean plus/minus one standard deviation) for the associated measurements from our default studies.Numbers in (E-H) are given to three significant figures; also see Tables S2-S5.

Fig. 7 :
Fig. 7: Quantifying in silico pattern dependence on the spatial scale of long-range cellular interactions involved in M birth.Kernel density estimates for (A-B) maximum stripe and interstripe width for wild-type, (C) wild-type stripe curviness, (D) number of spots for pfeffer and shady, (E) median spot size for the mutants, and (F) pfeffer and shady spot roundness as a function of the inner radius of the Ω long neighborhood in (1).Measurements in (A-F) are based on 100 simulations of the model [21] (for wild-type, pfeffer, and nacre, respectively) for each inner radius R of Ω long in (1) considered.(We consider R from 10 to 400 µm in increments of 25 µm.)All other model parameters (including the width of the Ω long annulus in (1) and the long-range annulus scale in all other model rules) remain at their default values.In (A), (B), and (E) we show linear regression models for their corresponding values, along with the R 2 goodness-of-fit scores.(G) Example wild-type, pfeffer and shady patterns for different parameter values (the patterns generated by the model [21] under the default parameter -210 µm -are noted in gray).

Fig. S2 :
Fig.S2: Illustration of persistent homology applied to coordinate data and corresponding barcode and persistence diagrams.For easier comparison with our barcode and persistence diagrams, here we show (A) original coordinate data sampled from a figure-eight shape and (B-E) corresponding manifold expansions given by {b r (x i )} N i=1 = {y ∈ D, i ∈ [1, N ] : d D (x i , y) ≤ r} for r = r 1 < r 2 < r 3 < r 4 (also seeFig. 2 in the main text).(F-G) We show the dimension 0 and dimension 1 barcode diagrams that correspond to persistent homology applied to the figure-eight data.The long bar in panel (F) represents the single connected component of the figure-eight shape, and the two long bars in (G) correspond to the two loops of the figure-eight shape.(H) As an alternative means of viewing persistent homology results, we also show the analogous persistence diagram.The circular point in teal in the top left corner of the diagram represents the persistent dimension 0 feature (namely, the single connected component in our figure-eight data); and, the two triangular points in pink in the top left corner of the diagram correspond to the two persistent dimension 1 features (namely, the two loops in our figure-eight data).We calculated persistent homology using Ripser[54].
we show the dimension 0 and dimension 1 barcode diagrams corresponding to the figure-eight example in the main text (see Fig. 2 in the main text).We show the analogous persistence diagram for this example in Fig. S2H.The long bar in Fig. S2F represents the single connected component of the figure-eight shape and the two long bars in Fig. S2G correspond to the two loops of the figure-eight shape.In the persistence diagram, these topological features are represented by the top left circular teal point in Fig. S2H (dimension 0 feature) and the two triangular pink points in the top left corner in Fig. S2H (dimension 1 features).
neighboring agents provides an additional measurement of pattern quality: CV = 100 × std(agent-agent distances) mean(agent-agent distances).
Fig. S4: Measurements of cell density.We show distribution plots for (A) X l -M and (B) I l -M density counts.We base these distributions for wild-type, pfeffer, and shady patterns on 1, 000 simulations of the model [21] under the default parameter regime.(We do not show distributions for nacre because this mutant lacks melanophores.)We indicate the mean plus/minus the standard deviation of these cell-cell density measurements in each figure.