Estimating the reproduction number and transmission heterogeneity from the size distribution of clusters of identical pathogen sequences

Significance For many infectious diseases, a small fraction of individuals has been documented to disproportionately contribute to onward spread. Characterizing the extent of superspreading is a crucial step towards the implementation of efficient interventions. Despite its epidemiological relevance, it remains difficult to quantify transmission heterogeneity. Here, we present an inference framework harnessing the size of clusters of identical pathogen sequences to estimate the reproduction number and the dispersion parameter. We also show that the size of these clusters can be used to estimate the transmission advantage of a pathogen genetic variant. This work provides crucial tools to better characterize the spread of pathogens and evaluate their control.


2013-2015 Middle East respiratory syndrome outbreak
Between March 2012 and 31 December 2015, 1,646 cases were identified globally (49). We analyzed 174 aligned MERS-CoV sequences sampled in humans between 2013 and 2016, analyzed in Dudas et al. (25) and made available at (50). This set of sequences thus corresponded to 10.6% of all detected cases. For the analysis, we explored scenarios where: (i) all infections were detected (corresponding to an infection sequencing ratio of 10.6%); and (ii) half of infections were detected (corresponding to an infection sequencing ratio of 5.3%). For the MERS analysis, we used a threshold of 10,000 for the maximum cluster size. Increasing this value to 50,000 did not impact our estimates.

2017-2018 measles outbreak in the Veneto Region (Italy)
We used data describing the 2017-2018 measles outbreak in the Veneto Region, located in the North-East of Italy (4). We used 30 sequences (out of 322 suspected cases) analyzed in Pacenti et al. (4). Sequences were aligned using the Nextstrain measles workflow (5)(6)(7). This set of sequences thus corresponded to 9.3% of detected cases. For the analysis, we explored scenarios where: (i) all infections were detected (corresponding to an infection sequencing ratio of 9.3%); and (ii) half of infections were detected (corresponding to an infection sequencing ratio of 4.7%). For the measles analysis, we used a threshold of 10,000 for the maximum cluster size. Increasing this value to 50,000 did not impact our estimates.
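The infection sequencing ratios quoted for the MERS and measles analyses follow from simple arithmetic. The snippet below is a minimal sketch, assuming the ratio is the fraction of detected cases that were sequenced multiplied by the assumed fraction of infections detected; the function name is illustrative and not taken from the original analysis.

```python
def infection_sequencing_ratio(n_sequences, n_cases, prop_infections_detected):
    """Fraction of all infections that ended up sequenced (illustrative helper)."""
    prop_cases_sequenced = n_sequences / n_cases
    return prop_cases_sequenced * prop_infections_detected


# Counts reported in the text: (sequences, detected cases)
datasets = {"MERS": (174, 1646), "measles": (30, 322)}

for name, (n_seq, n_cases) in datasets.items():
    for detected in (1.0, 0.5):  # all infections detected, or half of them
        ratio = infection_sequencing_ratio(n_seq, n_cases, detected)
        print(f"{name}, {detected:.0%} detected: {100 * ratio:.1f}% sequenced")
```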

COVID-19 pandemic in New Zealand
We analyzed 27,565 SARS-CoV-2 sequences from New Zealand downloaded from the GISAID EpiCoV database on December 8, 2022 (8, 9) and curated using the Nextstrain nCoV ingest pipeline (10). Clusters of identical sequences were generated using the approach detailed below and grouped by time period based on the collection date of the earliest sequence within the cluster.
We estimated the offspring distribution of COVID-19 in New Zealand during the Zero COVID era (April 2020 - July 2021). We assumed that the dispersion parameter k remained constant throughout the period but allowed the reproduction number R to vary between time periods (April-May 2020, June-December 2020, January-April 2021 and May-July 2021). As a baseline scenario, we considered that 80% of infections were detected throughout this period. Since autochthonous transmission remained extremely limited during this period, it is indeed unlikely that a large fraction of infections was undetected by the surveillance system. As a sensitivity analysis, we also explored scenarios where 50% and 100% of infections were detected. For each time period, the fraction of cases sequenced was estimated as the ratio of the number of sequences collected to the number of cases reported (11) during this time period.
For this analysis, we used a threshold of 10,000 for the maximum cluster size. Increasing this value to 50,000 did not impact our estimates.
As k estimates might be difficult to interpret, we computed the expected proportion of individuals contributing to 80% of transmission events. This was done for the maximum likelihood estimates of the dispersion parameter and the reproduction numbers across the 4 periods (central estimates) and for the bounds of the 95% likelihood profile confidence interval for k (while using maximum likelihood estimates for R).
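To illustrate how a value of k translates into a proportion of individuals responsible for 80% of transmission, the sketch below assumes a negative binomial offspring distribution with mean R and dispersion parameter k, ranks individuals by their number of offspring, and accumulates transmission from the most infectious downwards. This is one possible way of doing the computation, not necessarily the one used for the estimates reported here; the function names and the truncation point `x_max` are assumptions.

```python
import math


def nb_pmf(x, mean, k):
    """Negative binomial probability of x offspring (mean `mean`, dispersion k)."""
    log_p = (math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
             + k * math.log(k / (k + mean)) + x * math.log(mean / (k + mean)))
    return math.exp(log_p)


def proportion_responsible(mean, k, share=0.8, x_max=5000):
    """Smallest proportion of individuals accounting for `share` of transmission.

    Individuals are ranked from the largest to the smallest number of
    offspring; the offspring distribution is truncated at x_max.
    """
    target = share * mean          # expected offspring per individual to cover
    cum_people = 0.0               # proportion of individuals included so far
    cum_transmission = 0.0         # expected offspring covered so far
    for x in range(x_max, 0, -1):
        prob = nb_pmf(x, mean, k)
        if cum_transmission + x * prob >= target:
            # Only part of this class is needed to reach the target
            return cum_people + (target - cum_transmission) / x
        cum_people += prob
        cum_transmission += x * prob
    return cum_people


# Hypothetical values: strong versus weak transmission heterogeneity
for k in (0.1, 1.0, 10.0):
    print(k, round(proportion_responsible(mean=1.0, k=k), 3))
```

Smaller values of k concentrate transmission in fewer individuals, which is why the proportion shrinks as k decreases.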

COVID-19 pandemic in Washington state
We analyzed 140,790 SARS-CoV-2 sequences from Washington state (United States of America) downloaded from the GISAID EpiCoV database on December 8, 2022 (8, 9) and curated using the Nextstrain nCoV ingest pipeline (10). As for the New Zealand analysis, we allocated a date to each cluster of identical sequences based on the collection date of the earliest sequence within that cluster.
We applied our transmission advantage framework to the following variants: D614G, Epsilon, Alpha, Delta, Omicron BA.1, Omicron BA.2 and Omicron BA.4/BA.5. We defined variant-specific study periods beginning on the day when at least 10 variant sequences had been collected (cumulative) and lasting between 1 and 60 days (Table S8). For the Omicron BA.1 analysis, we only considered time windows of at most 50 days, to restrict the analysis to the period before the spread of BA.2. For each analysis window, we generated the size distribution of clusters of identical sequences for both the variant and non-variant genetic subpopulations (Figure S12), considering only clusters of identical SARS-CoV-2 sequences that started during this window. From this, we applied our transmission advantage inference framework and computed a p-value associated with the statistical test defined by the following null hypothesis: H0: There is no difference in the reproduction number of the variant and non-variant.
This was done by accounting for the fraction of cases sequenced in Washington state during these different time periods and exploring different assumptions regarding the fraction of infections detected. We also accounted for the shorter generation time associated with Omicron variants (12, 13), which affects the probability that an infector and an infectee have the same consensus sequence (Table S2).
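The definition of the study windows can be sketched as follows. This is an illustrative reading of the rule above (window opening on the day the cumulative number of variant sequences reaches 10, clusters dated by their earliest sequence); the function names and the choice of a half-open window are assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def window_start(variant_collection_dates, min_sequences=10):
    """First day on which the cumulative number of variant sequences reaches the minimum."""
    counts = Counter(variant_collection_dates)
    cumulative = 0
    for day in sorted(counts):
        cumulative += counts[day]
        if cumulative >= min_sequences:
            return day
    return None  # threshold never reached


def clusters_in_window(cluster_first_dates, start, length_days):
    """Indices of clusters whose earliest sequence falls in [start, start + length_days)."""
    end = start + timedelta(days=length_days)
    return [i for i, first in enumerate(cluster_first_dates) if start <= first < end]


# Hypothetical collection dates for a variant
variant_dates = [date(2021, 1, 1)] * 4 + [date(2021, 1, 3)] * 5 + [date(2021, 1, 5)] * 3
start = window_start(variant_dates)

# Hypothetical dates of the earliest sequence of four clusters
cluster_dates = [date(2021, 1, 4), date(2021, 1, 5), date(2021, 1, 20), date(2021, 2, 10)]
print(start, clusters_in_window(cluster_dates, start, length_days=30))
```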

Supplementary methods - Estimating the probability that a transmission event occurs before a substitution event
The generation time often cannot be approximated by an exponential distribution (e.g. its standard deviation is generally not equal to its mean). We relaxed this assumption and empirically derived the probability that a transmission event occurs before a substitution event using a simulation approach for the following pathogens: mumps virus, MERS-CoV, SARS-CoV, Ebola virus, mpox during the 2022-2023 outbreak, measles virus, RSV, Zika virus, influenza A (H1N1pdm and H3N2) and SARS-CoV-2. For SARS-CoV-2, we accounted for a reduction in the generation time for the Omicron variant (49, 50). For each pathogen, we identified relevant parameters describing the mean and the standard deviation of the generation time. The generation time is the duration between the infection time of an index case and the time at which this index case infects a secondary case. We then drew $n_{\mathrm{sim}} = 10^7$ generation intervals from a Gamma distribution parametrized with the same mean and standard deviation. We also drew $n_{\mathrm{sim}} = 10^7$ delays until the occurrence of a first substitution from an exponential distribution whose rate is the substitution rate obtained from the literature. We then computed the proportion of simulations for which the generation time was shorter than the delay until the occurrence of a first substitution to obtain an estimate of the probability that transmission occurs before substitution.
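The simulation described above can be sketched in a few lines. The parameter values below are hypothetical placeholders (not estimates for any of the pathogens listed), the number of draws is reduced from the 10^7 used in the analysis to keep the example fast, and the substitution rate is assumed to be expressed per genome per day. The closed-form expression, which follows from the Laplace transform of the Gamma distribution, is added here only as a sanity check and is not part of the described procedure.

```python
import random


def gamma_shape_scale(mean, sd):
    """Shape and scale of a Gamma distribution with the given mean and standard deviation."""
    return (mean / sd) ** 2, sd ** 2 / mean


def prob_transmission_before_substitution(mean_gt, sd_gt, subs_rate, n_sim=100_000, seed=1):
    """Monte Carlo estimate of P(generation time < delay until first substitution)."""
    rng = random.Random(seed)
    shape, scale = gamma_shape_scale(mean_gt, sd_gt)
    n_before = 0
    for _ in range(n_sim):
        generation_time = rng.gammavariate(shape, scale)   # days
        substitution_delay = rng.expovariate(subs_rate)    # days, rate per day
        if generation_time < substitution_delay:
            n_before += 1
    return n_before / n_sim


def closed_form_probability(mean_gt, sd_gt, subs_rate):
    """E[exp(-rate * G)] for a Gamma-distributed generation time G (sanity check)."""
    shape, scale = gamma_shape_scale(mean_gt, sd_gt)
    return (1.0 + subs_rate * scale) ** (-shape)


# Hypothetical inputs: mean 5 days, sd 2 days, 0.07 substitutions per genome per day
p_sim = prob_transmission_before_substitution(5.0, 2.0, 0.07)
p_exact = closed_form_probability(5.0, 2.0, 0.07)
print(round(p_sim, 3), round(p_exact, 3))
```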
We explored how uncertainty around the mean generation time and the pathogen's substitution rate impacted estimates of p. Uncertainty in the substitution rate (respectively the mean generation time) was accounted for by fixing the mean generation time (respectively the substitution rate) to the central estimate and using the lower and upper bounds for the substitution rate (respectively the generation time) reported in the different studies. From this, we obtained 4 values for the probability that transmission occurs before substitution. We reported the lowest and the highest of these 4 estimates as our lower and upper bound estimates for p.
A The uncertainty range for the MERS-CoV substitution rate was obtained by subtracting and adding the standard deviation reported in (16) to the central estimate.
B The uncertainty range for the measles generation time was obtained by considering the range of values reported for the mean measles generation time in (17).
C The uncertainty range for the measles substitution rate was obtained by subtracting and adding the standard deviation obtained with the Nextstrain measles workflow (32).
D We did not explore any uncertainty around the SARS mean generation time (no estimates found).
E For Omicron, we assumed that the mean generation time was one day shorter than for pre-Omicron variants (12, 13) and considered that it was characterized by the same standard deviation.

Table S6: Parameter estimates for SARS-CoV-2 in New Zealand under our lower bound estimate for the probability p that transmission occurs before substitution. Maximum likelihood estimates are reported along with 50% and 95% confidence intervals (CI).

Proportion of infections detected as cases
Table S7: Parameter estimates for SARS-CoV-2 in New Zealand under our upper bound estimate for the probability p that transmission occurs before substitution. Maximum likelihood estimates are reported along with 50% and 95% confidence intervals (CI).

Strain name Accession number
Figure S3: Relative bias on the dispersion parameter k estimate when the reproduction number lies below the threshold of 1/p. For each true value of the dispersion parameter k (x-axis) and value of the probability p that an infector and an infectee have the same consensus sequence, the boxplot depicts the distribution of the relative bias across 100 simulations for different dataset sizes (colours). The relative bias is defined as $(k_{\mathrm{MLE}} - k_{\mathrm{true}})/k_{\mathrm{true}}$ where $k_{\mathrm{true}}$ is the true dispersion parameter used to generate synthetic cluster data and $k_{\mathrm{MLE}}$ our maximum likelihood estimate. The simulations were run assuming that 50% of infections were sequenced and for a true reproduction number of 1.0. The y-axis was cropped at 2 to increase readability. The boxplots represent the 2.5%, 25%, 50%, 75% and 97.5% percentiles.

Figure S5: Impact of reaching the reproduction number threshold on dispersion parameter estimates. The relative bias is defined as $(k_{\mathrm{MLE}} - k_{\mathrm{true}})/k_{\mathrm{true}}$ where $k_{\mathrm{true}}$ is the true dispersion parameter used to generate synthetic cluster data and $k_{\mathrm{MLE}}$ our maximum likelihood estimate. The boxplots depict the 2.5%, 25%, 50%, 75% and 97.5% percentiles of the relative bias obtained across all the simulations we performed, which are detailed in the methods section.

Proportion of transmission pairs with identical consensus genomes from the simulations
For each of these scenarios, we computed the fraction of transmission pairs with identical consensus sequences and compared this to the theoretical probability that a transmission event occurs before a substitution event (dashed horizontal lines in Figure S18). For narrow transmission bottlenecks of size 1, the theoretical probability approximates well the proportion of pairs with identical consensus sequences. As the transmission bottleneck widens, this theoretical probability no longer approximates well the proportion of pairs with identical consensus sequences. When simulating the spread and evolution of a pathogen characterized by a longer generation time (200 days), we found that the proportion of transmission pairs with identical consensus genomes differed slightly from that expected from the theoretical probability that a transmission event occurs before a substitution event.
To conclude, this simulation study shows that the probability that transmission occurs before substitution is a good proxy for the proportion of pairs with identical consensus genomes for pathogens characterized by narrow transmission bottlenecks, relatively short infectious durations and limited within-host diversity.
Figure S18: Proportion of transmission pairs with identical consensus sequences exploring different assumptions regarding the transmission bottleneck size and the disease generation time ("inf.duration" in days). Vertical segments correspond to 95% confidence intervals. Each point was obtained by generating 1,200 transmission pairs. The horizontal dotted lines correspond to the probability that a transmission event occurs before a substitution event.

Within-host diversity
Our simulation framework makes a number of simplifying assumptions regarding the within-host mutation process and the pathogen's population dynamics. We checked whether our simulations were associated with reasonable outputs in terms of within-host diversity. We computed the nucleotide diversity as the mean number of single nucleotide polymorphism (SNP) differences per site across scenarios (Figure S19).
As expected, we found that within-host pairwise nucleotide diversity increases with both the duration of infectiousness and the size of the transmission bottleneck. For pathogens characterized by narrow transmission bottlenecks (1 to 2) and short generation times, we obtained nucleotide diversity at the genome level of the order of $10^{-5}$ to $10^{-4}$ SNPs per site. This fits within the range of estimates obtained across RNA viruses causing acute infections ($10^{-6}$ to $10^{-4}$ for SARS-CoV-2 (33, 35), $10^{-5}$ for RSV (36, 37), $10^{-4}$ for DENV-1 (38), $10^{-5}$ for influenza A viruses (39)).

Supplementary text B - Inference of transmission parameters conditional on cluster extinction
In the main text, we showed that our inference framework provides unbiased estimates of both R and k when the mean number of offspring with identical sequences is lower than 1 ($p \cdot R < 1$), corresponding to situations where cluster extinction is almost certain. Our method however becomes unreliable when the probability of cluster extinction is strictly lower than 1. That analysis relied on the full distribution of clusters of identical sequences, including those that did not go extinct (whose size was then set to an arbitrarily high threshold value). An alternative approach would consist of looking at cluster sizes conditional on extinction (40, 41). Previous theoretical work has indeed shown that a supercritical epidemic process where extinction is uncertain (characterized by R > 1) can be mapped to a subcritical counterpart characterized by a mean number of offspring lower than 1 (R < 1) and the same dispersion parameter (40).
In the following paragraphs, we show how our inference framework could be adapted to look at the size of clusters of identical sequences conditional on them having gone extinct. We then evaluate the performance of this adapted statistical framework and highlight some remaining challenges for real-world applications.

Distribution of the size of clusters of identical sequences conditional on extinction
We used the formalism introduced by Waxman and Nouvellet (40) to describe the size distribution of clusters of identical pathogen sequences conditional on extinction. They showed that, conditional on extinction, supercritical and subcritical dynamics (respectively characterized by a reproduction number above and below 1) cannot be distinguished. Waxman and Nouvellet characterized the size of finite disease outbreaks. Here, we instead consider finite mutation-less outbreaks, i.e. clusters of infected individuals characterized by the same pathogen sequence.
Consider the probability of extinction of a cluster of identical pathogen sequences. If the mean number of offspring with identical sequences is lower than 1, this probability is equal to 1. Otherwise, it is lower than 1. Following Waxman and Nouvellet, we introduce $R_s$, the reproduction number associated with clusters of identical sequences that went extinct. In the following, we refer to $R_s$ as the subcritical reproduction number. The reproduction number $R$ and $R_s$ are linked by a relationship (*) that depends on the dispersion parameter $k$ of the offspring distribution. Figure S20 depicts how the subcritical reproduction number $R_s$ is impacted by the reproduction number $R$, the dispersion parameter $k$ and the probability $p$ that an infector and an infectee have the same consensus sequence. As Waxman and Nouvellet note, the subcritical reproduction number $R_s$ mirrors the supercritical one $R$. This means that inferring $R_s$ makes it possible to infer $R$, as there is a direct relationship between the two (Figure S20). The probability for a cluster of identical sequences to be of a given size conditional on extinction follows from (40, 41), and yields the probability (**) for a cluster of identical sequences to be of a given size. More specifically, we note that $R_s < 1/p$. In the specific situation where $R < 1/p$, we have $R = R_s$ and the cluster size distribution conditional on extinction coincides with the unconditional one.
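As a hedged sketch of how such a mapping can be obtained, assume that the number of offspring with identical sequences follows a negative binomial distribution with mean $pR$ and dispersion parameter $k$. The symbols $G$ (probability generating function), $q$ (extinction probability), $j$ (cluster size) and $r_j$ (cluster size probability) are notation introduced here for illustration and may differ from the authors' notation; incomplete sequencing of infections is ignored.

```latex
% Offspring generating function and extinction probability
G(s) = \left(1 + \frac{pR}{k}\,(1 - s)\right)^{-k},
\qquad
q = G(q) \quad \text{(smallest root in } [0, 1]\text{)}.

% Conditional on extinction, the offspring distribution has generating
% function G(qs)/q, which is again negative binomial with the same k:
\frac{G(qs)}{q} = \left(1 + \frac{pR_s}{k}\,(1 - s)\right)^{-k},
\qquad
R_s = R\, q^{\,1 + 1/k}.

% Finite clusters are exactly those that go extinct, hence
r_j = q \; r_j^{\mathrm{extinct}}, \qquad j = 1, 2, \dots
```

When $pR \le 1$, $q = 1$ and the expressions reduce to $R_s = R$ and $r_j = r_j^{\mathrm{extinct}}$, consistent with the statement above for reproduction numbers below the threshold of $1/p$.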

Inference from the size of clusters of identical sequences conditional on extinction
Assuming we have a dataset comprising the sizes of clusters of identical sequences that went extinct, we can hence infer the value of the subcritical reproduction number $R_s$ and the dispersion parameter $k$ by using the updated formula (**) for the probability that a cluster of identical sequences is of a given size in the derivation of the likelihood. Implementing this updated framework, we then obtained maximum likelihood estimates of $R_s$ and $k$ by imposing values of the reproduction number ranging between 0.01 and $1/p$ and values of the dispersion parameter ranging between 0.001 and 10.0.
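A minimal sketch of such an inference is given below. It assumes a negative binomial offspring distribution and uses the standard final-size distribution of a subcritical branching process started by a single case, with mean number of identical-sequence offspring $p \cdot R_s$. It also assumes that every infection in a cluster is observed, so it does not reproduce the handling of incomplete sequencing in the full framework. The function names, the grid search and the cluster counts are illustrative assumptions.

```python
import math


def log_cluster_size_pmf(j, mean_identical, k):
    """Log-probability that an extinct cluster has size j.

    mean_identical is the mean number of offspring with identical
    sequences (p * R_s, assumed lower than 1); k is the dispersion parameter.
    """
    ratio = mean_identical / k
    return (math.lgamma(k * j + j - 1) - math.lgamma(k * j) - math.lgamma(j + 1)
            + (j - 1) * math.log(ratio) - (k * j + j - 1) * math.log1p(ratio))


def log_likelihood(size_counts, r_s, k, p):
    """Log-likelihood of a {cluster size: number of clusters} dataset."""
    return sum(n * log_cluster_size_pmf(j, p * r_s, k) for j, n in size_counts.items())


def grid_mle(size_counts, p, r_grid, k_grid):
    """Grid-search maximum likelihood estimates of (R_s, k), with p * R_s < 1."""
    best = None
    for r_s in r_grid:
        if not 0.0 < p * r_s < 1.0:
            continue
        for k in k_grid:
            ll = log_likelihood(size_counts, r_s, k, p)
            if best is None or ll > best[0]:
                best = (ll, r_s, k)
    return best[1], best[2]


# Hypothetical dataset of extinct clusters: {size: number of clusters}
sizes = {1: 720, 2: 130, 3: 60, 4: 30, 5: 20, 8: 15, 12: 10, 20: 5}
p = 0.5
r_grid = [0.01 * i for i in range(1, 200)]                # 0.01 up to just below 1/p
k_grid = [0.001 * 10 ** (4 * i / 60) for i in range(61)]  # log-spaced, 0.001 to 10
r_s_hat, k_hat = grid_mle(sizes, p, r_grid, k_grid)
print(round(r_s_hat, 2), round(k_hat, 3))
```

A grid search is used only to keep the example dependency-free; a numerical optimizer with the same parameter bounds would serve the same purpose.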
We evaluated our inference framework on synthetic cluster data generated using a branching process with substitution (see main text). Clusters were simulated until reaching a maximum size of 10,000. We then considered that clusters that reached 10,000 had not gone extinct and applied our inference framework to the subset that went extinct (size < 10,000). Figures S21-S23 show that we were able to recover the expected value of the dispersion parameter and the subcritical reproduction number $R_s$.
Assuming prior knowledge on whether the reproduction number lies above or below the threshold of $1/p$, the estimated subcritical reproduction number can either directly be interpreted as the reproduction number of the outbreak (below the threshold) or mapped to a corresponding reproduction number higher than $1/p$ (using equation (*), Figure S20).
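The mapping between a reproduction number above the threshold and its subcritical counterpart can be computed numerically. The sketch below assumes a negative binomial offspring distribution, for which the extinction probability q solves q = G(q) (G being the generating function of the number of offspring with identical sequences) and the subcritical reproduction number equals R · q^(1 + 1/k). The function names and the bisection scheme are illustrative assumptions.

```python
def offspring_pgf(s, mean, k):
    """Generating function of a negative binomial with the given mean and dispersion k."""
    return (1.0 + mean / k * (1.0 - s)) ** (-k)


def extinction_probability(R, k, p, tol=1e-12):
    """Probability that a cluster of identical sequences goes extinct."""
    mean = p * R
    if mean <= 1.0:
        return 1.0
    lo, hi = 0.0, 1.0 - 1e-9
    if offspring_pgf(hi, mean, k) - hi >= 0.0:
        return 1.0  # numerically indistinguishable from the critical case
    while hi - lo > tol:  # bisection on G(q) - q, positive below the root
        mid = 0.5 * (lo + hi)
        if offspring_pgf(mid, mean, k) - mid > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def subcritical_reproduction_number(R, k, p):
    """Subcritical counterpart R_s of R (equal to R below the threshold 1/p)."""
    q = extinction_probability(R, k, p)
    return R * q ** (1.0 + 1.0 / k)


def supercritical_reproduction_number(r_s, k, p, n_iter=200):
    """Reproduction number above 1/p whose subcritical counterpart is r_s."""
    lo, hi = 1.0 / p, 2.0 / p
    while subcritical_reproduction_number(hi, k, p) >= r_s:
        hi *= 2.0  # R_s decreases with R above the threshold
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if subcritical_reproduction_number(mid, k, p) > r_s:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical example: R = 3, k = 0.1, p = 0.5 (threshold 1/p = 2)
r_s = subcritical_reproduction_number(3.0, 0.1, 0.5)
print(round(r_s, 3), round(supercritical_reproduction_number(r_s, 0.1, 0.5), 3))
```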

Challenges for the application to real-world data
In the previous paragraphs, we introduced an alternative approach to characterize the disease offspring distribution when the mean number of offspring with identical genomes is higher than 1.
By restricting the analysis to clusters of identical sequences that went extinct, we showed that we could accurately infer the reproduction number and the dispersion parameter.
However, we acknowledge that determining in practice whether a cluster of identical sequences has become extinct may be challenging. Furthermore, we assumed here that the epidemiological process under study was stationary (i.e. that the reproduction number and the dispersion parameter are constant throughout the study period). In practice, behaviour changes, the implementation of interventions or the depletion of the susceptible population as the epidemic progresses can modify the effective reproduction number. This is likely especially problematic for reproduction numbers greater than 1 (and by extension above the threshold of $1/p$). Overall, further work is warranted to estimate the offspring distribution's parameters above the threshold of $1/p$ from real-world data describing the size of clusters of identical sequences.

Figure S1: Dynamics of extinction for clusters of identical pathogen sequences. A. Proportion of clusters of identical sequences that go extinct as a function of the reproduction number R (x-axis) exploring different assumptions regarding the dispersion parameter k (colored lines) and the probability p that an infector and an infectee have the same consensus sequence. B. Mean number of generations until cluster extinction (among clusters that go extinct) as a function of the reproduction number R (x-axis) exploring different assumptions regarding the dispersion parameter k (colored lines) and the probability p that an infector and an infectee have the same consensus sequence. The vertical red dashed lines correspond to the inverse of the probability p that an infector and an infectee have the same consensus sequence.

Figure S2: Relative bias on the reproduction number R estimate when the reproduction number lies below the threshold of 1/p. For each true value of the reproduction number R (x-axis) and value of the probability p that an infector and an infectee have the same consensus sequence, the boxplot depicts the distribution of the relative bias across 100 simulations for different dataset sizes (colours). The relative bias is defined as $(R_{\mathrm{MLE}} - R_{\mathrm{true}})/R_{\mathrm{true}}$ where $R_{\mathrm{true}}$ is the true reproduction number used to generate synthetic cluster data and $R_{\mathrm{MLE}}$ our maximum likelihood estimate. The simulations were run assuming that 50% of infections were sequenced. The boxplots represent the 2.5%, 25%, 50%, 75% and 97.5% percentiles.

Figure S4: Relative bias on the dispersion parameter k estimate when the reproduction number lies above the threshold of 1/p. For each true value of the dispersion parameter k (x-axis) and value of the probability p that an infector and an infectee have the same consensus sequence, the boxplot depicts the distribution of the relative bias across 100 simulations for different dataset sizes (colours). The relative bias is defined as $(k_{\mathrm{MLE}} - k_{\mathrm{true}})/k_{\mathrm{true}}$ where $k_{\mathrm{true}}$ is the true dispersion parameter used to generate synthetic cluster data and $k_{\mathrm{MLE}}$ our maximum likelihood estimate. The simulations were run assuming that 50% of infections were sequenced and for a true reproduction number of 3.0. The y-axis was cropped at 9 to increase readability. The boxplots represent the 2.5%, 25%, 50%, 75% and 97.5% percentiles.

Figure S6: Impact of the proportion of infections sequenced on the relative bias on the reproduction number R estimate. Results are reported for a probability that an infector and an infectee have the same consensus sequence of 50% and a dispersion parameter value of 0.1. For each true value of the reproduction number R (x-axis) and different dataset sizes (different subplots), the boxplots depict the distribution of the relative bias across 100 simulations for different proportions of infections sequenced (colours). The relative bias is defined as $(R_{\mathrm{MLE}} - R_{\mathrm{true}})/R_{\mathrm{true}}$ where $R_{\mathrm{true}}$ is the true reproduction number used to generate synthetic cluster data and $R_{\mathrm{MLE}}$ our maximum likelihood estimate. The simulations were run assuming that 50% of infections were sequenced. The boxplots represent the 2.5%, 25%, 50%, 75% and 97.5% percentiles.

Figure S7: Relationship between the proportion of singletons among clusters analyzed and the relative bias on the reproduction number R estimate. Different assumptions regarding the proportion of infections sequenced (columns) and the size of the dataset on which the inference was run (rows) are explored. Points are coloured by true reproduction number value. Results are reported for a probability that an infector and an infectee have the same consensus sequence of 50% and a dispersion parameter value of 0.1. The relative bias is defined as $(R_{\mathrm{MLE}} - R_{\mathrm{true}})/R_{\mathrm{true}}$ where $R_{\mathrm{true}}$ is the true reproduction number used to generate synthetic cluster data and $R_{\mathrm{MLE}}$ our maximum likelihood estimate.

Figure S8: Sensitivity analysis exploring how R and k estimates for MERS and measles are impacted by assumptions regarding the probability p that an infector and an infectee have the same consensus sequence. Estimates of A. the reproduction number R and B. the dispersion parameter k for MERS. Estimates of C. the reproduction number R and D. the dispersion parameter k for measles during the 2017-2018 Italy outbreak.

Figure S9: Sensitivity analysis exploring how R and k estimates for SARS-CoV-2 in New Zealand are impacted by assumptions regarding the probability p that an infector and an infectee have the same consensus sequence. Estimates of A. the reproduction number R and B. the dispersion parameter k for SARS-CoV-2 in New Zealand.

Figure S10: Transmission advantage bias as a function of the true transmission advantage and varying the probability that an infector and an infectee have the same consensus sequence (rows) and the reproduction number of the non-variant $R_{NV}$ (columns). Each subplot corresponds to a given assumption regarding the probability that an infector and an infectee have the same consensus sequence and the reproduction number of the non-variant. In each subplot, the vertical dashed line corresponds to the limit from which the reproduction number of the variant $R_V$ reaches the threshold of 1/p. Vertical dashed lines before the 10% x-axis tick correspond to situations where the reproduction number of the non-variant $R_{NV}$ is also above the threshold of 1/p.

Figure S11: Impact of accounting for different genetic subpopulations on estimates of the dispersion parameter for different assumptions regarding the true dispersion parameter (rows) and different dataset sizes (columns). In each subplot, the horizontal dashed grey line corresponds to the true dispersion parameter value used to generate synthetic clusters of identical sequences. The boxplots summarize the 2.5%, 25%, 50%, 75% and 97.5% percentiles of maximum likelihood estimates obtained across 100 simulated datasets.

Figure S12: Size distribution of clusters of identical SARS-CoV-2 sequences in Washington state split by variant of interest. For each variant that we studied, we displayed the distribution of cluster sizes for the variant and the non-variant considering different time window lengths (1 to 5 weeks). The time windows begin when the cumulative number of variant sequences collected in Washington state reached 10 (see Table S8). We considered that clusters of identical sequences fell into the time window if they were first detected during that time window.

Figure S13: Sensitivity analysis varying our assumption regarding the fraction of infections detected as cases (different panels) on the p-values for variant transmission advantage in WA state. P-values over time since collection of 10 variant sequences for different SARS-CoV-2 variants during the COVID-19 pandemic in Washington state exploring different assumptions regarding the fraction of infections detected (columns). We considered maximum likelihood estimates (MLE) to be consistent with a variant transmission advantage if the estimated reproduction number of the variant was higher than that of the non-variant.

Figure S14: Comparison between estimates of p obtained from household transmission pair data and from assumptions regarding the evolutionary rate and the generation time. Vertical segments correspond to our uncertainty ranges around p estimates. Horizontal segments correspond to 95% binomial confidence intervals around the proportions obtained from transmission pair data.

Figure S15: Comparison of the distribution of the time to occurrence of a first substitution and the time to transmission for different pathogens. For each pathogen, we additionally report the estimated probability p that transmission occurs before substitution. Here, we depict the simulations corresponding to the central estimate for the substitution rate and the generation time.

Figure S16: Difference between identical sequences obtained from the distance matrix and the reconstructed clusters of identical sequences for MERS-CoV sequences. Each vertex corresponds to a MERS-CoV sequence. Vertices are connected if their pairwise distance is equal to 0. Vertices have the same colour if they were allocated to the same cluster of identical sequences. The clusters for which there is a disagreement between the distance matrix and the cluster allocation (i.e. when some identical sequences are not in the same cluster) are circled. For clarity, we only displayed sequences with at least one other identical sequence in the pairwise distance matrix.

Figure S17: Relationship between the frequency of a mutant in the donor (infector) and the recipient (infectee) of a transmission pair. Different assumptions regarding the transmission bottleneck size (different columns) and the generation time (in days; different rows) are explored.

Figure S19: Nucleotide diversity for different infectious durations and transmission bottleneck sizes. Nucleotide diversity was computed as the mean within-host SNP difference per site. Boxplots indicate the 2.5%, 25%, 50%, 75% and 97.5% quantiles. Vertical whiskers going to the bottom of the plot correspond to 0 values.

Figure S20: Relationship between the reproduction number R and the subcritical reproduction number $R_s$ for different probabilities p that an infector and an infectee have the same consensus sequence and different values of the dispersion parameter k. Colored lines correspond to reproduction numbers lying above the threshold of 1/p. The dashed grey lines correspond to reproduction numbers lying below the reproduction number threshold (for which the reproduction number is equal to the subcritical reproduction number).

Figure S21: Estimates of the dispersion parameter k using the size of clusters that went extinct. Estimates are reported as a function of the true reproduction number R used to generate synthetic clusters. Point estimates correspond to maximum likelihood estimates and vertical segments to 95% likelihood profile confidence intervals obtained from analyzing 1,000 synthetic clusters of identical sequences.

Figure S22: Dispersion parameter estimates as a function of the true reproduction number when using the inference framework relying on cluster size distribution conditional on cluster extinction. A. Using a dataset comprising 1,000 clusters of identical sequences. B. Using a dataset comprising 5,000 clusters of identical sequences. Each boxplot represents the distribution of k maximum likelihood estimates across 100 simulations (2.5%, 25%, 50%, 75% and 97.5% percentiles). We explored different values of the true dispersion parameter k (boxplot contour colours) and different values for the probability p that an infector and an infectee have the same consensus sequence (boxplot filling).

Figure S23: Subcritical reproduction number $R_s$ estimates as a function of the true reproduction number when using the inference framework relying on cluster size distribution conditional on cluster extinction. Each boxplot represents the distribution of $R_s$ maximum likelihood estimates across 100 simulations (2.5%, 25%, 50%, 75% and 97.5% percentiles). Results are displayed for a true dispersion parameter of 0.1 and running the inference on 1,000 clusters of identical sequences. Each panel corresponds to a different assumption regarding the probability p that an infector and an infectee have the same consensus sequence. The horizontal dashed segments correspond to the true value of $R_s$ (associated with the true reproduction number and the true dispersion parameter).

Table S3: Parameter estimates for MERS. Maximum likelihood estimates are reported along with 50% and 95% confidence intervals (CI).

Table S4: Parameter estimates for measles. Maximum likelihood estimates are reported along with 50% and 95% confidence intervals (CI).

Table S5: Parameter estimates for SARS-CoV-2 in New Zealand under our central estimate for the probability p that transmission occurs before substitution. Maximum likelihood estimates are reported along with 50% and 95% confidence intervals (CI).

Table S8: Definitions of the study periods for the Washington state SARS-CoV-2 analysis. Dates are reported in a YYYY-MM-DD format.

Table S9: GenBank accession numbers for measles sequences used in the analysis. All sequences were obtained from Pacenti et al. using the Nextstrain measles workflow (4, 7).