Cystic fibrosis carriers are at increased risk for a wide range of cystic fibrosis-related conditions

Significance
Cystic fibrosis (CF) carriers are at increased risk for most of the conditions that commonly occur in people with CF. Given that there are more than 10 million CF carriers in the United States alone, the morbidity attributable to the CF carrier state is likely substantial. Thus, identifying CF carriers may aid in the prevention, diagnosis, and treatment of several common and uncommon disorders.

To identify cystic fibrosis-related conditions, we performed a literature search designed to find literature reviews describing conditions and symptoms related to CF. The following search was used to identify a total of 1,364 review papers in PubMed. From this list, all abstracts were reviewed. We excluded papers focused on treatment, screening, management or diagnosis of CF and refined this list to a total of 122 papers to review in detail.
Our goal was to investigate conditions related to CFTR mutations. Thus, we excluded some conditions for the following reasons. First, we excluded conditions indirectly related to CF. For example, we excluded urinary incontinence and rectal prolapse as they are most likely related to coughing and constipation, respectively, and not directly related to CFTR mutations. We excluded pulmonary hypertension for the same reason. Similarly, we excluded depression and related conditions as they are more common among patients with chronic illnesses in general. Second, we excluded conditions that we considered to be a result of CF-related care or exposure to healthcare: Clostridium difficile colitis, acute and chronic kidney injury, fibrosing colonopathy, and the MRSA-carrier state. Third, we excluded conditions without robust ICD-9-CM/ICD-10-CM codes (e.g., Burkholderia cepacia complex, small intestine bacterial overgrowth). Finally, we excluded female infertility from our condition list. The risk was reported to be elevated among CF patients, and we found increased risk among CF carriers. However, we think that women with infertility may be more likely to undergo CF screening, and thus to be identified as CF carriers as part of their prenatal care.

Methods S2: Methods used to simulate the effect of cases of CF misclassified as CF carriers.
We performed a simulation analysis to determine whether the results observed in our study could be explained by cases of CF that were misclassified as CF carriers because of rare CFTR mutations not detectable by standard screening panels. The objective of our simulation analysis was to test the null hypothesis that the estimated effects can be explained by such misclassification against the alternative hypothesis that they cannot.
We performed three basic simulation analyses. First, we used a bootstrapping approach to compute p-values corresponding to a misclassification rate that would be expected based on population prevalence of CF and CFTR mutations, and screening accuracy for standard CFTR panels. A description of how expected misclassification rates were calculated can be found below. For each condition, we performed 100,000 simulation trials and computed the number of trials that produced odds ratios greater than those reported in Figure 2. Results of this simulation analysis can be found in Table S3.
Second, we estimated the average misclassification rate and number of misclassified cases that would be necessary to generate the results reported in our study. For each condition, we performed a grid search over misclassification rates to identify the rate that, on average, produced odds ratios equal to those reported in Figure 2. Specifically, we varied the misclassification rate between 0 and 1, in 0.001 increments. For each value, we ran 2,000 simulated trials and computed the average estimated odds ratio. We then used these values to estimate the misclassification rate needed to generate our results. Results of this simulation analysis can also be found in Table S3.
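The grid search described above can be sketched as follows. This is a minimal illustration, not the study's code: `expected_or` is a hypothetical closed-form stand-in for the mean odds ratio over 2,000 simulated trials, its prevalences (10% in controls, 50% in CF cases) are made-up values, and returning the smallest grid rate whose mean odds ratio reaches the target is an assumption about how "equal to" was operationalized.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def expected_or(rate, p_control=0.10, p_cf=0.50):
    """Toy stand-in for the mean simulated odds ratio at a given rate.

    A fraction `rate` of the carrier cohort has the CF prevalence and the
    rest has the control prevalence (both prevalences are illustrative).
    """
    p_mix = (1 - rate) * p_control + rate * p_cf
    return odds(p_mix) / odds(p_control)

def rate_for_target(target_or, mean_or_at=expected_or, step=0.001):
    """Scan rates 0, step, 2*step, ..., 1 and return the first rate whose
    mean odds ratio reaches target_or, or None if no grid rate does."""
    n_steps = round(1 / step)
    for i in range(n_steps + 1):
        rate = i * step
        if mean_or_at(rate) >= target_or:
            return rate
    return None

if __name__ == "__main__":
    print(rate_for_target(2.0))
```

In a real analysis, `mean_or_at` would run the simulated trials for the given rate and average the resulting odds ratios.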
Third, we performed a simulation to test if the results we obtained across all conditions could be explained by misclassification. Specifically, we estimated the total number of conditions, out of the 59 evaluated, that we would expect to generate similar estimates given expected misclassification rates. For each simulation trial, we estimated the odds ratio across all of the 59 conditions evaluated. We then computed the total number of conditions for which the estimated odds ratios were greater than or equal to the values reported in Figure 2. We performed one million trials of this final simulation, for both the upper and lower bound on the expected misclassification rates (described below). Results of this simulation are summarized in Table S4.
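As a rough illustration of how this third simulation might be summarized, the sketch below counts, for one trial, the conditions whose simulated odds ratio is at least the reported one, and then summarizes the counts across trials by their maximum and a percentile. The function names and the dictionary layout (condition name mapped to odds ratio) are assumptions made for illustration.

```python
import math

def count_reproduced(simulated_ors, observed_ors):
    """Number of conditions whose simulated OR is >= the observed OR."""
    return sum(simulated_ors[c] >= observed_ors[c] for c in observed_ors)

def summarize(counts, pct=99):
    """Return (maximum count, smallest count v such that at least pct% of
    trials produced v or fewer conditions)."""
    ordered = sorted(counts)
    idx = max(math.ceil(pct * len(ordered) / 100) - 1, 0)
    return ordered[-1], ordered[idx]

if __name__ == "__main__":
    observed = {"bronchiectasis": 1.5, "pancreatitis": 1.3}   # made-up values
    trial = {"bronchiectasis": 1.6, "pancreatitis": 1.1}      # made-up values
    print(count_reproduced(trial, observed))
```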
Simulation Algorithm. The following steps summarize the algorithm that was used for each of the simulations described above. For a given condition, misclassification rate, and number of trials n, the following was performed:
1. Take a misclassification rate in [0,1], representing the fraction of CF carriers we expect to be misclassified (i.e., these are actually cases of CF), and create a simulated CF carrier cohort containing misclassified cases of CF and CF carriers that are identical to control patients.
   a. Misclassified CF cases: Randomly replace that fraction of the CF carriers with known CF cases, matched on age, sex and enrollment time.
   b. Simulated CF carriers with no effect: Randomly replace the remaining fraction (1 minus the misclassification rate) of the CF carriers with control cases (without CF or CF-carrier markers), drawn randomly from the cohort of patients not included in the study population and matched on age, sex and enrollment time.
2. Compute the odds ratio for a given condition between the original control cohort and the simulated CF carrier cohort.
3. Repeat n times per value of the misclassification rate.
4. Compute the percentage of times that the simulated CF carrier cohort produced an odds ratio "as extreme" as the odds ratio obtained in our original cohort (p-value) or return the average estimated odds ratio (and corresponding percentiles) across all trials.
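The steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the study's implementation: each patient is reduced to a True/False flag for the condition, the age/sex/enrollment matching is omitted, draws are made with replacement, and the 0.5 added to each cell of the odds ratio is an assumption (the source does not describe zero-cell handling). All function names are hypothetical.

```python
import random

def odds_ratio(exposed, controls):
    """Odds ratio of a condition between two cohorts of True/False flags.

    Adds 0.5 to each cell so the ratio stays defined when a cell is zero
    (an assumption, not taken from the source).
    """
    a = sum(exposed) + 0.5
    b = len(exposed) - sum(exposed) + 0.5
    c = sum(controls) + 0.5
    d = len(controls) - sum(controls) + 0.5
    return (a * d) / (b * c)

def simulate_cohort(n_carriers, rate, cf_pool, noncarrier_pool, rng):
    """Step 1: mix misclassified CF cases with effect-free carriers."""
    n_cf = round(rate * n_carriers)
    cohort = rng.choices(cf_pool, k=n_cf)                        # step 1a
    cohort += rng.choices(noncarrier_pool, k=n_carriers - n_cf)  # step 1b
    return cohort

def simulated_p_value(observed_or, controls, n_carriers, rate,
                      cf_pool, noncarrier_pool, n_trials, seed=0):
    """Steps 2-4: share of trials with an odds ratio >= the observed one."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        cohort = simulate_cohort(n_carriers, rate, cf_pool,
                                 noncarrier_pool, rng)
        if odds_ratio(cohort, controls) >= observed_or:
            hits += 1
    return hits / n_trials
```

Returning the mean of the per-trial odds ratios instead of the share of hits would give the alternative output described in step 4.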
Note on drawing matched CF cases: In step 1a, some of the age-sex-enrollment strata contained fewer CF cases than CF carriers, and some contained no CF cases. For older CF carriers (e.g., age > 40), it may be harder to find exact age-sex-enrollment matches because of the decreased life expectancy associated with CF. Thus, for any stratum where fewer CF cases could be identified than the number of CF carriers, we implemented the following strategy to identify CF cases that were as closely matched as possible. For a given stratum, we performed the following:
1. If the number of CF cases was greater than or equal to the number of CF carriers in the stratum, we drew CF cases without replacement for step 1a of the above algorithm.
2. If the number of CF cases was less than the number of CF carriers but greater than 0, we drew CF cases without replacement until all CF cases had been drawn, and with replacement thereafter. For example, if a given stratum contained 15 CF carriers and only 8 CF cases, and during a single simulation trial 9 CF carriers were selected to be replaced with CF cases, we drew the 9th CF case with replacement.
3. If the number of CF cases in the stratum was 0, we looked for matches in the following order:
   a. First, we relaxed the constraint that CF cases have a CF diagnosis in at least 2 outpatient visits and instead looked for exact age/sex/enrollment matches among all enrollees who were diagnosed with CF during at least one inpatient or outpatient visit. If at least one exact match could be found, we proceeded according to step 1 or 2, as outlined above. If no exact matches could be found, we proceeded to the following step.
   b. We looked for CF cases, among enrollees with any CF diagnosis, that were closest in age but still had the same sex and enrollment time. This final criterion allowed us to identify matches for all enrollees using an average age threshold of 1.6 years.
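The drawing rules above might look like the following sketch. It is illustrative only: the function names, the list-based pools, and the `(age, patient_id)` tuples are assumptions, and the relaxed-diagnosis lookup in step 3a is left to the caller.

```python
import random

def draw_cf_matches(n_needed, exact_pool, rng):
    """Rules 1 and 2: draw without replacement until the stratum's CF cases
    are exhausted, then draw any remaining cases with replacement."""
    if not exact_pool:
        raise ValueError("empty stratum: use the relaxed matching of rule 3")
    shuffled = rng.sample(exact_pool, len(exact_pool))
    drawn = shuffled[:n_needed]
    while len(drawn) < n_needed:
        drawn.append(rng.choice(exact_pool))
    return drawn

def nearest_age_matches(carrier_age, candidates):
    """Rule 3b: among candidates with the same sex and enrollment time,
    given as (age, patient_id) tuples, keep those closest in age."""
    best = min(abs(age - carrier_age) for age, _ in candidates)
    return [pid for age, pid in candidates if abs(age - carrier_age) == best]

if __name__ == "__main__":
    rng = random.Random(0)
    # 8 CF cases available, 9 requested: the 9th is drawn with replacement.
    print(draw_cf_matches(9, list(range(8)), rng))
```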
Simulation Parameters. We assumed that CF screening panels typically capture 80-90% of genetic mutations (123, 124). Thus, a CF carrier would be correctly identified with probability 80-90%. Similarly, a CF case would be correctly identified with probability 64-81% (i.e., 0.8^2 to 0.9^2) but would be labeled as a CF carrier with probability 18-32% (i.e., 2*0.9*0.1 to 2*0.8*0.2). We also assumed that the likelihood of being a CF carrier was approximately 1/37 while the probability of CF was 1/2500 (125). Thus, using Bayes' theorem, we would expect approximately 0.295-0.590% of our observed CF carrier cohort, or approximately 58 to 117 enrollees, to be misclassified patients with CF. Note: although these values depend on demographic information not contained in our enrollment data (e.g., race and ethnicity), previous estimates of carrier risk following negative test results have been reported between 1/380 (0.263%) for Ashkenazi Jewish populations and 1/170 (0.588%) for African American populations (123). Thus, the parameter values that we use to bound our simulation analysis, namely 0.295-0.590%, encompass nearly the entire range of expected misclassification rates for individuals of different races or ethnicities.
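The calculated rate of expected misclassification can be summarized by the following formulas, where "1 detected" denotes the event that screening detects exactly one CFTR mutation:
$P(\text{1 detected}) = P(\text{carrier}) \cdot P(\text{1 detected} \mid \text{carrier}) + P(\text{CF}) \cdot P(\text{1 detected} \mid \text{CF})$
$P(\text{CF} \mid \text{1 detected}) = P(\text{CF}) \cdot P(\text{1 detected} \mid \text{CF}) \,/\, P(\text{1 detected})$
Thus, for a screening mutation detection rate of 80% we would expect a misclassification rate of:
$P(\text{CF} \mid \text{1 detected}) = \frac{(1/2500)(0.32)}{(1/37)(0.8) + (1/2500)(0.32)} \approx 0.0059$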
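The Bayes' theorem calculation described in the Simulation Parameters can be checked numerically. The sketch below uses only the values stated in the text (carrier probability 1/37, CF probability 1/2500, detection rates of 80% and 90%); the function name and argument names are illustrative.

```python
def misclassification_rate(detection_rate, p_carrier=1 / 37, p_cf=1 / 2500):
    """Probability that a person with exactly one detected mutation has CF."""
    # A carrier has one mutation, which is detected with probability d.
    p_one_given_carrier = detection_rate
    # A CF case has two mutations; exactly one is detected with
    # probability 2 * d * (1 - d).
    p_one_given_cf = 2 * detection_rate * (1 - detection_rate)
    # Total probability of exactly one detected mutation.
    p_one = p_carrier * p_one_given_carrier + p_cf * p_one_given_cf
    return p_cf * p_one_given_cf / p_one

if __name__ == "__main__":
    for d in (0.8, 0.9):
        print(f"detection rate {d:.0%}: {misclassification_rate(d):.4%}")
```

With these inputs the function returns roughly 0.59% at an 80% detection rate and roughly 0.295% at a 90% detection rate, in line with the bounds quoted in the text.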
Our primary analysis involves the testing of multiple hypotheses across various CF-related conditions; thus, we performed a sensitivity analysis for the number of estimates that might be attributable to false discovery. Because many of the conditions and organ systems that were analyzed are inter-related, we used a simulation analysis to estimate an empirical rate of false discovery. Similar to the analysis described in Methods S2, we performed this analysis by building multiple simulated cohorts of non-carriers and then repeating our study analysis. For a given simulation trial, we first replaced each carrier with a randomly drawn non-carrier with the same sex, enrollment period and age. We then re-drew the matched control cohort, using the same criteria described in the study. Finally, we repeated our primary prevalence analysis and computed the total number of conditions that had p-values less than or equal to those reported in Table 2. We performed 10,000 trials for this simulation, and results are summarized in Table S5.
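One way to tabulate the outcome of such a simulation is sketched below: for each trial, count the conditions whose simulated p-value is at most the study p-value, then count how many trials produced each total. The function name and the dictionary layout (condition name mapped to p-value) are assumptions for illustration, and the p-values in the usage example are made up.

```python
from collections import Counter

def tabulate_false_discoveries(trial_p_values, study_p_values):
    """Map 'number of conditions with p <= study value' to number of trials."""
    table = Counter()
    for trial in trial_p_values:
        n_hits = sum(trial[c] <= study_p_values[c] for c in study_p_values)
        table[n_hits] += 1
    return dict(sorted(table.items()))

if __name__ == "__main__":
    study = {"asthma": 0.001, "pancreatitis": 0.01}   # made-up values
    trials = [
        {"asthma": 0.40, "pancreatitis": 0.25},
        {"asthma": 0.0005, "pancreatitis": 0.60},
        {"asthma": 0.70, "pancreatitis": 0.90},
    ]
    print(tabulate_false_discoveries(trials, study))
```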

Methods S4: Methods used to identify and analyze validation carrier cohort.
We first identified all children with a diagnosis of CF using the diagnosis codes reported in Table S1. In households with multiple children with CF, we used the first child diagnosed with CF as the index diagnosis. We also eliminated families where either parent was diagnosed with CF. Next, we identified mothers (female enrollees listed as either the primary beneficiary or spouse) within the child's family. To better ensure genetic maternity, we eliminated all mothers whose observation period did not overlap with the child's birth. Finally, we identified the point where the child, or any child, was first diagnosed with CF and then truncated the mother's observation period to the time from her first enrollment to the point where her first child with CF was diagnosed.

Selection of matched controls.
In order to identify matched controls in the most consistent manner possible, we first matched all CF carriers (i.e., mothers) to controls based on sex, months of total enrollment and age over the entire study period. However, because we restricted analysis to the period before the child was diagnosed with CF, we truncated the observation period for both CF carriers and matched controls to the same time span prior to the CF diagnosis. For example, suppose a carrier mother was followed for 24 months, but the child's first CF diagnosis occurred 9 months after the start of the enrollment period. We started by matching this carrier to 5 controls that could be followed for 24 months. We then truncated the observation period for both the carrier and matched controls to the first 9 months of enrollment. We opted to perform matching on the full enrollment period followed by truncation, rather than matching on the truncated period in order to control for possible differences in individuals followed for different lengths of time. For example, enrollees with shorter enrollment windows may differ in meaningful ways from enrollees with longer enrollment periods.
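The match-then-truncate order described above can be illustrated with a short sketch. It is a simplification under stated assumptions: enrollees are plain dictionaries with hypothetical keys `sex`, `age`, and `months_enrolled`, matching is exact on all three, and the first five matches are taken rather than a random draw.

```python
def match_then_truncate(carrier, candidates, months_to_dx, n_controls=5):
    """Match on the full enrollment period, then truncate everyone's
    observation window to the months before the child's CF diagnosis."""
    keys = ("sex", "age", "months_enrolled")
    matches = [c for c in candidates
               if all(c[k] == carrier[k] for k in keys)][:n_controls]
    truncated_carrier = {**carrier, "months_observed": months_to_dx}
    truncated_controls = [{**c, "months_observed": months_to_dx}
                          for c in matches]
    return truncated_carrier, truncated_controls

if __name__ == "__main__":
    # Example from the text: 24 months of enrollment, diagnosis at month 9.
    mother = {"sex": "F", "age": 31, "months_enrolled": 24}
    pool = [{"sex": "F", "age": 31, "months_enrolled": 24} for _ in range(7)]
    pool.append({"sex": "F", "age": 31, "months_enrolled": 9})
    carrier, controls = match_then_truncate(mother, pool, months_to_dx=9)
    print(len(controls), carrier["months_observed"])
```

Note that the enrollee with only 9 months of enrollment is not selected, even though she could cover the truncated window; this is the distinction the text draws between matching on the full period and matching on the truncated one.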

Conditions excluded from analysis:
A number of the conditions we analyzed are not applicable to adults or women. Thus, we excluded these conditions from analysis in our validation cohort. These exclusions were confirmed by our panel of CF experts who selected the original list of conditions for analysis. Male infertility was excluded due to enrollee sex. Meconium peritonitis, meconium obstruction, neonatal jaundice, congenital cystic lung, newborn respiratory failure, congenital pneumonia, and childhood failure to thrive were excluded as these conditions apply only to newborns or children.

The following International Classification of Diseases (ICD) diagnosis codes were used to identify CF carriers and subjects with CF. These codes were also used to identify households with potential prevalence of CFTR mutations. Note: X indicates that all nested sub-codes were used.

P-values correspond to the null hypothesis summarized in Methods S2, namely that the estimated effect can be explained by misclassification. P-values are computed for misclassification rates of 0.00295 and 0.00590. The percent and total number of subjects that would need to be misclassified, in order to obtain the estimates reported in Figure 2, on average, are also reported. Across all conditions, on average, 4,153 CF carriers would need to represent misclassified subjects with CF (median 3,055) in order to obtain estimates similar to ours. Note: only misclassification rates of up to 10,000 carriers (i.e., 50.5% misclassification) were simulated.

We performed one million simulation trials using both of the misclassification rates described. The number of conditions (out of 59) for which the simulated odds ratio was greater than the results reported in Figure 2 is summarized across the million trials. For the lower misclassification rate, there were never more than 14 conditions for which we could obtain simulated results, and 99% of the trials returned 7 or fewer conditions with similar estimates to those in Figure 2. For the greater misclassification rate, there were never more than 17 conditions for which we could obtain simulated results, and 99% of the trials returned 10 or fewer conditions with similar estimates to those in Figure 2. These findings suggest the results across all conditions in Figure 2 are almost certainly not attributable to misclassification bias.

We used a simulation analysis to estimate the potential for false discovery associated with multiple hypothesis testing. For each simulation trial, we drew simulated carrier and matched control cohorts using only non-carrier enrollees. We then computed the number of conditions that resulted in p-values ≤ those obtained in our study. Below, the number of conditions that resulted in similar significance levels is reported across simulation trials.

CF carriers with procedure codes for chloride sweat testing (CPT code 89230) or expanded CF screening panels (CPT codes 81221, 81222, or 81223) were excluded for this sensitivity analysis. A total of 1,276 CF carriers were excluded, based on evidence that CF was suspected prior to genetic screening. By removing CF carriers with more severe disease presentations (i.e., suspected of having CF), results using the reduced cohort are intentionally biased towards the null hypothesis, especially for those conditions often attributed to CF. Results using the primary study cohort and the reduced cohort are presented below.

Together, these results suggest that ascertainment bias cannot explain our general findings. For some of the rare conditions that would likely lead to suspicion of CF (e.g., pancreatic steatorrhea, hypertrophic osteoarthropathy/clubbing, meconium peritonitis, or congenital pneumonia) statistical significance fell below a 0.05% threshold with the reduced cohort. However, most results remained significant, and the estimated effects for both cohorts were nearly identical for the vast majority of the conditions analyzed. Moreover, many conditions that are highly characteristic of CF (e.g., aspergillosis, bronchiectasis, recurrent pneumonia, and pseudomonas or nontuberculous mycobacterial infection) remained highly significant.