National population mapping from sparse survey data: A hierarchical Bayesian modeling framework to account for uncertainty

Significance High-resolution population estimates are essential for government planning, development projects, and public health campaigns, but countries where this information is most needed are often where recent national census data are least available. We present a modeling framework that combines recent neighborhood-scale microcensus surveys with national-scale data from satellite images and digital maps to estimate population sizes for every 100-m grid square nationally. We present a case study from Nigeria where population estimates with national coverage were produced using household survey data from 1,141 locations. This work represents a significant step toward achieving high-resolution population estimates with national coverage from sparse population data while providing reliable estimates of uncertainty at any spatial scale.

Legends for Datasets S1 to S3 SI References

Supplementary Information Text
We have included detailed model results to give readers additional context to interpret the model results and the inputs needed to independently replicate the model.
Dataset S1 provides the input data to the JAGS model and the Dataset S2 provides source code for the model (JAGS language). These datasets can be used to independently run the model for further assessments. They could also be used as a starting point for modifying the model to apply it in new contexts.
The MCMC traceplots (Dataset S3) show the marginal posterior distributions for all parameters in the model. The traceplots do not show indications of influential priors (e.g. truncated tails of posteriors) or identifiability issues (e.g. parallel unconverged MCMC chains for a parameter) that could have arisen from interactions between the hierarchical intercepts and the hierarchical variances.
We included a breakdown of model assessments by settlement type in Table S1. It is clear from these results that the model performs differently in different settlement types, but even where the model is imprecise, the 95% prediction intervals include the observed values about 95% of the time. This indicates that the error structure is robust and it emphasizes the importance of accounting for uncertainty in population estimates because the mean predictions are often imprecise at small spatial scales, where population densities vary significantly across space, and in data-poor regions. Table S2 shows how the observed population compared to the Bayesian posterior prediction at each location. These results suggest that the model's variance structure is adequately accounting for uncertainty in population estimates because the expected proportions of observations fall outside the prediction intervals. For example, about 5% of observations were less than the 5 th quantile of the posterior prediction for the location. This trend was evident for insample and out-of-sample observations for predicted population densities and counts.
To ensure that our model structure (particularly the hierarchical random intercept) was accounting for spatial autocorrelation, we assessed Moran's I of the model residuals ( Figure S1) for ranges up to 100 km. We found no spatial autocorrelation except one significant value at the smallest range size for the raw residuals of total population size, but there were no significant Moran's I values for standardized residuals or when population densities were treated as the response variable (raw or standardized residuals). We interpreted these results as indicating that the model structure was adequately accounting for spatial autocorrelation, leaving little spatial structure in the residuals.

Fig. S1
. Spatial correlograms showing Moran's I statistics calculated at ranges from 0 to 100 km to assess spatial autocorrelation of model residuals. Red dots indicate a statistically significant Moran's I statistic (P<0.05). Correlograms were constructed from residuals in population size (N) and population density (D) using raw residuals as well as standardized residuals. Table S1. Model fit assessment for each settlement type separately. "Response" indicates the response variable being assessed (N=population count, D=population density). "Sample" indicates whether the assessment was based on in-sample or out-of-sample observations. "Type" indicates the settlement type. "n" indicates the sample size. "Obs.mean" is the mean of observed data and "Obs.sd" is the standard deviation. "Obs.inCI" is the proportion of the observations that fell within the 95% prediction intervals. "Bias" is the mean of the residuals. "Impr" is imprecision calculated as the standard deviation of residuals. "Inac" is inaccuracy calculated as the mean of the absolute residuals. "r2" is the rsquared value calculated as the squared Pearson correlation coefficient. Residuals were calculated as the mean of the posterior prediction minus the observed value.  Table S2. Proportion of observed data points that were less than the reported quantiles from the Bayesian posterior predictions. "Response" indicates the response variable being assessed (N=population count, D=population density). "Sample" indicates whether the assessment was based on in-sample or out-of-sample observations. The remaining columns show the proportion of microcensus locations where the observed data that were less than the reported quantile of the posterior prediction for the location. For example, the "Q5" column shows the proportion of microcensus locations where the observed data were less than the 5 th quantile of the posterior prediction for the location. Dataset S1 (separate file). Input data for the hierarchical Bayesian population model. "N" is the observed number of people in each survey cluster. "A" is the total settled area in the survey cluster. "type" is the settlement type. "region" is a region identifier. "state" is a state identifier. "local" is an identifier for local government area. "x1" is the scaled average of WorldPop Global population estimates in the cluster. "x2" is scaled school density within a 1km radius. "x3" is the scaled average household size. "x4" is scaled settled area within a 1 km radius. "x5" is scaled residential area with a 1 km radius. "x6" is scaled non-residential area within a 1 km radius.

Response
Dataset S2 (separate file). A text file containing JAGS code for the model.