Combining disparate data sources for improved poverty prediction and mapping

Significance
Poverty maps at the finest spatial resolution are essential for improved diagnosis and policy planning, especially in view of the Sustainable Development Goals. “Big Data” sources like call data records and satellite imagery have shown promise in providing intercensal statistics. This study outlines a computational framework to efficiently combine disparate data sources, such as environmental data and mobile data, to provide more accurate predictions of poverty and its individual dimensions for the finest spatial microregions in Senegal. These predictions are validated using the concurrent census data.

More than 330 million people are still living in extreme poverty in Africa (1). Consequently, the goal to "eradicate extreme poverty for all people everywhere by 2030" tops the list of the 17 Sustainable Development Goals adopted by world leaders at the United Nations summit in September 2015. The lack of good-quality and fine-grained data to assess poverty regularly features in discussions of the development agenda for Africa (2,3). Timely measurement and availability of data are vital in ending poverty.
Regardless of the nature of the strategies used to reduce poverty, governments and development agencies need a baseline depiction. Poverty maps provide such a spatial distribution of the socioeconomic deprivations and help policy makers assess the impact of interventions. For efficient targeting of policies at microregions and specific demographics, poverty maps should be made available at the finest administrative unit of planning. Also, these values should be disaggregated into individual dimensions of poverty, like deprivations in education, standard of living, health, and so forth (4).
Currently, the most reliable way to estimate poverty is through intensive socioeconomic household surveys. However, this approach is costly and time consuming and can only be realistically carried out for a small sample of households. The extrapolation of the local poverty estimation to a larger scale is traditionally done by exploiting links between census (wide area) and survey (smaller area coverage) data through small area estimation methods (5,6). These techniques depend on the timely availability of census, which is typically collected every 10 y and whose analysis is delayed for poorer economies by years, making timely updates of poverty challenging.
Recently, there has been a growing interest in realizing the potential of "Big Data" to understand societal development in Africa. However, most current studies are limited to using single source datasets, such as mobile phone data (7) or satellite imagery (8). Since poverty is a complex phenomenon, understanding it using multiple lenses obtained from diverse datasets will help to chart more accurate maps for poverty.
Several studies highlight that significant spatial variation of poverty may be due to a variety of geographic factors, including agrometeorological conditions, accessibility and proximity to markets, access to land, and so forth (9, 10) (see Table S3). Earth Observation Satellites collect data on metrics such as nighttime lights, vegetation cover, and meteorological conditions. The unique features of such datasets are their global coverage, high revisit capability, and free availability. A complementary resource lies in Geographic Information Systems (GIS) analysis. In particular, proximity to important services (schools, hospitals) and density of infrastructure (such as roads) are all factors that might contribute to alleviating poverty (11).
While satellite and GIS data are apt to observe and understand the availability of and access to natural resources and manmade structures, they lack information about population structure, especially the socioeconomic ties, cultural interactions, and micro- and macrobehavior that are essential to understanding poverty. One way to study societal interactions is provided by the widespread use of digital technologies (12). The Internet is still finding ground in sub-Saharan Africa. However, mobile phones are a prevalent technology, with adoption rates of more than 70%, even with 43% of the population living in abject poverty (13). Such widespread use of mobile phones generates an unprecedented volume of data called call data records (CDRs). CDRs capture how, when, where, and with whom individuals communicate. These data, traditionally used by the telecommunication companies for billing purposes, capture both micro- and macropatterns of human interaction, while preserving individual anonymity via spatial and temporal aggregation.
Poverty has traditionally been measured in one dimension, usually income or consumption, called income poverty. Another internationally comparable measure is the Global Multidimensional Poverty Index (MPI), which is used in this study. Global MPI is a composite of 10 indicators across three critical dimensions: education (years of schooling, school enrollment), health (malnutrition, child mortality), and standard of living conditions (see Global MPI). Throughout the paper, "poverty" refers to the Global MPI, and "dimensions" refers to education, health, and standard of living. MPI is calculated as the product of the incidence or headcount of poverty (H) and the average intensity (A) across the poor, that is, MPI = H × A. H is the proportion of the population that is multidimensionally poor. A is the average proportion of indicators in which poor people are deprived.
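As a small worked illustration of the product MPI = H × A, the sketch below computes the three quantities from per-person deprivation scores. The function name, the toy scores, and the 1/3 poverty cutoff are assumptions of this sketch (the cutoff is the one conventionally used for the Global MPI); none of them is taken from the data of this study.

```python
def mpi_from_scores(scores, cutoff=1 / 3):
    """Return (H, A, MPI) from per-person deprivation scores in [0, 1].

    A person is counted as multidimensionally poor when the score
    (share of weighted indicators in which the person is deprived)
    reaches the cutoff. The cutoff value is an assumption of this sketch.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    poor = [s for s in scores if s >= cutoff]
    headcount = len(poor) / len(scores)                  # H: share of poor people
    intensity = sum(poor) / len(poor) if poor else 0.0   # A: mean score among the poor
    return headcount, intensity, headcount * intensity


# Toy example: four people, three of whom are above the cutoff.
H, A, MPI = mpi_from_scores([0.0, 0.5, 0.5, 1.0])
```

In the toy example, three of four people are poor (H = 0.75) with an average intensity of 2/3, so the resulting MPI is 0.5.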
The study focuses on Senegal, a sub-Saharan country that suffers from persistently high poverty. This study uses mobile phone data in the form of CDRs, and data related to food security (availability and access components), economic activity, and access to services are grouped together as environmental data (Table 1). The CDR variables capture not only the basic phone use statistics of a user but also the regularity, diversity, and spatiotemporal variability in the user's mobile interactions. Tables S1 and S2 detail the variables extracted from CDR and environment data, respectively. The poverty maps are produced at the spatially finest level of policy planning, called "communes," and validated at that level using the concurrent census data. Current poverty maps, based on Global MPI (see Fig. 1) and consumption-based measures (14), do not exist uniformly for all communes of Senegal. The map produced by our analysis is available for all 552 communes (see Fig. 2). Such maps can be generated frequently in between cycles of surveys and census, since CDR and environmental data are available at fine temporal granularity.
[Table caption: The results are compared when single source data are available. Corr., Pearson's r correlation; rank corr., Spearman's rank correlation; RMSE, rms error. For both types of correlations, all P values were less than $10^{-20}$. An SD associated with the multiple runs for each measurement is reported within parentheses.]
Our objective is to present a computational framework that integrates disparate data sources to accurately predict the Global MPI and its individual dimensions at the finest level of spatial granularity. This framework consists of models trained independently on each data source. Each source-specific model uses Gaussian process (GP) regression (GPR) (15) to infer poverty values. GP falls under the class of kernel methods, where the choice of different kernel functions enables one to learn different nonlinear relationships between the independent and target variables. Each GP-based model provides a probabilistic estimate of poverty for a given commune, including the mean and variance of the estimates. The variance provides a measure of uncertainty, which allows us to combine the predictions from the multiple data sources. An important advantage of this methodology is that the different data ecosystems need not share any data between them. The individual datasets remain private within their specific ecosystems, and only the output predictions and the associated variances are shared.
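The variance-based combination of two source-specific predictions can be sketched as follows. This is a minimal illustration, assuming that the mixing weights are proportional to the inverse predictive variances (one common choice that favors the more confident source); the function name and the numbers in the usage example are hypothetical. Only the per-source means and variances enter the computation, which mirrors the point that no raw data need to be shared.

```python
def combine_predictions(mean_c, var_c, mean_e, var_e):
    """Combine two Gaussian predictions as a two-component mixture.

    Assumption of this sketch: mixing weights are proportional to the
    inverse predictive variances. Returns (mean, variance, weight_cdr).
    """
    if var_c <= 0 or var_e <= 0:
        raise ValueError("predictive variances must be positive")
    w_c = (1.0 / var_c) / (1.0 / var_c + 1.0 / var_e)
    w_e = 1.0 - w_c
    # First moment of the mixture.
    mean = w_c * mean_c + w_e * mean_e
    # Second moment of the mixture, then variance = E[y^2] - (E[y])^2.
    second_moment = w_c * (var_c + mean_c ** 2) + w_e * (var_e + mean_e ** 2)
    return mean, second_moment - mean ** 2, w_c


# Hypothetical commune: the CDR model is more confident than the environment model.
mean, var, w_cdr = combine_predictions(0.40, 0.01, 0.60, 0.03)
```

Note that the mixture variance also grows when the two sources disagree, because the spread between the two means contributes to the second moment.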

Results
GP Model for Predicting Poverty from a Single Data Source. To predict poverty for a commune from a single data source (CDR or environment), the following model is assumed:

$$y_i = \beta^\top x_i + f(x_i) + \epsilon_i, \qquad [1]$$

where $y_i$ is the target poverty value and $x_i$ is a vector of independent variables derived from the particular data source for the $i$th commune. The first term is a linear combination of the independent variables. The function $f(\cdot)$ models the nonlinear relationship between $y_i$ and $x_i$. The residual term, $\epsilon_i$, models the remaining unexplained noise and is modeled as a zero-mean Gaussian random variable, that is, $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$. Without the nonlinear term $f(\cdot)$ in Eq. 1, the model is equivalent to ordinary linear regression. However, a linear model is not rich enough to capture the relationships between the target and the independent variables (see Fig. S6), thus motivating the need for a nonlinear term. Instead of assuming a fixed parametric form for $f(\cdot)$, we adopt a nonparametric approach, by assuming a GP prior on $f(\cdot)$. The generative process thus becomes:

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \qquad y_i \mid f \sim \mathcal{N}\big(\beta^\top x_i + f(x_i), \sigma_n^2\big).$$

A GP is a stochastic process, indexed by $x \in \mathbb{R}^d$. Any finite sample generated from it is jointly multivariate normal (15). $m(x)$ is the mean of $f(x)$ and $k(x, x')$ is a kernel function that defines the covariance between any two evaluations of $f(x)$, that is,

$$k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big].$$

For model simplicity, we assume that $m(x) = 0$, which is a standard practice in GP-based methods (15).
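For a test commune with input $x_*$, the predictive distribution of the target $y_*$ is Gaussian, with mean and variance given by the standard GP regression expressions (15):

$$\bar{y}_* = \beta^\top x_* + k^\top (K + \sigma_n^2 I)^{-1} (y - X\beta), \qquad \sigma_*^2 = k_* - k^\top (K + \sigma_n^2 I)^{-1} k + \sigma_n^2.$$

Here, $y = [y_1, y_2, \ldots]^\top$, $X$ is the matrix whose rows are the training inputs $x_i$, $K$ is a matrix that contains the kernel function evaluation on each pair of training inputs, that is, $K[i, j] = k(x_i, x_j)$, $k$ is a vector of the kernel computation between each training input and the test input, that is, $k[i] = k(x_*, x_i)$, $k_* = k(x_*, x_*)$, and $I$ is an identity matrix.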
Choice of Kernel Function. The role of the kernel function is to specify how the function values $f(x)$ and $f(x')$ vary as a function of their corresponding inputs $x$ and $x'$. We use the following kernel function:

$$k(x, x') = \sigma_f^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2 l^2}\right) \exp\left(-\frac{\lVert x_s - x'_s \rVert^2}{2 l_s^2}\right),$$

where $x_s$ and $x'_s$ are the spatial coordinates (latitude, longitude) of the commune centers corresponding to $x$ and $x'$, respectively. The first exponent term captures nonlinear dependencies in the feature space. The second exponent term plays the same role in the geographic space and models the spatial autocorrelation as a continuous function, in the same way as Kriging, a widely used method in geostatistics (16). The parameter $\sigma_f^2$ is the variance of the stochastic process $f$, $l$ is the process length scale for the feature space part, and $l_s$ is the process length scale for the spatial part.
The quantities $\beta$, $l$, $l_s$, $\sigma_n^2$, and $\sigma_f^2$ are estimated by maximizing the marginalized log-likelihood of the training data, as discussed in Materials and Methods. To remove the effect of spurious features, we couple the GP model with elastic net regularization (17) during the model learning phase. This allows for automatic relevant feature selection and learning a parsimonious model that improves interpretability.
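To make the single-source model concrete, the following sketch computes a GP predictive mean and variance for one test commune. It is a simplified illustration under stated assumptions: the kernel is taken to be a product of two squared-exponential terms (one over features, one over spatial coordinates), the hyperparameters and the linear coefficients are fixed by hand rather than estimated by maximizing the marginal likelihood, and the elastic net step is omitted. All function and parameter names are hypothetical, and the code uses only the Python standard library.

```python
import math


def kernel(x, xs, z, zs, sigma_f2=1.0, ell=1.0, ell_s=1.0):
    """Assumed kernel: feature-space term times spatial term."""
    d_feat = sum((a - b) ** 2 for a, b in zip(x, z))
    d_geo = sum((a - b) ** 2 for a, b in zip(xs, zs))
    return sigma_f2 * math.exp(-d_feat / (2 * ell ** 2)) * math.exp(-d_geo / (2 * ell_s ** 2))


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(X, S, y, beta, x_star, s_star, sigma_n2=0.01, **hyper):
    """Predictive mean and variance of the target at one test input.

    X: feature vectors, S: (lat, lon) pairs, y: targets, beta: linear weights.
    """
    n = len(X)
    # K + sigma_n^2 I over the training communes.
    K = [[kernel(X[i], S[i], X[j], S[j], **hyper) + (sigma_n2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    # Residuals after removing the linear term.
    resid = [y[i] - sum(b * v for b, v in zip(beta, X[i])) for i in range(n)]
    k_vec = [kernel(x_star, s_star, X[i], S[i], **hyper) for i in range(n)]
    alpha = chol_solve(L, resid)
    solved_k = chol_solve(L, k_vec)
    linear = sum(b * v for b, v in zip(beta, x_star))
    mean = linear + sum(k * a for k, a in zip(k_vec, alpha))
    k_star = kernel(x_star, s_star, x_star, s_star, **hyper)
    var = k_star - sum(k * w for k, w in zip(k_vec, solved_k)) + sigma_n2
    return mean, var
```

A test input far from every training commune receives a prediction close to the linear term alone and a variance close to the prior variance plus noise, which is the behavior that makes the predictive variance usable as a confidence measure.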
Combining Source-Specific Models. To predict poverty for a commune, we use two independently trained models specified in Eq. 1, corresponding to the two data sources of CDRs and environmental data. Each model produces a posterior Gaussian distribution, denoted by $y_{ic} \sim \mathcal{N}(\bar{y}_{ic}, \sigma_{ic}^2)$ and $y_{ie} \sim \mathcal{N}(\bar{y}_{ie}, \sigma_{ie}^2)$ for the CDR and environmental data, respectively. The combined poverty estimate, $y_i$, is assumed to be a mixture distribution consisting of the two Gaussians defined above, with the mixing weights defined as:

$$w_{ic} = \frac{\sigma_{ic}^{-2}}{\sigma_{ic}^{-2} + \sigma_{ie}^{-2}}, \qquad w_{ie} = \frac{\sigma_{ie}^{-2}}{\sigma_{ic}^{-2} + \sigma_{ie}^{-2}}.$$

The weights assign greater importance to the source that provides a smaller predictive variance, signifying higher confidence in the prediction for the particular commune. The mean and the variance for the combined poverty estimate are (see Estimating Moments of a Mixture Distribution):

$$\bar{y}_i = w_{ic}\,\bar{y}_{ic} + w_{ie}\,\bar{y}_{ie}, \qquad \sigma_i^2 = w_{ic}\big(\sigma_{ic}^2 + \bar{y}_{ic}^2\big) + w_{ie}\big(\sigma_{ie}^2 + \bar{y}_{ie}^2\big) - \bar{y}_i^2.$$

Predicted MPI Poverty Values. The predicted map of MPI for microregions, that is, the 552 communes of Senegal, is depicted in Fig. 2, Left. Compared with the current poverty map in Fig. 1, our map highlights heterogeneity in the existence of poverty within each macroregion. The communes toward the interior of the country have more poverty compared with the rest. The western regions, containing the capital city Dakar, and communes neighboring the coastal boundary are less poor than the rest of the country. Of special interest is the spatially large division in the south, consisting of the regions of Tambacounda, Kedougou, and Kolda, which are depicted as one color on the current map in Fig. 1 but have communes of varying poverty values spread throughout. Interestingly, the communes in the Kedougou region in the extreme southeast corner of Senegal are predicted as wealthier than other communes within the region. The communes in the region of Ziguinchor, in the southwest corner, are wealthier compared with other communes in the south. This is attributed to the fact that Ziguinchor is the second largest city in Senegal, with the economic advantage of being a port and a tourist center. The 121 urban centers are shown as small circles on the map and, in general, have lower poverty values compared with rural communes. The population in urban centers is generally richer than the population living in adjacent rural communes. This is true even for very poor communes of Senegal in the regions of Kaffrine and Tambacounda in the center, for which the contrast is even higher. The urban centers bordering the neighboring country Mauritania, in the northeast, are wealthier; this could be attributed to the economy of the Senegal river basin and to cross-border trade. The predominantly urban areas in Dakar are shown enlarged in the map. All communes in Dakar are more well-off than the rest of Senegal because of the concentration of economic activity over the years.
A quantitative validation of the predictions is provided against commune-level poverty values estimated from census data (see Fig. 2, Right) using cross-validation (CV) procedures (details in Materials and Methods). A standard CV is often performed to ensure that the model generalizes to out-of-sample data. We performed a standard 10-fold CV, where the data are randomly split into 10 folds. Each time, nine folds are used for training, and a single fold is used for evaluation, meaning we randomly assign 90% of communes to the training set and evaluate on the remaining 10% of communes. This procedure is repeated 250 times to provide a robust assessment of the variability of model parameters and prediction statistics. Using standard CV, the model gives a Pearson's correlation of 0.94, with a P value of <0.0001. Though training and evaluation data are selected randomly, the above-described method of validation may prove to be insufficient, as the poverty deprivations tend to be spatially correlated. Thus, a model may appear to perform well when evaluated this way, even though it may have poor extrapolation power in the spatial sense. The above results are provided for comparison.
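The standard 10-fold procedure can be sketched as below. This is an illustrative splitter only, written with the Python standard library; the function name and the seeding scheme are assumptions of this sketch, and repeating the procedure simply means calling it again with a different seed.

```python
import random


def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for one random k-fold split of n items."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# One split of the 552 communes; repeated runs would vary the seed.
splits = list(kfold_indices(552, k=10, seed=0))
```

Because communes are assigned to folds at random, neighboring communes often land on both sides of the split, which is why this scheme can overstate performance when the target is spatially correlated.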
To measure the extrapolation ability of the model to spatial areas that were not represented in the training data, we use a spatial CV procedure (18) (details in Materials and Methods). Here, the training and evaluation sets are sampled from geographically distinct regions, ensuring that the model is tested rigorously. The experiments were repeated 250 times with random samples of training and evaluation sets, while ensuring that all communes are represented in the evaluation. We report Pearson's and Spearman's correlations, and rms error (RMSE), averaged over the multiple CV runs. The predictions in Fig. 2, Left have a spatially cross-validated Pearson's correlation of 0.91 and rank correlation of 0.87, with P values less than $10^{-20}$ for both tests, indicating strong significance. This emphasizes the efficacy of our model in predicting poverty values accurately at the finest spatial granularity, using multisource data.
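One simple way to obtain geographically distinct training and evaluation sets is to hold out whole regions, as sketched below. This is an assumption of the sketch and not necessarily the exact procedure of ref. 18 or of Materials and Methods; the function name, the region labels, and the number of held-out regions are hypothetical.

```python
import random


def spatial_split(region_of, n_eval_regions=2, seed=0):
    """Split communes so that evaluation communes come from held-out regions.

    region_of: dict mapping commune id -> region label.
    Returns (train_ids, eval_ids, held_out_regions).
    """
    rng = random.Random(seed)
    regions = sorted(set(region_of.values()))
    held_out = set(rng.sample(regions, n_eval_regions))
    train = [c for c, r in region_of.items() if r not in held_out]
    evaluation = [c for c, r in region_of.items() if r in held_out]
    return train, evaluation, held_out


# Toy layout: 50 communes spread evenly over 5 regions.
region_of = {i: f"R{i % 5}" for i in range(50)}
train, evaluation, held_out = spatial_split(region_of, n_eval_regions=2, seed=1)
```

Repeating the split with different seeds until every region has been held out at least once would ensure that all communes appear in some evaluation set.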
As a comparative study of how our model performs using multisource and single-source data, we experimented with three datasets (Multisource, CDR, and Environment) to predict H, A, and MPI at the commune level (see Table 2). We report highly accurate results for all three targets (H, A, and MPI). Rank correlations are preserved, as we report a Spearman's correlation of 0.85 for both H and A. The values of Pearson's r correlation are much higher than rank correlation, across all prediction tasks, indicating the linear correspondence of the poverty values with the predicted ones. We report significantly low P values ($< 10^{-34}$) for spatial CV compared with standard CV, signifying more stable performance. For detailed results, see Table S5. Table 2 shows that combining multiple data sources (CDRs and environmental data) results in a consistent improvement of accuracy over using the individual data sources. The improvement is more pronounced in the detailed results for all of the indicators of poverty, given in Table S5. Fig. 3, Left plots the relationship between MPI values predicted by our model and those estimated from census. We observe a linear relationship, in general, for MPI, with lower values for urban areas (shown in red) and higher values for rural areas (shown in blue). Predominantly urban communes of Dakar and a few urban centers are underestimated for poverty (i.e., they are predicted richer than they are). Likewise, there are very few rural communes where poverty is overestimated. We also observe that for communes with lower population densities, the predicted variance is comparatively higher than it is for communes with higher densities, signifying that fewer data points in the vicinity of a given commune contribute to its higher variance (see Fig. S5).
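The three reported statistics (Pearson's r, Spearman's rank correlation, and RMSE) can be computed as in the following sketch, written with the Python standard library only. The function names are hypothetical, and the toy numbers in the usage example are not results from this study.

```python
import math


def pearson(x, y):
    """Pearson's r between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(x, y):
    """Root mean squared error between predictions and reference values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Toy data: a monotone but nonlinear relationship.
census = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]
r, rho = pearson(census, predicted), spearman(census, predicted)
```

In the toy data the ordering is perfectly preserved, so the rank correlation equals 1 while Pearson's r is below 1; the two statistics therefore answer different questions about a set of predictions.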
Predicted Values for the Dimensions of Poverty. Global MPI consists of 10 individual deprivation indicators grouped along three dimensions: (i) education (indicators: years of schooling and school attendance), (ii) health (indicators: child mortality and nutrition), and (iii) standard of living (indicators: cooking fuel, sanitation, access to drinking water, electricity, floor, and asset ownership). Each individual deprivation indicator is taken as the target of our model, and the averaged spatially cross-validated results, along the three dimensions, are reported in Table 2. Detailed results for each of the 10 indicators are given in Table S5.
Referring to Table S5, we note that the accuracy of the model is high for some deprivations and good for most deprivations. All deprivations are better predicted using CDR data, probably because they characterize the individual behavior while environmental data depict conditions that might have an influence on poverty (see Tables S1 and S2). Fig. 3, Top Right compares our predictions for asset ownership with those estimated from the census. Rural communes (depicted in blue) are clustered closely toward high deprivation. The urban areas generally have lower deprivation than rural areas, though their values are more spread out.
Indicators related to education (years of schooling and school attendance) are predicted well, because use of short message service (SMS) is indicative of literacy (19). The environmental data also perform well, because they capture the distance to schools, main roads, and urban centers, all of which facilitate access to educational attainment. Fig. 3, Bottom Right shows that all areas of Senegal are deprived in education, as the rural (in blue) and urban (in red) points are spread evenly on the plot. However, rural areas tend to dominate at the very high deprivation index, while very low deprivation areas are urban.
The model performs poorly for the indicators within the health dimension, that is, child mortality and nutrition. This is attributed to the fact that our data are not representative of the child population, and thus, the features extracted from CDR data do not capture this deprivation. A similar inference can be drawn for poorer correlations for nutrition. Moreover, the validation of deprivation values computed from the census for nutrition indicators is based on two hunger-related questions, as detailed nutritional information is not available to us (see Table S7 for details).
Dimensions of Poverty-Interpretation of Weights. Figs. S2 and S3 display the features deemed important by our model for the environment and CDR data, respectively. The important features are those for which the corresponding entries in the coefficient vector, β, are high in magnitude. We ignore child mortality and nutrition, as our model does not perform very accurately for these two indicators. The following interpretations are given for information purposes. These are, by no means, indicators of causality.
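Ranking features by the magnitude of their coefficients, as described above, can be sketched as follows. The function name, the feature names, and the coefficient values are hypothetical toy inputs chosen for illustration; they are not coefficients estimated in this study.

```python
def top_features(names, beta, k=5):
    """Return the k (name, coefficient) pairs with the largest |coefficient|."""
    ranked = sorted(zip(names, beta), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]


# Toy coefficient vector (illustrative values only).
names = ["nighttime_lights", "road_density", "elevation"]
beta = [-0.8, 0.3, 0.05]
important = top_features(names, beta, k=2)
```

The sign of each retained coefficient indicates the direction of the association with the target, while the ranking itself depends only on the magnitude.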
Referring to Fig. S2, nighttime lights appear to be the most important feature regardless of the predicted dimensions, conforming to the current research (8,20). Nighttime lights show a strong correlation with MPI (Spearman correlation of −0.66). Urban areas and road density, two other important indicators of economic activity, are relevant but to a lesser extent. Even though the coefficient values of each dimension are not directly comparable, since each dimension was taken as a separate target, it is interesting to note that the weights of nighttime lights intensity for electricity and asset ownership deprivation are the highest. This result confirms previous findings (21) that access to electricity is correlated with nighttime lights (Spearman correlation of −0.67). Additional observations regarding water deprivation, food security (access component), and climate are given in Interpretation of Weights-Along the Dimensions of Poverty.
A similar analysis for the CDR features reveals several interesting insights regarding the relationship between poverty and the individual characteristics captured in CDR features. While we considered CDR features for each month individually, for the ease of visualization (see Fig. S3), we average the monthly values of the weights associated with each feature.
Here we discuss the CDR features that were selected by the model as the strongest predictors for the various targets. These features are listed in Table S6. One of the strongest negative predictors for most of the targets is the number of active days (for call and text), which characterizes that individuals in wealthier communes have monetary resources to recharge their phone and make/receive calls. The ratio of calls vs. text shows the preference for calls and emerges as an important factor to predict education-based deprivations. The feature "interevent time call" measures the irregularity in responding to calls/text and emerges as a positive predictor for deprivations. Features that indicate diversity in communication, such as entropy of contacts and interactions per contact (call and text), report a negative relationship to poverty. These results confirm previous findings (7, 22, 23) that diversity of an individual's relationships is positively correlated with his or her economic wellbeing. However, for features such as percent pareto interactions and balance of contacts, which are proportional to an individual's diversity in communication, we report a positive relationship with poverty. This counterintuitive relationship needs to be further studied in the context of telecommunication patterns in Senegal.
We observe a negative relationship between the "activeness" of an individual in his or her mobile interactions and poverty. For instance, the delay in responding to text has a positive relationship to poverty. Interestingly, the feature of percent initiated interactions (calls) has, again, a positive relationship to poverty, signifying that in Senegal individuals living in more deprived communes are more likely to initiate calls (for request of resources, etc.) than those living in less deprived communes. The mobility patterns of individuals, captured using spatial features such as number of frequent antennas, entropy of antennas, and total number of antennas used by an individual, indicate a negative relationship to poverty. Thus, individuals living in more deprived communes tend to move across fewer antennas than those living in less deprived communes. This observation should be viewed cautiously because of sparse antenna density in rural communes.
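Two of the CDR features discussed above, entropy of contacts and percent initiated interactions, can be illustrated with the sketch below. The exact feature definitions used in the study are those of Table S1; the Shannon-entropy formulation, the "out"/"in" direction labels, and the function names here are assumptions made for illustration.

```python
import math
from collections import Counter


def contact_entropy(contacts):
    """Shannon entropy (natural log) of the distribution of interactions over contacts."""
    counts = Counter(contacts)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def percent_initiated(directions):
    """Share of interactions initiated by the user (direction label "out")."""
    return sum(1 for d in directions if d == "out") / len(directions)


# Toy user: interactions split evenly between two contacts.
entropy = contact_entropy(["a", "a", "b", "b"])
initiated = percent_initiated(["out", "in", "out", "out"])
```

A user who interacts with a single contact has zero entropy, and the value grows as interactions are spread more evenly over more contacts, which is the sense in which the feature measures diversity in communication.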

Discussion
The technological advances over the past decade have led to the building of communication devices (like phones) and sensors (like satellites and weather and ground sensors) that produce and store a myriad of data. In this work, we show how these novel sources of data, which are characterized by their volume, variety, and associated uncertainty, can be used to generate accurate poverty maps.
We outline several challenges that lie in establishing relationships between auxiliary data sources (that are not collected to directly measure socioeconomic deprivations) and poverty. The first challenge occurs due to the varying spatial granularity at which the different datasets are available; this requires an aggregation mechanism to link them. CDR data are available for each subscriber, while environmental data have mixed spatial resolution, from very accurate vector data to low-resolution satellite imagery (1 km). On the other hand, census data are available for individuals or households, depending on the response variable. However, given that the individual information is anonymized for both CDRs and census data, there is no obvious way to link the records across these two datasets. In this work, we localize the individuals and/or households to their respective communes, or urban centers, by using their census information (details in Materials and Methods). This lets us calculate the commune-level deprivations. For CDRs, the individuals are localized to their home antennas based on their most frequent night location. The CDR and environmental data are aggregated to commune levels. Though we have taken a commune as the level of aggregation, the framework allows for the same analysis at even finer spatial resolutions.
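The localization of a subscriber to a home antenna from the most frequent night location can be sketched as follows. The night window (19:00 to 07:00), the record layout, and the function name are assumptions of this sketch; the study's own settings are described in Materials and Methods.

```python
from collections import Counter
from datetime import datetime


def home_antenna(records, night_start=19, night_end=7):
    """Return the antenna seen most often at night, or None if there is no night record.

    records: iterable of (timestamp, antenna_id) pairs for one subscriber.
    The night window is an assumption of this sketch.
    """
    night = [antenna for ts, antenna in records
             if ts.hour >= night_start or ts.hour < night_end]
    if not night:
        return None
    return Counter(night).most_common(1)[0][0]


# Toy subscriber: active near antenna "B" by day, near antenna "A" at night.
records = [
    (datetime(2013, 1, 1, 23, 0), "A"),
    (datetime(2013, 1, 2, 2, 0), "A"),
    (datetime(2013, 1, 2, 13, 0), "B"),
    (datetime(2013, 1, 2, 14, 0), "B"),
    (datetime(2013, 1, 2, 15, 0), "B"),
    (datetime(2013, 1, 3, 21, 0), "C"),
]
home = home_antenna(records)
```

Once each subscriber has a home antenna, subscriber-level features can be grouped by the commune that contains the antenna and averaged to obtain commune-level variables.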
A key concern associated with using CDR data for population-level analyses is the selection bias arising from mobile phone ownership. In Senegal, however, there were 92.93 mobile phone subscriptions per 100 inhabitants in 2013, which implies that most of the population owns cell phones (24). The second challenge is the bias arising when using data from only one provider. However, the provider of the data used here, Sonatel, had nearly 62% of the cell phone market in 2013 (25). The third concern is that some demographic subgroups like children and the ultra poor are left out by the analysis while only using CDR data. Also, results may be biased toward urban regions, rather than rural regions, because of factors like lack of electricity in rural areas.
Here, we used two distinct types of environment data. The first type includes static natural/physical environment variables (like elevation, soil types, etc.) or long-term dynamic phenomena (like climate). The second type includes human-induced aspects, like urban areas, roads, access to facilities, and so forth. Though the natural environment acts as a constraint in designing poverty eradication plans, effective policies and sustainable approaches should be made an integral part of policy planning. Environmental features derived from satellite images (nighttime lights, NDVI, etc.) have the potential to be computed in near real-time to monitor the impact of shocks such as natural hazards, armed conflicts, or crop pests that can rapidly cause serious deprivations. However, for reliability, these variables need to be aggregated for a longer period, typically at an annual level for nighttime lights and for the growing season for NDVI. OpenStreetMap (OSM) data, which are used to map facilities and roads, are crowd-sourced and therefore have the (theoretical) potential to be updated in near real time. This capability could be limited in African countries. Due to the above constraints, 1 y is probably the relevant period for consistent monitoring of poverty with our method (compare with 3-5 y for a detailed and costly census).
Another challenge is the ease of availability of data. Environmental datasets are available to researchers for free and typically have no privacy constraints, especially at the resolution at which it is analyzed here. CDR data are collected by commercial telecommunication entities and might suffer from lack of accessibility to researchers due to sharing constraints between different organizations. However, our methodology requires no raw data to be shared between different data-owning entities; only the output predictions from each individual model and the associated uncertainties are combined at the final step.
An important consideration is the number of features extracted from the data. Recent work (20) has used four features, namely, call volume and mobile ownership per capita, nightlights, and population density, to estimate the MPI of sectors in Rwanda using a linear regression model. As a baseline for our model, we used the same features and model to predict MPI values at the commune level in Senegal. A spatially cross-validated Pearson's correlation of 0.84 was achieved with a significant P value (<0.0001) (see Table S8 for comparison). Although fewer features provide computational tractability of analysis, they offer no insight into other features that could be useful in understanding poverty. Also, linear models are limited by the linearity assumption and by their sensitivity to outliers.
An important advantage of our GPR model is that each predicted poverty value is associated with an uncertainty (generated by the model). This highlights the strength of confidence in the predictions and can be used as guidance by policy makers. Comparing these source-specific uncertainties can reveal which data hold a better signal for a specific prediction (see Fig. 4). We note that for predicting A, the predictions of CDRs and environment data are comparable for most of the communes. For predicting H, CDRs perform with lower uncertainties than environmental data. These variations may be attributed to multiple reasons, including resolution and concurrency of data, demographics and mobile penetration of the cellular provider, and spatial heterogeneity of poverty deprivations.
Though we have discussed the methodology for predictions at the commune level, our predictions of MPI and associated dimensions can be successfully aggregated to coarser administrative units, if needed, for policy planning. Since we use Global MPI as the poverty index, its limitations, as noted by Global MPI researchers (26), are applicable to our study as well. In particular, Global MPI does not include characteristics such as parents' education, social norms and beliefs, empowerment, etc.
Additionally, it will be interesting to see how well this methodology can be used to predict other indicators of deprivation and inequality, like the GINI index, at the microregional level. Apart from being useful in producing interim statistics in between long cycles of census and surveys, such methodology can also be extended to places of conflict or remote areas that are difficult to reach by census takers.
As described in the results, the interpretation of the model coefficients provides some insights on the dimensions of MPI. However, due to the number of variables, this interpretation is still complex and not necessarily straightforward for policy intervention. Conversely, the MPI dimensions are well-known factors for which policy planning is feasible (26). As an illustration, Fig. S4 shows the highest predicted deprivation for each commune within each dimension.
Lastly, though GPR model uncertainty is impacted by the bias and inaccuracy of each data source (quality of soil type map, interpolation of climatic data, missing facilities, mobile operator's market share), a higher resolution and accuracy of the input data should benefit the modeling relevance and quality.

Materials and Methods
Target Country. Senegal is a sub-Saharan country that ranks 170 on the Human Development Index with a score of 0.466 and a population of 14.5 million (with 43.5% urban population) (27). As one of the poorest countries in the world, it has 52% of the population living in multidimensional poverty (27). On the other hand, there are 98.8 mobile phone subscriptions per 100 people (24). Senegal is composed of 14 coarsest administrative units called regions, which are further divided into 45 administrative units called departments. The finest level of administrative units is called a commune. There are 552 communes (121 as urban centers and 431 rural) (Fig. 1).
Data Sources. CDRs. A CDR consists of identifiers for the caller and callee, the antenna location of the caller, the time of the call, the duration of the call, and a flag indicating whether the record is a text or a call. A CDR is generated each time a call or text is placed. The data belong to the subscribers of Sonatel, Orange, which is the dominant telecom provider in Senegal. The data are anonymized and span a period from January 1 to December 31, 2013. They contain more than 9.54 million unique aliased mobile phone subscribers. The population of Senegal in 2013 was 14.13 million. Additionally, the geographical coordinates of the mobile antennas are known and are shown in Fig. 1.
Environmental Features. Based on the literature, several environmental features that may have a relationship with poverty have been explored (see Table S1). They are based on Geographical Information System (GIS) data, Earth Observation data, or weather stations.
Census. The Agence Nationale de la Statistique et de la Demographie (ANSD), which is the National Statistics Office of Senegal, provided us with a 10% sample of the 2013 census [called RGPHAE (Recensement General de la Population de l'Habitat de l'Agriculture et de l'Elevage)]. The data are evenly sampled across the entire population of Senegal and are from 1.4 million individuals, spread across 150,000 households, characterizing information related to demographic statistics (mortality, fertility, migration, literacy, occupation, etc.), along with habitat features, such as type of roof, floor, access to drinking water, sanitation, and agriculture practices. The advantage of the census is that it represents important national statistics at the level of individuals. Brief statistics of the data sources are given in Table 1.
The mobile phone data used in this study can be obtained for replication purposes by contacting Zbigniew Smoreda (zbigniew.smoreda@orange.com).
Feature Extraction. CDRs. We have access to more than 11 billion mobile phone transactions involving calls and texts for a year in Senegal. Each time a call or text is placed, it is logged as a transaction. Missed, forwarded, and other undelivered calls were removed from the logs.
To extract important features that quantify the mobile use pattern of a subscriber, we focus on well-studied metrics capturing the individualistic, spatial, and temporal patterns of the subscriber (28–30). The individual aspects quantify the typical use pattern of an individual. Some of the metrics that belong to this category are the number of active days, the number of contacts, the average call duration, percent nocturnal, and so forth. Spatial metrics are the ones that quantify the typical movement pattern of an individual. Examples of spatial metrics for a subscriber include radius of gyration, entropy of antennas, and so forth. There are 43 core features (briefly described in Table S2), extracted using the Bandicoot toolbox (31). All features were calculated at monthly granularity capturing the temporal aspect of a subscriber, resulting in 43 × 12 CDR-based features.
The second step is to localize each subscriber, $i$, to his or her home antenna. A home antenna, $h_i$, is defined as the antenna from which the subscriber makes the most nocturnal calls (from 7 PM to 7 AM) during each month. We filtered out individuals who made fewer than five calls during each month and who were not active for at least half of the year within the range of their home antennas. This ensures that individuals are reliably allocated to their home antennas. After the filtering step, the sample contained 6.19 million individuals (65% of the original subscriber population).
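The home-antenna rule described above can be illustrated with a short Python sketch. It is not the authors' code: the record layout `(subscriber_id, antenna_id, timestamp)` and the function names are assumptions made for illustration, and the activity filters (minimum number of calls, activity for at least half of the year) are left out.

```python
from collections import Counter, defaultdict
from datetime import datetime


def is_nocturnal(ts: datetime) -> bool:
    """True for events placed from 7 PM up to (but not including) 7 AM."""
    return ts.hour >= 19 or ts.hour < 7


def monthly_home_antennas(records):
    """Assign each subscriber a home antenna for each month.

    records: iterable of (subscriber_id, antenna_id, timestamp) tuples
             (a hypothetical layout, not the actual CDR schema).
    Returns {(subscriber_id, (year, month)): antenna_id}, where the antenna
    is the one with the most nocturnal events in that month.
    """
    counts = defaultdict(Counter)
    for subscriber, antenna, ts in records:
        if is_nocturnal(ts):
            counts[(subscriber, (ts.year, ts.month))][antenna] += 1
    # most_common(1) returns the antenna with the highest nocturnal count
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}
```

Subscriber-months with no nocturnal activity simply do not appear in the result, which is one possible way of handling them.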
We then computed the average feature value for each antenna site by averaging the feature values of all individuals who consider that antenna as their home:
$$ m_a^{(f)} = \frac{1}{|\{i : h_i = a\}|} \sum_{i : h_i = a} m_i^{(f)}, $$
where $m_i^{(f)}$ is the $f$th feature value for subscriber $i$ and $h_i$ is that subscriber's home antenna. Finally, we compute the feature value for each commune $c$ as the weighted average, with weights $w_{c,a}$, over all antennas $a$ whose Voronoi polygon intersects the commune boundary. The weight $w_{c,a}$ is the ratio $\mathrm{Area}(c \cap a)/\mathrm{Area}(a)$, which is a measure of how much of the Voronoi cell for antenna $a$ falls within the boundary of commune $c$. To study how well the Voronoi-based approach has performed in assigning people to their communes, we study the correlation of the commune population estimated by this approach and that calculated from census. The Pearson's correlation is 0.85 with a P value of < 0.00001, which supports the validity of our approach.
Environmental features. In this study, we focus on three broad categories of environmental features: food security (divided into the availability and access components), economic activity, and access to services (see Table S1). These three categories cover most of the features that have been shown to be significantly related to poverty in the literature (see Table S3).
Food security is mainly described by agrometeorological measurements (temperature, precipitation, slope, elevation, soil type) that drive agricultural production (crop production), one of the most important inputs, along with livestock and fishing, to food availability in the country. On the other hand, access to staple food can be approximated by the average millet prices observed in the markets (retail prices in 56 local markets). Millet serves as the main local staple food crop in the country, making it a potentially good indicator of poverty. In addition, proximity to main roads and urban centers was also computed to describe the connectivity to major markets.
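The aggregation of CDR features from subscribers to antennas, and from antennas to communes through Voronoi overlap weights, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and array layouts are hypothetical, and dividing by the sum of the weights (so that the result is a proper weighted average) is an assumption.

```python
import numpy as np


def antenna_means(feature, home_antenna, n_antennas):
    """Mean of one feature over the subscribers homed at each antenna.

    feature[i] is the feature value of subscriber i and home_antenna[i] is
    the integer index of that subscriber's home antenna.
    Antennas with no home subscribers get NaN.
    """
    feature = np.asarray(feature, dtype=float)
    home_antenna = np.asarray(home_antenna)
    sums = np.bincount(home_antenna, weights=feature, minlength=n_antennas)
    counts = np.bincount(home_antenna, minlength=n_antennas)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)


def commune_features(antenna_values, overlap_area, antenna_area):
    """Area-weighted average of antenna values for each commune.

    overlap_area[c, a] = Area(c ∩ a); antenna_area[a] = Area(a).
    Weights are w[c, a] = Area(c ∩ a) / Area(a); normalising by the sum of
    the weights is an assumption of this sketch.
    """
    w = np.asarray(overlap_area, dtype=float) / np.asarray(antenna_area, dtype=float)
    v = np.asarray(antenna_values, dtype=float)
    valid = ~np.isnan(v)
    w = w * valid  # ignore antennas that have no home subscribers
    return (w @ np.where(valid, v, 0.0)) / w.sum(axis=1)
```

Antennas whose Voronoi cell does not intersect a commune contribute a weight of zero, so a dense overlap matrix can be passed without filtering.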
The economic activity corresponds to the intensity of urbanization. Among the studied features, the nighttime lights are the most frequently used to describe poverty using remote-sensing data (20). Moreover, a clear link between household wealth and the level of night light emissions has been shown before (32). The underlying hypothesis is that economic activity and urbanization are strong indicators of living standards.
Finally, access to services can help to predict some of the individual indicators of poverty. In particular, the proximity to schools, water towers, and hospitals can be used to determine the deprivation in education, water, and health, respectively.
The raw environmental data are available either in raster grid (at different spatial resolutions) or in vector format. As a first step, all vector data were converted into raster grid format. Then, all data layers were resampled (using the nearest neighbor approach) at a spatial resolution of 100 m. Pixel values falling within each commune's boundary were averaged to give a unique value for that commune.
All environmental data are available at high spatial resolution, with the exception of crop production and millet prices (see Table S1 for the data sources). Millet prices were available in 56 local markets, potentially missing some of the local heterogeneity. Production estimation features were derived from the Direction de l'Analyse, de la Prévision et des Statistiques Agricoles (DAPSA) database. The granularity of these features is at the department level. Cultivated areas were masked using the 2005 1:100,000 Scale Senegal Land Cover Map produced by the Global Land Cover Network based on the GlobCover 2005 map (33), which is the most accurate map for Senegal (34). Since reliable information on the spatial distribution of each crop is unavailable, we assumed that crops were grown evenly within the cultivated areas of a specific department. Therefore, the production of a specific department was distributed evenly among all of the 100-m pixels that fell within the cropland of this department. This raster was then used to aggregate the production estimations by communes.
The Normalized Difference Vegetation Index (NDVI) is used as a proxy of potential agricultural production within a department. The NDVI, defined as the difference between near-infrared and red reflectances normalized by the sum of the two, is a useful yield proxy in regions where water or soil fertility is the main limiting factor, such as the Sahel (35,36). For each pixel within cultivated areas, NDVI values above 0.2 during the growing season (July to November) were integrated (TNDVI), which limited the contribution of bare soil to the signal.
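As an illustration of the NDVI definition and the thresholded seasonal integration described above, a minimal NumPy sketch is given below. The function names are hypothetical, and reading "integrated" as the sum over time steps of the NDVI values that exceed 0.2 is an assumption; the authors may have used a different integration rule.

```python
import numpy as np


def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); NaN where the denominator is zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(denom != 0, (nir - red) / denom, np.nan)


def tndvi(ndvi_series, threshold=0.2):
    """Per-pixel sum of NDVI values above the threshold over the season.

    ndvi_series has shape (time, rows, cols) and covers the growing season.
    Values at or below the threshold (and NaN values) contribute zero.
    """
    series = np.asarray(ndvi_series, dtype=float)
    return np.where(series > threshold, series, 0.0).sum(axis=0)
```

In practice the series would first be restricted to pixels inside the cultivated-area mask before aggregation.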
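Model Training. The unknown parameters of each source-specific model in Eq. 1 are as follows: the parameter $\beta$ of the linear component, the hyperparameters of the kernel function ($s$, $\sigma_f^2$), and the variance of the error term, $\sigma_n^2$. These are estimated by maximizing the marginalized likelihood of the target poverty values in the training data $\mathbf{y}$. The marginalized likelihood is obtained by taking the integral of the likelihood times the prior:
$$ p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)\, d\mathbf{f}, \tag{11} $$
where the matrix $X$ contains the training input vectors as rows and $\mathbf{f}$ is a vector containing the latent function values for the inputs in $X$. The GP prior means that $p(\mathbf{f} \mid X) \sim \mathcal{N}(\mathbf{0}, K)$ and the likelihood is Gaussian, that is, $p(\mathbf{y} \mid \mathbf{f}, X) \sim \mathcal{N}(X\beta + \mathbf{f}, \sigma_n^2 I)$. The integration of Eq. 11 yields the following marginalized log likelihood (15) of the training data:
$$ \log p(\mathbf{y} \mid X) = -\frac{1}{2} (\mathbf{y} - X\beta)^{\top} (K + \sigma_n^2 I)^{-1} (\mathbf{y} - X\beta) - \frac{1}{2} \log \left| K + \sigma_n^2 I \right| - \frac{N}{2} \log 2\pi, $$
where $N$ is the number of training examples.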
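For readers who want to evaluate the marginalized log likelihood numerically, the following sketch computes it for a Gaussian process with a linear mean $X\beta$, a kernel matrix $K$, and Gaussian noise of variance $\sigma_n^2$. It is an illustration under stated assumptions rather than the authors' code: the kernel matrix is taken as a precomputed input (no particular kernel form is assumed), and the function name is hypothetical.

```python
import numpy as np


def log_marginal_likelihood(X, y, beta, K, sigma_n2):
    """log N(y | X beta, K + sigma_n2 * I), evaluated via a Cholesky factor."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    K = np.asarray(K, dtype=float)
    n = y.shape[0]
    r = y - X @ np.asarray(beta, dtype=float)  # residual of the linear part
    C = K + sigma_n2 * np.eye(n)               # covariance of y
    L = np.linalg.cholesky(C)                  # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))  # C^{-1} r
    data_fit = -0.5 * r @ alpha
    complexity = -np.log(np.diag(L)).sum()     # equals -0.5 * log|C|
    return data_fit + complexity - 0.5 * n * np.log(2.0 * np.pi)
```

Using the Cholesky factor avoids forming the explicit inverse and gives the log determinant as twice the sum of the log diagonal of `L`.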
To regularize the coefficients in β, we apply elastic net regularization to the marginalized log likelihood: the elastic net penalty on β is subtracted from the marginalized log likelihood to obtain the objective function J. The function J is maximized to estimate the hyperparameters using conjugate gradient descent (37). All code used to replicate the results can be obtained by writing to the corresponding author.
Regularization. Regularization techniques, such as those used in Lasso (38) or Ridge regression (39), are often used to improve model performance, especially when the data contain several irrelevant features. The L2 penalty, imposed by Ridge regression, ensures shrinkage of regression coefficients to avoid overfitting. On the other hand, the L1 penalty imposed by Lasso forces the coefficients to be sparse, thereby providing feature selection. However, neither of the two regularization methods has been found to universally dominate the other (38). For instance, in the presence of groups of correlated features, Lasso tends to select only one feature within each group, which leads to poor interpretability of the estimated coefficients. Elastic net regularization (17) is a weighted addition of L1 and L2 penalties and combines the strengths of both Lasso and Ridge regression. It is known to select a greater number of influential features than Lasso and has a lower false-positive rate than Ridge regression.
We used elastic net regularization to penalize complexity of the solution and to avoid overfitting on the limited training dataset. The elastic net penalty is a weighted sum of the L1 and L2 penalties on the coefficients β. Our empirical results show that elastic net regularization yields better prediction accuracy than ordinary least squares, Ridge, and Lasso regression.
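A weighted combination of L1 and L2 penalties can be written in several equivalent parameterizations. The sketch below uses one common form with an overall strength `lam` and a mixing weight `alpha`; both names and this particular parameterization are assumptions for illustration and may differ from the one used in the study.

```python
import numpy as np


def elastic_net_penalty(beta, lam=1.0, alpha=0.5):
    """lam * (alpha * ||beta||_1 + (1 - alpha) * ||beta||_2^2).

    alpha = 1 gives a pure Lasso (L1) penalty and alpha = 0 a pure Ridge
    (L2) penalty; lam and alpha are hypothetical hyperparameter names.
    """
    beta = np.asarray(beta, dtype=float)
    l1 = np.abs(beta).sum()
    l2 = (beta ** 2).sum()
    return lam * (alpha * l1 + (1.0 - alpha) * l2)
```

A penalized objective is then obtained by subtracting this value from a log likelihood that is to be maximized.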
Model Validation. This section details the steps followed to validate our model, namely creating commune-level poverty statistics from census data and methodology for spatial CV.
Creating commune poverty statistics from census. The 10% sample of the 2013 RGPHAE census, used here, has survey responses for 150,000 households and 1.4 million individuals pertaining to their socioeconomic indicators (literacy, birth and death in the family, etc.) and habitat (type of house, access to electricity and drinking water, etc.). Some survey responses are individualistic (like literacy and profession), while others are associated with the entire household (like type of roof, sanitation, electricity). The first step is to assign the individuals to their respective households using information from the fields in the census. The second step is to calculate per-household deprivations in the poverty indicators of interest. Global MPI computation (26) requires deprivations along three dimensions (with 10 indicators), namely health (child mortality, nutrition), education (child school attendance, years of schooling), and standard of living (electricity, sanitation, drinking water, flooring, cooking fuel, assets).
We follow a procedure similar to the widely used Alkire-Foster methodology for computing MPI (40). First, we create a deprivation vector $\mathrm{depvec}_{i,d}$ for each household $i$ over the poverty indicators $d = 1, \ldots, D$. Each vector entry is 1 if $y_{i,d} \le z_d$, where $y_{i,d}$ is the achievement of household $i$ in indicator $d$ and $z_d$ is the cutoff score in indicator $d$, and 0 otherwise. A value of 0 for $\mathrm{depvec}_{i,d}$ implies nondeprivation of the household in that particular indicator. For the values of cutoff scores for different indicators, see Table S7. We aggregate all households that are deprived in a particular indicator, for each commune, and divide by the total number of households in that commune. This score gives the proportion of households deprived in a particular indicator within a commune.
Since MPI is a multiplicative combination of H and A (that is, MPI = H × A), we first calculate H and A. For H, we introduce a weight, $w_d$, for each indicator $d$. For each household, we compute a weighted deprivation score, $c_i = \sum_{d=1}^{D} w_d \, \mathrm{depvec}_{i,d}$. The weights $w_d$ are assigned as follows. The education- and health-related indicators are each given a weight of 1/6, while each of the six standard of living indicators is given a weight of 1/18. Thus, each dimension has a weight of 1/3. $H_j$, which is the relative headcount of poor households in commune $j$, is calculated as:
$$ H_j = \frac{1}{N_j} \sum_{i \in j} I(c_i > \theta), $$
where the sum runs over the households $i$ in commune $j$, $\theta$ is a cutoff, whose higher values mean a higher cutoff for household achievement, and $I(c_i > \theta)$ is the indicator function. $N_j$ is equal to the total number of households in the $j$th commune.
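The weighting and headcount steps can be made concrete with a small sketch for a single commune. It is illustrative only: the column order of the deprivation matrix (two health, two education, then six standard of living indicators) is an assumption, the threshold of 0.3 is the value the study reports for θ, and A is computed as the mean weighted deprivation score among poor households, which is the usual Alkire-Foster intensity and is treated here as an assumption.

```python
import numpy as np

# Assumed column order: 2 health + 2 education indicators (1/6 each),
# followed by 6 standard-of-living indicators (1/18 each).
WEIGHTS = np.array([1 / 6] * 4 + [1 / 18] * 6)


def commune_mpi(depvec, theta=0.3):
    """Return (H, A, MPI) for one commune.

    depvec: (households, 10) matrix of 0/1 deprivation flags.
    H is the share of households with weighted score c_i > theta;
    A is the mean c_i among those poor households (0 if there are none).
    """
    depvec = np.asarray(depvec, dtype=float)
    c = depvec @ WEIGHTS                 # weighted deprivation score per household
    poor = c > theta
    H = poor.mean()
    A = c[poor].mean() if poor.any() else 0.0
    return H, A, H * A
```

For example, a commune with one household deprived in every indicator, one deprived in none, and one deprived only in the two health indicators has two poor households out of three.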
To calculate A, we count only the poor households, and their deprivations; that is, $A_j$ is the average weighted deprivation score among the poor households of commune $j$:
$$ A_j = \frac{\sum_{i \in j} c_i \, I(c_i > \theta)}{\sum_{i \in j} I(c_i > \theta)}, $$
where the sums run over the households $i$ in commune $j$. The value of threshold θ is taken as 0.3. We varied θ from 0.
Spatial CV. To measure the extrapolation capacity of the model on out-of-sample data, spatial CV techniques, where training and evaluation sets are sampled from geographically distinct regions, are more robust (18,41). The following spatial CV strategy was adopted: For each CV run, we first randomly sampled a region r from the set of 14 regions and then randomly sampled a commune c belonging to r. All communes that lie within distance d of the commune c are included in the training dataset. The remaining communes are included in the evaluation dataset. This strategy ensures that communes from all regions of Senegal are represented in the training and evaluation datasets during CV. To ensure that the training dataset has enough examples, we forced at least 40% of the communes (225) to be included in the training dataset. To achieve this, d is initially set to 100 km and is increased by 50 km until the size of the training dataset meets the threshold. CV is repeated 250 times. We report the mean predictive performance (using Pearson's and Spearman's correlation and RMSE values) on the evaluation dataset, along with the SD across multiple runs.
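One run of the spatial CV split described above can be sketched as follows. This is a hedged illustration rather than the authors' code: communes are represented by planar centroid coordinates in kilometres (an assumption; the study may measure distance differently), and the function and argument names are hypothetical.

```python
import numpy as np


def spatial_split(coords_km, region_of, rng, d0=100.0, step=50.0, min_frac=0.4):
    """Return (train_idx, eval_idx) for one spatial CV run.

    coords_km: (n, 2) commune centroids in km (assumed planar coordinates).
    region_of: length-n array giving the region label of each commune.
    A region is sampled, then a seed commune within it; every commune within
    distance d of the seed goes to training, where d starts at d0 and grows
    by `step` until at least min_frac of all communes are in training.
    """
    coords_km = np.asarray(coords_km, dtype=float)
    region_of = np.asarray(region_of)
    region = rng.choice(np.unique(region_of))
    seed = rng.choice(np.flatnonzero(region_of == region))
    dist = np.linalg.norm(coords_km - coords_km[seed], axis=1)
    d = d0
    while (dist <= d).mean() < min_frac:
        d += step
    train = dist <= d
    return np.flatnonzero(train), np.flatnonzero(~train)
```

Repeating this split with a shared generator such as `np.random.default_rng(0)` and averaging the evaluation metrics over the runs mirrors the repeated CV procedure in outline.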