New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
International migration beyond gravity: A statistical model for use in population projections

Contributed by Joel E. Cohen, August 18, 2008 (received for review June 18, 2008)
Abstract
International migration will play an increasing role in the demographic future of most nations if fertility continues to decline globally. We developed an algorithm to project future numbers of international migrants from any country or region to any other. The proposed generalized linear model (GLM) used geographic and demographic independent variables only (the population and area of origins and destinations of migrants, the distance between origin and destination, the calendar year, and indicator variables to quantify nonrandom characteristics of individual countries). The dependent variable, yearly numbers of migrants, was quantified by 43653 reports from 11 countries of migration from 228 origins and to 195 destinations during 1960–2004. The final GLM based on all data was selected by the Bayesian information criterion. The number of migrants per year from origin to destination was proportional to (population of origin)^{0.86}(area of origin)^{−0.21}(population of destination)^{0.36}(distance)^{−0.97}, multiplied by functions of year and countryspecific indicator variables. The number of emigrants from an origin depended on both its population and its population density. For a variable initial year and a fixed terminal year 2004, the parameter estimates appeared stable. Multiple R^{2}, the fraction of variation in log numbers of migrants accounted for by the starting model, improved gradually with recentness of the data: R^{2} = 0.57 for data from 1960 to 2004, R^{2} = 0.59 for 1985–2004, R^{2} = 0.61 for 1995–2004, and R^{2} = 0.64 for 2000–2004. The migration estimates generated by the model may be embedded in deterministic or stochastic population projections.
International migration will play an increasing role in the demographic future of nations if fertility continues to decline in most countries. In projecting international migration, the United Nations Population Division (ref. 1, paragraphs 57–59) identified the need for a demographically plausible, programmable algorithm that automatically projects a zero world balance of net migration and prevents projected net emigration from completely depleting the population of any sending country. To meet this need, we propose an algorithm (based only on demographic and constant geographic variables) for projecting future numbers of international migrants from any country or region to any other. It is comparable in transparency and generality to standard cohortcomponent methods of projecting births and deaths. The approach presented here is different from methods of projecting migrant flows currently practiced in international demographic institutions, the United States, European countries, and other developed countries (2–6).
Most theories of international migration draw on social, economic and/or political factors to explain migration (4, 7–9), such as differences among countries in gross domestic product, labor markets, migration policies, social networks of prior migrants, and cognitive and behavioral attributes of individuals (3, 10–11). For multidecadal demographic projections, it seems more difficult to project such nondemographic variables than it is to project demographic variables such as fertility and mortality. The proposed model assumes the availability only of constant geographic variables and of population sizes that can be projected incrementally in time by accepted demographic procedures. The model makes possible both deterministic and stochastic projections of migration and hence of population.
The intellectual antecedents of the proposed model include Zipf's (12, 13) model of intercity migration, which is one of several “gravity” models in the social sciences (6). Zipf (12) aimed to “show with supporting data that the number of persons that move between any two communities in the United States whose respective populations are P_{1} and P_{2} and which are separated by the shortest transportation distance, D, will be proportionate to the ratio, P_{1}·P_{2}/D, subject to the effect of modifying factors.” Unlike Zipf, we distinguish the number of people who move from community 1 to community 2 from the number of people who move from 2 to 1. Taking logarithms of P_{1}·P_{2}/D and adding an error term yields a linear model in logtransformed variables, log(migrants) = a_{0} + a_{1} log(ppnorig) + a_{2} log(ppndest) + a_{3} log(distance) + error, where ppnorig is the population of the origin, ppndest is the population of the destination, and the error term characterizes random deviations. Here and throughout, log refers to log_{10}, and ln refers to the natural logarithm log_{e}. Zipf posits that a_{1} = 1, a_{2} = 1, and a_{3} = −1. We estimate all coefficients from data using a generalized linear model (GLM) (14).
Zipf (12) treated cities as points of negligible spatial extent. Our communities are countries or regions and our subject is international migration. To let the data reveal whether the area of a country influences its numbers of migrants, we add two terms to the above equation: log(areaorig), the log area of the origin, and log(areadest), the log area of the destination. By definition, the population density of the origin is ppnorig/areaorig, so log(density) = log(ppnorig) − log(areaorig). If the number of migrants from origin to destination depends on ppnorig and ppndest and not on their areas, then the estimated coefficients of log(areaorig) and log(areadest) should be close to zero. However, if the number of migrants from origin to destination depends on the population density of the origin and the population density of the destination, but not on their respective population numbers per se, then the estimated coefficients of log(areaorig) and log(areadest) should be nearly the negative of the respective estimated coefficients of log(ppnorig) and log(ppndest). The estimated coefficients of the terms for log population and log area of origin and destination reveal the relative importance of population per se and population density.
To allow for differences in migratory intensities among origins or destinations, we let the data reveal whether different origins and destinations had numbers of migrants different from the numbers of migrants expected on average from their respective populations, areas, and distances. We introduced four indicator variables for each country that provided data. For example, Australia was a source of data on numbers of emigrants and immigrants by year. One indicator variable for Australia, orig.indicator$Australia, equaled 1 whenever Australia was the origin and equaled 0 if any other country or area was the origin. A second indicator variable, dest.indicator$Australia, similarly equaled 1 whenever Australia was the destination and equaled 0 if any other country or area was the destination. A third indicator variable, orig.is.datasource$Australia, equaled 1 if Australia was the origin and was the source of the data and otherwise equaled 0. A fourth indicator variable, dest.is.datasource$Australia, equaled 1 if Australia was the destination and was the source of the data and otherwise equaled 0. Similar indicator variables were defined for each country that provided data on its flows of immigrants and/or emigrants.
To allow for the possibility that migration rates changed over time and for the simultaneous effects of the other independent variables (populations, areas, distance, and indicator variables), we introduced year as an independent variable. We used the centered variable year minus 1985 to assure the stability of the model's estimated intercept.
To summarize in notation similar to that of Zipf's gravity model, let P be population, A be area, and D be distance. Detailed definitions of these variables are given in Methods. Our “starting” model was the logarithmic transformation of model Eq. 1. This logarithmic transformation guaranteed that the number of migrants estimated by the model was positive. where ε ∼ N(0,σ^{2}) are independent.
This model contains factors of the form [10 if origin is Australia; 1 otherwise]^{g}. The common logarithm of this expression is g × [1 if origin is Australia; 0 otherwise], and the expression in square brackets is an indicator (or dummy) variable. The coefficient g summarizes the nonrandom effects on the expected number of migrants from Australia (in this case) apart from those due to calendar year, the populations and areas of origin and destination, and distance. The indicator variables quantified how numbers of migrants were affected by nonrandom characteristics of individual countries: migratory history, policy, and statistical completeness; economic, geographic, and cultural affinities or disparities; and effects of geographical adjacency not captured by the chosen measure of distance.
We estimated the intercept (log constant) and exponents a, b, c, d, …, which were linear coefficients in the GLM. We fitted this and other models to data from 11 countries (Australia, Belgium, Canada, Denmark, Germany, Italy, the Netherlands, Spain, Sweden, the U.K., and the U.S.A.). These countries were selected on the basis of the quality of their data on international migrants by year and places of origin and destination from 1960 to 2004. The data included 228 origins and 195 destinations of migrants. Oceania was the only destination not also an origin, so 229 countries or regions in total were named in these data.
Because of the limited quantity and quality of migration data, this article demonstrates a method and illustrates a modeling approach, rather than specifying numerical parameters definitively. Parameters and models will evolve as more and better data become available. The analysis showed that statistically simple and demographically interpretable modeling accounted for more than half the variation in the migration data. How much more than half could be accounted for by this approach with better data remains to be determined.
Results
Descriptive Bivariate Relationships.
On average, but with enormous variability, the log number of migrants increased with increasing log population of origin (r = 0.43), increasing log area of origin (r = 0.18), increasing log population of destination (r = 0.27), and increasing log area of destination (r = 0.10) (Fig. 1 A–D). The log number of migrants decreased, on average but with enormous variability, with increasing log distance from origin to destination (r = −0.24) and increased weakly with year (r = 0.01) (Fig. 1 E and F). Log population and log area were highly correlated (Fig. 1 G and H) for origins (r = 0.74) and destinations (r = 0.64). Because these were the two highestmagnitude correlations, collinearity was not a problem in fitting the GLM. Correlations between year and log migrants, and between log(ppnorig) and log(distance), were insubstantial.
Log(ppnorig) and log(ppndest) were negatively correlated (r = −0.19) (Fig. 1I). This negative correlation could reflect the absence of data sources among countries with four million or fewer people, which may account for the absence of data points in the lower left quadrant of Fig. 1I. The long horizontal and vertical streaks in Fig. 1I represent the populations of countries that were data sources as origins and destinations, respectively, whereas the short diagonal streaks largely reflect population growth of a given (origin, destination) pair.
Model Selection.
The starting linear model was fitted [supporting information (SI) Table S1] with the dependent variable log(migrants) and with the six independent variables that we call “basic” [year minus 1985, log(ppnorig), log(areaorig), log(ppndest), log(areadest), log(distance)] and all indicator variables (orig.indicator, dest.indicator, orig.is.datasource, dest.is.datasource). The Multiple R^{2} was 0.5693 and the Adjusted R^{2} was 0.5689 (see Methods).
When the stepwise algorithm with Bayesian information criterion (15) was applied to this starting model, log(areadest) was eliminated and all other independent variables were retained in the resulting “final” model (Table 1). To the four significant digits shown, the Multiple R^{2} and the Adjusted R^{2} were unchanged between the starting and the final models.
When the values of log(migrants) were independently and randomly permuted in each of 100 simulations, the maximum of the 100 simulated multiple R^{2} values was 0.00147, far smaller than the multiple R^{2} value of 0.5693 for the data. Hence, the latter multiple R^{2} value could not have been an artifact of the fitting procedure alone.
When all indicator variables were suppressed and only the six “basic” independent variables were retained, the multiple R^{2} value dropped to 0.4345 (Table S3). The only notable change in the coefficients of the six “basic” independent variables was the increase in the coefficient of log(areadest) from 0.0239 to 0.1604. A model that omitted the indicator variables would have misleadingly suggested that log(areadest) was fairly influential. The stepwise algorithm, however, retained the indicator variables and dropped log(areadest).
Residuals.
In the final model, the interquartile range of the residuals was −0.4352 to 0.4414 log(migrants), meaning that half the time, the observed numbers of migrants fell in the interval from 36.7% to 2.763 times the predicted number of migrants (the predicted number was 10^{expected log(migrants)}). The smallest and largest residuals were −3.2449 and 3.2918, corresponding to cases where the observed number of migrants was >1,000 times smaller or larger than the predicted number.
The largest residuals occurred at intermediate fitted values from 0.5 to 3.5, corresponding to predicted numbers of migrants from ≈3 to 3,000 (Fig. S1A). The scatter of the residuals was clearly not constant over the range of fitted values. This lack of homoscedasticity justified the use of the Bayesian information criterion for model selection instead of a probabilistic interpretation of F tests for omitted variables. The validity of the latter approach assumes homoscedasticity of residuals and independence of observations.
When the fitted values were >4 on the log_{10} scale (corresponding to 10,000 migrants per year or more), residuals were systematically negative, indicating fewer reported migrants than predicted. This pattern could result, among other possible causes, from underreporting of large migrant flows or from systematic policies intended to diminish the largest predicted flows.
Model Coefficients.
In the final model (Table 1), the predicted number of migrants increased by 0.38% = 10^{0.00163} per year, in addition to the changes in numbers of migrants resulting from changes over time in log(ppnorig) and log(ppndest). The predicted number of migrants was proportional to the population of origin raised to 0.86 and to the population of destination raised to 0.36. Increases in log(ppnorig) increased log(migrants) more than twice as much as increases in the log(ppndest). In light of the small standard errors of these estimates (Table 1), these exponents very probably differed from the exponents of 1 in Zipf's (12) gravity model, even if the distributional assumptions of the linear model were not precisely met. The predicted number of migrants was proportional to the distance raised to −0.97, a bit more than three standard errors from Zipf's posited exponent of −1. The predicted number of migrants was proportional to the area of origin to the power −0.21. Because P^{0.86}A^{−0.21} = P^{0.65}(P/A)^{0.21}, the number of migrants increased with both the population of origin (to the power of 0.65) and the population density of origin (to the power 0.21), and the population of origin contributed more to the number of migrants than did the population density of origin.
The indicator variables revealed substantial heterogeneity among countries in their propensity to send or receive migrants and in their reporting practices, given the other independent variables. According to its coefficient for orig.indicator, Australia had 13.47 = 10^{1.1295} times as many emigrants as expected. At the opposite extreme, Belgium had 56% as many emigrants as expected from its other characteristics. Denmark, Germany, and the Netherlands had approximately as many emigrants as expected. According to its coefficient for dest.indicator, Australia had 27.31 times as many immigrants as expected from its geographic and demographic characteristics, followed by the U.S.A. and Canada with 13.95 and 7.17 times as many immigrants as expected, respectively. At the opposite extreme, Belgium had 1.35 times as many immigrants as expected, the smallest multiple among the countries that provided data in this study. According to dest.indicator, all reporting countries had more immigrants than expected on average. The countries that provided data for this study are countries with the resources to support effective statistical systems, and such countries are likely to be attractive destinations of migration.
According to its coefficient for orig.is.datasource, when Australia reported the number of emigrants from Australia, it reported on average 50% = 10^{−0.3038} of the number of emigrants from Australia that the average country that reported emigration data in this study would have reported, given Australia's population and area, the destination and year, and the propensity to emigrate from Australia estimated without regard to the source of the data. For example, Australia reported 3,971 migrants from Australia to the U.K. in 1998, whereas the U.K. reported 41,800 migrants from Australia to the U.K. in 1998 (see Methods). Australia reported emigrants leaving permanently. The U.K. reported immigrants staying for 1 year or longer.
At the opposite extreme from low reporters of emigration like Australia (50%) and Italy (33%), the U.K. reported 22.54 times as many emigrants as would be expected otherwise, whereas Germany reported 3.07 times as many. The U.K. data are estimates derived mainly from a survey of arriving and departing international passengers; U.K.'s recording of emigration and immigration flows is equally complete. Australia, by contrast, focuses on entries. These differences between countries are detected by the statistical analysis and reflected in the estimated coefficients of the model.
According to its coefficient for dest.is.datasource, when the U.K. reported the number of immigrants to the U.K., its reports were on average 31.47 times the number of immigrants that would have been reported by the average reporting system over all countries in the study, given the U.K.'s population and area, the destination and year, and the propensity to immigrate to the U.K. estimated without regard to the source of the data. At the opposite extreme, Spain and Italy reported 59% and 58%, respectively, of the numbers of immigrants expected from other factors.
In general, immigration is better recorded at destinations than emigration is recorded at origins, in part because migrants often have more formal incentives to register at their destination than to deregister at their origin. The estimated coefficients of the indicator variables (Table 1) are consistent with this belief, although other interpretations are possible. For example, dest.indicator for Australia had coefficient 1.4362, greater than the coefficient 1.1295 of orig.indicator for Australia. The same inequality, dest.indicator coefficient > orig.indicator coefficient, held for all eight countries that supplied both immigration and emigration data. Similarly, the comparable inequality, dest.is.datasource coefficient > orig.is.datasource coefficient, held for all eight countries that supplied both immigration and emigration data. These inequalities indicated greater detection (or greater intensity, an alternative interpretation) of immigration than emigration in every country for which the comparison could be made.
Parameter Stability: How Much of the Past Is Relevant to the Future?
Estimated coefficients varied systematically as a function of the time interval from which data were drawn and as a function of the subset of variables selected from the starting model Eq. 1 (Table S3). The starting model rather than the final model was used for this analysis to allow for the possibility that the log area of destination might become dramatically important for some subset of 1960–2004. As it turned out, this possibility did not occur.
For each set of independent variables, the most recent five years (2000–2004) gave the highest multiple R^{2}. Using all variables, all data from 1960 to 2004 gave a multiple R^{2} value of 0.57, whereas the 2000–2004 data gave a multiple R^{2} value of 0.64 (Table S3). This higher value of the multiple R^{2} may be partially due to improved quality of data in more recent years but may also be due to fewer external perturbations to migratory flows during 2000–2004 than during 1960–2004. For example, during the 45year period of 1960–2004, the Berlin wall and the Soviet Union fell, and Germany was reunified, whereas no such events marked the 5year period of 2000–2004, an interval only oneninth as long. For each set of independent variables, as the initial year moved forward while the terminal year was 2004, the multiple R^{2} value increased. Each set of independent variables accounted for more variation in log(migrants) when using the data from 1985 to 2004 than by adding additional data from earlier years.
All estimated coefficients were stabler when using a moving initial year than when using a moving terminal year. They were least stable when using moving tranches of 5 or 10 years' duration (Table S3). For the starting model with all variables, the standard deviation 0.3877 of the estimated intercept for the five time intervals with moving terminal year 1960–1984, 1960–1989, 1960–1994, 1960–1999, and 1960–2004 was larger than the standard deviation 0.2489 of the estimated intercept for the five time intervals with moving initial year 1965–2004, 1970–2004, 1975–2004, 1980–2004, and 1985–2004. The same inequality held for the standard deviations of the estimates of the coefficients of all six basic variables when the five time intervals with moving terminal year were compared with the five time intervals with moving initial year.
Discussion
We assembled annual data on immigrants and emigrants from 11 countries' sources and combined them with data on the populations and areas of 228 origins and 195 destinations and the distances between origins and destinations. These 43,653 reports did not suffice to cover the 228 × 195 = 44,460 possible origindestination pairs in a single year and offered very sparse coverage over 45 years. A simple GLM was able to account for more than half the variation in log(migrants). Despite the present limitations of data, this approach may improve on existing demographic procedures for projecting international migration and may motivate the collection of better data.
The bivariate relations among variables demonstrated the need for a multivariate model. For example, the number of migrants increased with the population of origin and the population of destination, but the population of origin and the population of destination were negatively correlated. Only a multivariate model could reveal how the number of migrants depended on the populations of origin and of destination jointly.
In the final GLM, the coefficients of log(ppnorig) and log(ppndest) were both positive and <1. If either coefficient had been negative, then the estimated number of migrants could have diverged to infinity as the population of origin or destination became smaller. Because the coefficients were <1, the numbers of migrants did not increase in proportion to ppnorig or ppndest.
The GLM served two distinct purposes: understanding and prediction. For scientific understanding, we eliminated superfluous independent variables to obtain the most economical model, then interpreted the values of the coefficients in the model. For prediction, we sought as much predictive power as possible by adding variables that gave the highest coefficient of determination, provided that the parameter estimates did not become unstable (sensitive to the inclusion or exclusion of a small number of data points and/or other predictor variables). The starting and final models considered here balanced interpretability and predictive ability. Other models are discussed in SI Text (Table S2).
The principal problems with this method were the lack of data and the lack of comparability (discrepancies between countries in definitions and measurements) where the data existed (Methods). Most countries lack a system to record migration flows. Many do not publish their information on migration. The data did not consistently distinguish moves from movers (6); individuals who crossed borders multiple times may have been interpreted as multiple migrants.
Comparisons with Some Related Studies.
The 2003 Technical Panel on Assumptions and Methods of the Social Security Administration (16) suggested assuming that the number of net migrants to the U.S. will grow, at least in the long run, in direct proportion to the size of the U.S.A. population. The coefficients in the final model (Table 1) suggested, by contrast, that immigration to the United States is expected to grow in proportion to the population of the United States raised to the power 0.36, times independent multiplicative effects of the calendar year. The final model also anticipates changing log(migrants) as a result of population growth in countries of origin. Likewise, according to the final model, emigration from the United States should be expected to grow in proportion to the population of the United States raised to the power 0.86, times multiplicative effects that depend on year and countryofdestination populations.
If countries' populations changed by a factor of λ > 0, holding constant all other variables and GLM coefficients, the number of migrants would be multiplied by a factor of λ^{a+b} because (λ·ppnorig)^{a}(λ·ppndest)^{b} = λ^{a+b}·ppnorig^{a}·ppndest^{b}, where a + b is the sum of the coefficients of log(ppnorig) and log(ppndest). In the final model (Table 1), a + b = 1.22; so if ppnorig and ppndest both doubled, the predicted number of migrants from origin to destination would increase by a factor of 2^{1.22} = 2.34. For moving time intervals in the starting model with all variables (Table S3), a + b tended to increase with the moving initial year or moving final year or tranche. For example, for data in the time intervals 1980–2004 and 1985–2004, a + b = 1.26 and 1.30, respectively.
Bijak et al. (17) projected migratory flows among 27 European countries by multiplying the initial emigration rates by an overall trend (mobility increasing by 0.5% yearly) and temporal effects of labor market policies. For comparison, our final model estimated that the global number of migrants rose by 10^{0.00163} = 1.0038 = 0.4% per year in the data of 1960–2004, apart from the multiplicative effects of population growth or decline and the indicator variables of the countries or regions of origin and destination. This agreement in estimates is remarkable considering the difference in methods, data, and context (Europe versus the globe). Raymer (18) reconstructed the migratory flows for the European Union using a different loglinear model with multiplicative components.
Some migration models are based on transition probability matrices (6). In such models, the number of immigrants is independent of the destination's population and proportional to a weighted sum of the populations of origins or of the fractions of global population in different origins. The GLMs estimated here suggest that each destination's number of immigrants is a nonlinear function of the populations of both origin and destination and of other variables.
Use in Population Projection.
The projected number of migrants can be embedded in a population projection algorithm, initially ignoring age structure and then incorporating it. The initial goal is, given a vector of country population sizes P(t) with elements P(i,t) for country i at time t, to compute the population vector P(t+1) at the next time step. The GLM can estimate the number of migrants M(i,j,t) from country i to country j between t and t+1. Let M(i,i,t) = 0 for all i (despite some countries' reporting positive numbers of migrants from the country to itself). The matrix M(t) with elements M(i,j,t) is called the migration matrix at time t. It will be assumed that M(i,j,t) obtained from the GLM approximates the number of people who were in country i at time t and in country j (≠i) at time t + 1.
The number of emigrants from country i between t and t+1, denoted E(i,t), is then E(i,t) = Σ_{j}M(i,j,t). The vector E(t) with elements E(i,t) is called the emigration vector at time t. The number of immigrants to country i between t and t + 1, denoted I(i,t), is I(i,t) = Σ_{h}M(h,i,t). The vector I(t) with elements I(i,t) is called the immigration vector at time t. The number of net migrants to country i between t and t + 1, denoted N(i,t), is N(i,t) = I(i,t) − E(i,t). The vector N(t) with elements N(i,t) will be called the net migration vector at time t. Then populations of countries or regions can be projected as where B(i,t) is the number of births and D(i,t) is the number of deaths projected for country i from separate models of fertility and mortality.
It has been assumed thus far that the elements of the migration matrix would be the predicted values 10^{expected log(migrants)} produced by the GLM, yielding a deterministic population projection. A stochastic (Monte Carlo) method would be to sample numerically from the distribution of residuals for given values of the independent variables. Then the number of migrants from an origin to a destination would become a random variable, and a population projection that incorporated the migration matrix would become a probabilistic ensemble of projections.
The migration matrix at time t depends on the indicator variables, which are estimated from data up to and including t. Given P(t+1) from Eq. 2, the simplest approach to computing the migration matrix at time t + 1 would be to keep the coefficients of the indicator variables constant and to update only the population vector P(t + 1). A more sophisticated approach would be to examine trends over time in the coefficients of the indicator variables and to extrapolate those trends forward to obtain updated coefficients for the indicator variables of future migration matrices.
To obtain an agespecific net migration vector, one could apply agespecific models of migration patterns for different countries and regions to E(i,t) and to I(i,t) (19–21) and combine that with the projected agespecific fertility and mortality vectors as in Eq. 2.
This method of projecting international migration in combination with cohortcomponent methods of projecting fertility and mortality solves the two problems identified by the United Nations Population Division (1). First, the global sum of net migrants is guaranteed to be 0 when the numbers of emigrants, immigrants, and net migrants are derived from a migration matrix (22). Second, net emigration should not completely deplete the population of any sending country for realistic model parameters. In all models considered here, the exponents of the populations of origin and destination are positive and the predicted number of emigrants is a small fraction of the population of origin. Consequently, the projected number of emigrants from an origin declines to 0 as its population declines to 0. Likewise, if the destination's population declines, the models predict that fewer people will migrate there. This feature of the model depends on realistic parameter values and may not hold in all mathematically possible cases. If the coefficients of the model were unrealistic and gave an unrealistically large number of emigrants, then it is possible that the origin population could be depleted.
Open Research Questions.
Many open questions remain. Why did the area of the destination influence the numbers of migrants much less than the area of origin? How well would the proposed models work for migration within a country, taking account of international migration? For a given origindestination pair, were the residuals correlated over time or independent as the error term in the model assumed?
International migrants are mainly younger individuals of working age and their families. Migratory flows of elderly individuals may also be important. Would the fit of the models be improved by replacing total population size with a weighted average that emphasized age groups most prone to migrate, or by a simple index such as the proportion of the population age 20–34 years? Agestructure seems likely to matter to migration increasingly as all countries undergo population aging (23).
What is the longrun behavior of the projection model Eq. 2 assuming constant birth rates and death rates and constant coefficients in Eq. 1? For example, when, if ever, does there exist a fixed vector P of population size by country and a stable growth rate ρ such that lim_{t → ∞} P(t)/ρ^{t} = P? If this case arises, how does ρ depend on the parameters of the basic model Eq. 2? What irreducibility conditions on the migration matrix assure the uniqueness of P? When the migration matrix is reducible, could different “ergodic sets” of countries (sets of countries linked through migration flows) have different fixed vectors P of population size by country and different stable growth rates ρ? Under what conditions on the coefficients of log(ppnorig) and log(ppndest) can country populations snowball to infinity (in finite time or with infinite time) or vanish? How sensitive to the initial conditions P(i,0) are each country's proportion of world population P(i,t)/Σ_{h}/P(h,t) for large t, where the summation runs over all countries? How sensitive are countryspecific population growth rates P(i,t+1)/P(i,t), for large t, to initial conditions? In short, what ergodic theorems hold (24)?
Methods
Data.
All 43,653 data records are provided in Dataset S1. Each record contains 12 variables: a unique serial number, the year in the Western calendar (1960–2004), the name of the country or region of origin of migrants, the log population of the origin in that year, the name of the country or region of destination of migrants, the log population of the destination in that year, the log number of migrants from origin to destination in that year, the log area (square kilometers) of the origin, the log area (square kilometers) of the destination, the log great circle distance (kilometers) from the capital of the origin to the capital of the destination, the source of the migration data, and “neighbor” (see SI Text). Records for which the value of any variable was missing were excluded.
Each country's definitions of what constituted a migrant, of the origin or destination of a migrant, and of the accounting year were used (Table S4). Differences among definitions and in the effectiveness of collecting migration data led to hundreds of discrepancies when both the origin and the destination reported migration data in the same year. In SI Text, sources are listed and methods of collecting migration data are discussed.
Data Analysis.
A GLM was fitted to a starting model with dependent variable log(migrants) and with all six basic independent variables [year minus 1985, log(ppnorig), log(areaorig), log(ppndest), log(areadest), and log(distance)] and all indicator variables (orig.indicator, dest.indicator, orig.is.datasource, dest.is.datasource). The stepwise regression algorithm stepAIC was applied to this linear model to obtain a final model. Details of data management and statistical software are in SI Text.
Acknowledgments
Earlier versions of this article were presented to the Technical Working Group on LongRange Population Projections, Population Division, United Nations Headquarters, on June 30, 2003; the United Nations Population Division Seminar on October 21, 2004; the Stanford Workshop on Formal and Quantitative Demography on August 15, 2005; the Policy Research Division of the Population Council on January 23, 2006; and the Estimates and Projection Section of the Population Division of the United Nations on June 6, 2008. We thank for helpful comments the many participants in those presentations and Adam E. Cohen, Jakub Bijak, Joshua R. Goldstein, Ryuichi Kaneko, Ronald D. Lee, Jim Oeppen, and Hania Zlotnik. We thank Mr. and Mrs. William T. Golden and family for their hospitality during this work. Priscilla K. Rogerson assisted in preparing data and the manuscript. J. Bijak and R. Kaneko helpfully reviewed two versions of the article. This work was supported by U.S. National Science Foundation Grants DEB9981552 and DMS0443803.
Footnotes
 ^{†}To whom correspondence should be addressed. Email: cohen{at}rockefeller.edu

Author contributions: J.E.C. designed research; J.E.C. performed research; J.E.C., D.C.R., and C.G. contributed new reagents/analytic tools; J.E.C. and M.R. analyzed data; and J.E.C., M.R., and D.C.R. wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0808185105/DCSupplemental.
 © 2008 by The National Academy of Sciences of the USA
References
 ↵
 United Nations Population Division
 ↵
 Fertig M,
 Schmidt CM
 Djajic S
 ↵
 Howe N,
 Jackson R
 ↵
 Bijak J
 ↵
 Raymer J,
 Willekens F
 ↵
 ↵
 ↵
 Massey DS,
 et al.
 ↵
 Faist T
 ↵
 ↵
 ↵
 ↵
 Zipf GK
 ↵
 McCullagh P,
 Nelder JA
 ↵
 ↵
 Technical Panel on Assumptions and Methods
 ↵
 ↵
 Raymer J
 Raymer J,
 Willekens F
 ↵
 Rogers A,
 Castro LJ
 ↵
 Rogers A,
 Willekens F,
 Raymer J
 ↵
 ↵
 ↵
 Raymer J,
 Rogers A
 Raymer J,
 Willekens F
 ↵
Citation Manager Formats
Sign up for Article Alerts
Jump to section
You May Also be Interested in
More Articles of This Classification
Related Content
Cited by...
 Matchmaker, Matchmaker, Make Me a Match: Migration of Populations via Marriages in the Past
 Tracking job and housing dynamics with smartcard data
 Probabilistic population projections with migration uncertainty
 Bayesian probabilistic population projections for all countries
 The Outlook for Population Growth