Probabilistic projection of subnational total fertility rates

BACKGROUND We consider the problem of probabilistic projection of the total fertility rate (TFR) for subnational regions. OBJECTIVE We seek a method that is consistent with the UN’s recently adopted Bayesian method for probabilistic TFR projections for all countries and works well for all countries. METHODS We assess various possible methods using subnational TFR data for 47 countries. RESULTS We find that the method that performs best in terms of out-of-sample predictive performance and also in terms of reproducing the within-country correlation in TFR is a method that scales each national trajectory from the national predictive posterior distribution by a region-specific scale factor that is allowed to vary slowly over time. CONCLUSIONS Probabilistic projections of TFR for subnational units are best produced by scaling the national projection by a slowly time-varying region-specific scale factor. This supports the hypothesis of Watkins (1990, 1991) that within-country TFR converges over time in response to country-specific factors, and thus extends the Watkins hypothesis to the last 50 years and to a much wider range of countries around the world. CONTRIBUTION We have developed a new method for probabilistic projection of subnational TFR that works well and outperforms other methods. This also sheds light on the extent to which within-country TFR converges over time.


Introduction
Goli, 2012; C. Wilson et al., 2012), while the evidence is more equivocal for the United States (O'Connell, 1981). Watkins posited that this was due to increased integration of national markets, expansion of the role of the state, and nation-building in the form of linguistic standardization over this period. Calhoun (1993) argued that, of these three mechanisms, only linguistic standardization clearly supported her argument. However, some support for the importance of the role of the nation state for fertility is provided by the fact that nation states have specific and different policies aimed at affecting fertility rates (Tomlinson, 1985;Chamie, 1994), and some of these policies have been shown to be effective (Kalwij, 2010;Luci-Greulich & Thévenon, 2013). Note that Klüsener et al. (2013) investigated subnational convergence of non-marital fertility in Europe in recent decades, and found that withincountry variation increased, in contrast with the trends noted by other authors for overall fertility. Here we consider only overall fertility.
One question is then whether the direct extension of the UN method for countries to the subnational context adequately accounts for this tendency of TFR to converge within countries over time. Note that this extension of the UN method does predict within-country convergence of fertility rates over time during the fertility transition; the question is whether it adequately accounts for this convergence.
To investigate this question, we consider a different general approach, which starts from the national probabilistic projections produced by the UN method, and then scales them for each region by a scaling factor that varies stochastically, but stays relatively constant. This induces more within-country correlation than the direct extension of the UN method. It could be viewed as a probabilistic extension of the method currently used by the U.S. Census Bureau. It is also related to the method of T. Wilson (2013), but with some significant differences. We apply these methods to subnational data on total fertility for 47 countries over the period 1950-2010. We compare our two approaches and several variants in terms of out-ofsample predictive performance. The results shed some light on the Watkins hypothesis of increasing within-country correlation, as well providing some guidance on how to carry out subnational probabilistic TFR projection.
Note that there is a substantial literature on convergence of fertility rates in different countries to one another, with different conclusions argued for (C. Wilson, 2001Wilson, , 2004Reher, 2004Reher, , 2007Dorius, 2008;C. Wilson, 2011). Our work here has implications for withincountry fertility convergence, but is agnostic about fertility convergence between countries, and so does not have implications for global fertility convergence, for example.
The paper is organized as follows. We first describe the data used in this study and review the model for national probabilistic projections. We then introduce our proposed  methodology for subnational probabilistic projections, and present the results. The paper concludes with a discussion.

Data
We use subnational data on the TFR for 47 countries (13 in the Americas, nine in the Asia-Pacific region, and 25 in Europe), corresponding to 1,092 regions for the period 1950-2010, collected by the United Nations Population Division. Each country analyzed had a population over one million and a national average TFR below 2.5 in 2010-2015. The geographical level selected for each country was the one with available data for the longest comparable time series. The dataset covers 4.9 billion people. Fig. 1 shows the numbers of regions for each country, which range from two for Slovenia to 96 for France. The data include countries from all the inhabited continents except Africa. The data sources are shown in Appendix Table 5. It illustrates that the data vary with respect to the correlation between regions. It also shows that the data started later than 1950 for some regions. In the figure, the national TFR from United Nations (2013) is shown as a black curve.

Denmark
Time TFR q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Pre−transition phase I (not modeled) Transition phase II Post−transition phase III Cartoon of a double logistic decline curve for country c with its parameters defining the shape. f c,t on the x axis denotes the TFR; g(θ c , f c,t ) on the y axis denotes the first order difference in TFR.

Review of the national Bayesian hierarchical model
Our starting point for developing a methodology for subnational projections is the probabilistic model for projecting national TFR proposed by Alkema et al. (2011), which has now been adopted by the UN for its official projections. We start by summarizing the main ideas of this Bayesian hierarchical model (BHM). More detail can be found in Alkema et al. (2011) and Raftery et al. (2014). To model the fertility declines in each five-year period during Phase II, a double logistic decline function is used. An example of this function is shown in the right panel of Fig. 3.
The function is parametrized by a set of country-specific parameters that define the shape of the country's decline curve. Those parameters are drawn from a world distribution. The resulting BHM is estimated using Markov chain Monte Carlo (MCMC).
Phase III is modeled using a Bayesian hierarchichal first-order autoregressive, or AR(1), process of the form: It implies that fertility for country c has a country-specific long-term mean, µ c , and autoregressive parameter, ρ c , which are assumed to be drawn from a world distribution. The parameters of this world distribution in turn have a joint prior distribution, thus defining a three-level hierarchical model, where the three levels are the observation, the country and the world. The resulting model is again estimated by MCMC.
The process of estimating Phase II and Phase III parameters results in a set of countryspecific decline curves and a set of country-specific AR(1) parameter pairs. Unlike decline curves which can be estimated for all countries, not all countries have experienced Phase III, in which cases the country-specific long-term means and autoregressive parameters cannot be estimated. In such cases, the "world" means and autoregressive parameters are used. The estimated parameters are then used to generate a set of future TFR trajectories yielding probabilistic TFR projections for all countries of the world.

Methods for subnational projections
Ideally, we seek a method for generating probabilistic subnational TFR projections that reflects the literature and theory of fertility transitions, is based on the national methodology used by the UN and described above, works well for all countries, is as simple if possible, and yields correlations between regions that are similar to the correlations in the observed data.
We first describe a simple Scale method that provides an initial probabilistic extension of methods used by the U.S. Census Bureau and other national agencies. This simple approach works well from many points of view, but it does not allow for the possibility of crossovers between regions, whereas in fact these do happen. We therefore elaborate this model to allow the scale factor to change stochastically, but slowly over time, yielding the so-called Scale-AR(1) method. Finally we describe a quite different approach, called the one-directional BHM, which directly generalizes the national approach to the subnational context, allowing regions to vary more freely within a country.

Scale Method
We start with a simple intuitive scale method where, for each trajectory from the probabilistic projection, the regional TFR is simply a product of the simulated national TFR and a timeindependent but region-specific scale factor.
Let f c,t,i denote the national TFR projection for country c at time t from trajectory i, simulated from its posterior distribution as described above. We model f rc,t,i , the TFR for region r c of country c at time t in the i-th trajectory, by where α rc denotes the regional scaling factor derived from the last observed (present) time period denoted by P : Note that α rc is the same for all trajectories. This method yields a set of regional trajectories f rc,t,i and thus yields probabilistic projections of the regional TFRs, f rc,t . Our numerical experiments, described below, indicated that this simple method performed surprisingly well. However, it also has a serious drawback. Scaling by a constant factor yields a perfect correlation, i.e. it does not allow for the possibility of crossovers between regions over time. However, such crossovers do happen, and the scale method says that they are impossible, which is not fully satisfactory.

Scale-AR(1)
To avoid this drawback, and modify the scale method so as to allow for the possibility of crossovers, we propose a variation of the simple Scale method where we model the regional scale factor using a first-order autoregressive, or AR(1), process: The regional TFR f rc,t,i is then derived as in (1) with the additional lower bound restriction, f rc,t,i > 0.5.
This model implies that the scaling factor will fluctuate around one in the long term.
Regardless of its initial value, it will converge to a distribution that is centered around one, and the rate of convergence is determined by the φ parameter. We use the following settings for the model parameters, estimated from the data for all 47 countries available: where P again denotes the present time period and R c denotes the set of regions of country c.
The minimum restriction in (5) ensures that the variation of α ·,t across regions is not larger than the variation in the last observed time period, in line with the Watkins hypothesis and the long-term observed data. Details of how these parameters were estimated are given in the appendix.
This method is related to the method proposed by T. Wilson (2013), but there are some significant differences that are discussed in the Discussion section.

One-directional BHM
Next, we consider an extension of the world three-level BHM which is depicted in Fig. 4. In the world version, information from all countries is combined into the world level, which in turn influences the country level, yielding a two-directional BHM. The prior distribution of the hyperparameters is vague for most parameters, but also reflects expert knowledge in some cases. The model yields a posterior distribution of the world parameters, and the country-specific parameters, which is then used to generate the national projections.
Our extension has a similar setup, but moves down by one level of geography and works in one direction only. Thus the top level of our national model is the country, the next level is the region, and the bottom level is the time point. The upper level of our model corresponds to the country level of the world model, that is, we carry over the country-specific posterior from a world simulation and use it as the distribution of the hyperparameters in our national model (red arrow in Fig. 4). On the lower level, data from all regions of a country are handled individually. The estimation of the regional parameters is informed by the hyperparameters, but the regional level does not influence the country level of the model. The resulting regional posterior distribution is used to project subnational TFR.
Note that many countries do not have historical data on Phase III because they have not yet reached this stage, and so in these cases the country posterior is the same as the world posterior. As a result, all regions of those countries inherit the "world" Phase III parameters.

Correlation between regions
For aggregating TFR over sets of regions, for example for deriving country's averages, it is important to capture correlation in model errors between regions, as was done by Fosdick & Raftery (2014) for capturing correlation between countries.
We will model the forecast errors as follows: where σ t is a vector consisting of the forecast standard deviations for each region. For Phase II this is the standard deviation of the errors in the double logistic model and for Phase III it is the standard deviation of the AR(1) model. In (7), A is a matrix where each element A r,s corresponds to the correlation between the model errors of region r and s over all time periods.
Let f r,t denote the observed TFR for region r at time t. We denote by e r,t the normalized forecast error, namely the forecast error divided by its standard deviation. The normalized forecast error e r,t is estimated as follows: • Phase II: For each value g r,t,i of a double logistic (DL) trajectory i and the standard deviation of DL σ r,i take d r,t,i = (f r,t − g r,t,i )/σ ri . Then e r,t is the mean of d r,t,i over i.
• Phase III: For each value h r,t,i of a phase III trajectory i and the standard deviation of these trajectories σ ε,r,i , take the difference d r,t,i = (f r,t − h r,t,i )/σ ε,r,i . Then, e r,t is the mean of d r,t,i over i.
We define the correlation matrix A as whereÃ is a truncated correlation matrix made positive definite, andT is the average number of time periods per region. Here A has an approximate Bayesian interpretation as an approximation of the posterior mean with a uniform U [0, 1] on the correlations. Note that A is positive definite. The appendix contains details of the method, as well as other methods for deriving A that we have experimented with.

Results
We now compare results from the three methods described in the previous section. All three methods depend on a national BHM simulation. We used a simulation that was used to produce the official UN TFR projections in WPP2012. Our version has 2,000 TFR trajectories for each country and was produced using the bayesTFR R package (Ševčíková et al., 2011).
For the Scale-AR(1) method, for each region r c we set the initial scaling factor to α rc,P = f rc,P /f c,P with P being the last observed time period. Then we produced projections of α rc,t for t > P using (3). Finally, (1) was applied, as in the case of the simple Scale method, using each of the 2,000 TFR trajectories for country c as f c,t,i . This yielded 2,000 regional TFR trajectories.
For the one-directional BHM (1d-BHM), we ran the regional BHM while using the country posterior from the national BHM simulation. Then we projected 2,000 regional TFR trajectories using a sample of the regional posterior parameters. We explored two versions of this model, one that accounts for correlation between regions' error terms and one that does not, the latter denoted by "1d-BHM (indep)".

TFR projections
We are interested in the marginal predictive distribution of future TFR for each region.
We are also interested in how reasonable the joint distributions of the trajectories between regions are. The bottom two panels of Fig. 5 show results from the one-directional BHM method. In the middle panel we accounted for correlation between regions, whereas in the bottom panel the regions' error terms were considered independent. As can be seen, this method does not yield trajectories that closely parallel the national one. Furthermore, if correlation is not taken into account, there are many more crossovers between regions than are typically seen in the past data.
All the 47 countries in our dataset show the same pattern in terms of the differences between the methods. In Fig. 6 we selected three countries for which one trajectory obtained via the Scale-AR(1) method is shown for each region (as in the top panel of Fig. 5). As in the case of Sweden, the trajectories are highly correlated and closely follow the national trajectory.
In Fig. 7 we show the predictive median and 80% prediction interval (red) for three regions of India from the Scale-AR(1) method (the corresponding national projection is shown in grey). They represent three different types of regions found across all countries. The first type (in the left panel, Assam, India) is a region with a current TFR that is very close to the national TFR. In such a case, the regional projection mostly overlaps with the national projection, with a slightly larger prediction interval. The black dotted line in the figure shows the median projection resulting from the simple Scale method. This would also be very close to the national median for regions of this type.
Uttar Pradesh in the center is a type of region where current TFR is substantially higher than the national TFR. The underlying AR(1) process causes the median projection of such a region to converge to the national median in the long term, thus decreasing the gap between them. If simple scaling were applied, that gap would remain constant, resulting in much higher projections of TFR for the region. Finally, Goa on the right, with its current TFR well below the national one, is projected to increase on average, again yielding a smaller gap between the national and regional medians.
Here simple scaling results in much lower projections.

Out-of-sample predictive validation
We validated our methodology via predictive out-of-sample experiments, one for predicting the period 1995-2010, and one for predicting the period 1990-2010. We first assessed the various methods in terms of average predictive performance over all regions of the 47 countries. To assess their performance for predicting aggregates (and hence, for example, in capturing the between-region correlations), we further assess the predictions of the average TFR across the regions of each country.
For both time periods considered, we removed the data points that corresponded to the time period to be predicted, reestimated the models, generated probabilistic projections with the various methods, and compared the projections with the observed data points.
The results are shown in Table 1 for 1995-2010 and Table 2  For comparison purposes, we also added the Persistence method, in which the TFR stays at the same level over time and so the forecast for all future time periods is equal to the last observed value. While this could be viewed as a straw man forecast, persistence forecasts have been found to perform surprisingly well in many forecasting contexts, and so it is worth making this comparison.
In the tables, the mean absolute error (MAE), the bias and the continuous ranked probability score (CRPS) (Hersbach, 2000;Gneiting & Raftery, 2007) are reported. The coverages of the 80% and 95% intervals are also reported. The coverage of a prediction interval is defined as the proportion of the time that the truth lies in the interval. We wish the coverage to be close to the nominal level. Thus, for example, ideally the coverage of the 80% interval would be close to 80%.
The appendix gives details of the derivation of these metrics. For MAE and bias, the smaller the absolute value the better. For the two coverage columns an ideal method would match the numbers to the corresponding percentage. The CRPS is a combination of an   error-based and a variation-based measure, and thus we give it a high weight when selecting the best method. In this case, a better method corresponds to a larger value of CRPS.
For the marginal TFR, the Scale-AR(1) method performs best in terms of CRPS, MAE and coverage. The simple Scale method comes in second. However, we would not recommend using the simple Scale method because it produces trajectories that are unrealistic in that they do not allow the possibility of crossovers between regions, as mentioned previously.
Note that by design, the Scale-AR(1) method yields larger uncertainty than the simple Scale method, which in this case translates to a better coverage and CRPS. The Scale method includes only the uncertainty from the national BHM model, whereas the Scale-AR(1) method has in addition the uncertainty included in the AR(1) process. There is essentially no difference between the 1d-BHM with and without correlation for the marginal TFR. This is expected, as the correlation plays a role only in aggregated indicators.
For the average TFR, the Scale-AR(1) and 1d-BHM have similar performance in terms of CRPS (one is better in Table 1, the other in Table 2). However, Scale-AR(1) has consistently better coverage. Here we see a big difference in coverage between the two versions of 1d-BHM, which does not have good performance if correlation between regions is not taken into account. The good performance of the Scale-AR(1) method suggests that it is accounting adequately for between-region spatial correlation.

Discussion
We have developed several methods for subnational probabilistic projection of TFR, and applied them to data from 47 very diverse countries. All the methods take the national projections from the UN method as their starting point. We found that all the methods we propose performed well in terms of out-of-sample predictive performance, and outperformed a simple baseline persistence method.
In the best method, the national trajectories are scaled by a region-specific scaling factor which itself is allowed to vary stochastically but slowly over time. One competing method treats the regions in the same way as countries are treated in the UN's BHM, but this does not yield enough within-country correlation. Even when we introduce between-region correlation into this model, it still does not have enough within-country correlation overall.
We have compared several different methods, but there are still others in the literature. Rayer et al. (2009) considered ex-post assessment of predictive uncertainty for U.S. counties, extending the national ex-post approach of Keyfitz (1981) and Stoto (1983) to the subnational context. Raymer et al. (2012) used a vector autoregressive model for crude birth rates in three regions of England. While these methods may work well for developed countries that have had low fertility for an extended period, they do not capture the systematic variation in fertility decline rates among higher-fertility countries documented by Alkema et al. (2011), and so they may not be so appropriate for our goal here, of developing a method applicable to countries at all levels of the fertility transition.
The extant method closest to our preferred Scale-AR(1) method is one proposed by T. Wilson (2013), who also proposed scaling a national TFR forecast by a region-specific scale factor that varies according to an AR(1) model, and applied it to Sidney, Australia. However, there are several differences between the Scale-AR(1) method we propose here, and Wilson's approach for TFR. The national TFR forecast used by Wilson is based on an AR(1) process centered around an externally-specified main forecast. As discussed, this may not carry over well to higher-fertility countries. Our method, in contrast, is centered around the probabilistic forecast from the UN's BHM, which is designed to work well for countries at all fertility levels and includes uncertainty about national projections. Also, in our method the model is statistically estimated, while in Wilson's approach the parameters are adjusted manually.
Our preferred Scale-AR(1) method does not incorporate spatially-indexed between-region correlation. Instead, spatial correlation is modeled by a strong country effect. Our 1-d BHM method did incorporate spatial correlation in the variant that includes between-region correlation estimated from the data (especially methods 8-11 described in the Appendix section on estimating the error correlations). However, this did not allow us to include enough between-region correlation. This may be because within-country correlation seems to be dominated by a strong country effect rather than spatially indexed correlation, as can be seen for example for Sweden in Fig. 5. This is also shown by the good calibration of the Scale-AR(1). Thus we feel it is likely that adding additional spatial correlation would not substantially improve fit of the model to the data at hand.
In addition to providing guidance for subnational projections, our results give insight into how subnational fertility evolves in a modern context. They suggest that there is substantial within-country correlation and convergence. This confirms the observations and hypotheses of Watkins (1990Watkins ( , 1991 for Europe to 1960. It further extends them from just Europe to a range of countries from around the world, and indicates that, broadly speaking, similar patterns continue to hold a half-century later. Here we give details on estimating parameters of the Scale-AR(1) model. The model is based on an AR(1) process for region-specific scale factors α rc,t centered at one, namely We impose the restriction that the scale factors not diverge indefinitely over time. We implement this by requiring that σ 2 c is such that where P denotes the present time period and R c denotes the set of regions in country c.

This yields
We are interested in estimating the country-and region-independent parameters φ and σ. We know from the observed data that the standard deviation of α rc,t declines as TFR declines, which is also in line with the theoretical expectations of Watkins (1990Watkins ( , 1991. Thus we need to find asymptotic values for those parameters.
Substituting the values from Equations (19) and (20) for Var(α rc,t ) and Var(∆α rc,t ) into Equations (15) and (16) gives These are the values we use for our projections.

A.2 Estimating the Error Correlations
The model errors are defined by We have experimented with eleven different ways of estimating the correlations of the errors, which are the elements A rs of the matrix A. Let e r,t denote the model error of region r at time t. LetÃ denote a matrix where each elementÃ rs is the empirical correlation between e r· and e s· over all time periods t, namelỹ A r,s = T e r,t e s,t T e 2 r,t · T e 2 s,t .
We considered the following methods for estimating the matrix A. In the first seven methods, all the within-country correlations are taken to be equal. The estimator of A is denoted byÂ. In all cases,Â r,r = 1 for all r, so in what follows,Â r,s refers to the cases where r = s.
3.Â r,s is the Bayesian posterior mean of the intraclass correlation coefficient.
4.Â r,s is the Bayesian posterior mode of the intraclass correlation coefficient.
5. Similar to 3., but with errors divided by 1/n r,t e 2 r,t with n being the number of available errors.
6. Similar to 4., but with errors divided by 1/n r,t e 2 r,t with n being the number of available errors. 7.Â r,s = B/C for all r = s, where N b and N c the number of terms in the corresponding sum that are not missing, and R is the number of regions.
8. The estimator of A is an approximation to the elementwise posterior median with uniform prior U [0, 1], namely:Â whereT is the average number of time periods per region. It is a weighted average of the prior mean and the data. Note that this is our chosen method.
We will now show that ifÃ * [≥0] is positive definite, then A is also positive definite. We can writeÂ where B is positive definite and J is the matrix all of whose entries are 1.
NowÂ is positive definite if and only if x Â x > 0 for all x = 0. Now The first term on the right-hand side of (24) is positive by definition, since B is positive definitive. The second term is non-negative: Thus (24) is positive and soÂ is positive definite.
9. Similar to 8. but with elements ofÃ computed as A r,s = 1/T T e r,t e s,t 1/T T e 2 r,t · 1/T T e 2 s,t (2014): First, standardize e r,t by dividing the errors by 1/n r,t e 2 r,t with n being the number of available errors. Then the elements of the estimated correlation matrixÂ are given aŝ

Bayesian method introduced by Fosdick & Raftery
where SS r = T e 2 r,t , SS s = T e 2 s,t , and SS r,s = T e r,t e s,t . Note that we are summing only over those time periods for which both countries, r and s, have errors available.
11. Similar to 10., but using (T + 1) instead of T in both the nominator and the denominator. This corresponds to the arcsin prior in Fosdick & Raftery (2014). Note that a version of this method was tested where the errors were not standardized, but it performed less well, producing smaller correlations. Fosdick & Raftery (2014) found that correlations between countries were quite different for high and low TFR values. In light of this, we estimated two separate correlation matrices, one for the cases where the country had overall TFR 5 or above, and the other when the TFR was below 5.
The estimated correlation matrices resulting from methods 1.-7. have the same value for all off-diagonal elements. The elements of matrices resulting from methods 8.-11. differ from one another. In the latter case, all non-defined elements are set to the mean of the off-diagonal elements.

A.3 Out of Sample Validation Measures
This section provides detailed definitions for our out of sample validation measures.
We denote by C the number of countries in our dataset, by R c the number of regions for country c, by R the total number of regions, so that R = C c=1 R c , and by T the number of time periods over which we validate. Furthermore, f rc,t denotes the observed TFR, andf rc,t denotes the point projection of the TFR (median of the predictive distribution), respectively, for region r of country c at time t.
The mean absolute error, MAE, is given by Rc r=1 T t=1 |f rc,t −f rc,t | .
The bias is given by bias = 1 RT C c=1 Rc r=1 T t=1 (f rc,t −f rc,t ) .
The Continuous Ranked Probability Score (CRPS) is an overall measure of the quality of a probabilistic forecast. If we are predicting the quantity X (here the TFR), and produce the predictive distribution F , and observe a value x, then the CRPS is defined by: where X and X are independent copies of a random variable with the distribution F (Gneiting & Raftery (2007), Eq. 21). We average the resulting values of CRPS across observations. Note that for the persistence method, the first part of the equation is zero.
It is not simple to calculate the expectation, E F , under the distribution F analytically, so we did it by simulation. To obtain the expectation E F , we sampled 5,000 values at random from the distribution F and took the average of the corresponding predictands.   (Tables 1 and 2).
Here, we add a comparison of the various correlation methods discussed in Appendix A.2 for the average TFR (Tables 3 and 4). The first row shows results when no correlation is taken into account. The number in parentheses of the following rows corresponds to the numbering of the eleven methods in the Appendix. In our study, we used method 8.