3 Empirical Examples: Methods and Data
3.1 Poisson Regression for Coale-Trussell Parameters

We now apply these results to a well-known fertility model, Coale-Trussell, using open-interval data from public use samples of Brazil's 1991 census. The simplest version of the Coale-Trussell model schedule for marital fertility [4] assumes that fertility levels λ_{g} for five-year age groups are related to one another by the parametric specification

    λ_{g} = M N_{g}* exp(m v_{g}*),   g = 1,...,G
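or, defining a new, mathematically more convenient parameter k=ln(M),

    λ_{g} = N_{g}* exp(k + m v_{g}*),   g = 1,...,G

where G=6, the age groups are 20-24, 25-29, ..., 45-49, and the N_{g}* and v_{g}* values are known constants ([4], p. 188). The results for piecewise-constant models in the previous section therefore imply that with aggregate DLB data generated by a Coale-Trussell fertility schedule with parameters (k,m), the researcher can estimate (k,m) by maximizing the sample likelihood under the assumption that births B_{g} in age group g, given woman-years of exposure Y_{g}, are independently distributed as

    B_{g} ~ Poisson( Y_{g} N_{g}* exp(k + m v_{g}*) ),   g = 1,...,G

Most modern statistical software packages can estimate (k,m) from {B_{g},Y_{g}} using the generalized linear modeling approach. Estimation is based on the relationship

    ln E[B_{g}] = ln(Y_{g} N_{g}*) + k + m v_{g}*

in which ln(Y_{g} N_{g}*) is a known offset, k is the intercept, and m is the coefficient on v_{g}*.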
Broström [3] provided an example program for GLIM software [6]. Table 2 gives additional examples for the SAS and S-PLUS software systems.
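As a complement to the GLIM, SAS, and S-PLUS programs, the Poisson regression with an offset can also be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' program: the function name `fit_coale_trussell` is hypothetical, the fit uses an ordinary Newton-Raphson iteration rather than any particular package's GLM routine, and the constants in `N_STAR` and `V_STAR` are illustrative placeholders that should be replaced by the N_{g}* and v_{g}* values tabulated in [4].

```python
import math

# ASSUMPTION: illustrative age-pattern constants for ages 20-24, ..., 45-49.
# They are placeholders; substitute the N_g* and v_g* values tabulated in [4].
N_STAR = [0.460, 0.431, 0.395, 0.322, 0.167, 0.024]
V_STAR = [0.000, -0.279, -0.667, -1.042, -1.414, -1.671]


def fit_coale_trussell(births, woman_years, n_star=N_STAR, v_star=V_STAR,
                       max_iter=100, tol=1e-10):
    """Return ML estimates (k, m) for B_g ~ Poisson(Y_g * N_g* * exp(k + m * v_g*)).

    Plain Newton-Raphson on the Poisson log-likelihood; ln(Y_g * N_g*) is the offset.
    """
    if sum(births) <= 0:
        raise ValueError("at least one birth is needed to estimate (k, m)")
    offset = [math.log(y * n) for y, n in zip(woman_years, n_star)]
    # Start at m = 0, with k chosen so fitted and observed total births agree.
    k = math.log(sum(births) / sum(math.exp(o) for o in offset))
    m = 0.0
    for _ in range(max_iter):
        mu = [math.exp(o + k + m * v) for o, v in zip(offset, v_star)]
        # Score vector (first derivatives of the log-likelihood).
        s_k = sum(b - u for b, u in zip(births, mu))
        s_m = sum((b - u) * v for b, u, v in zip(births, mu, v_star))
        # Information matrix (negative second derivatives).
        i_kk = sum(mu)
        i_km = sum(u * v for u, v in zip(mu, v_star))
        i_mm = sum(u * v * v for u, v in zip(mu, v_star))
        det = i_kk * i_mm - i_km ** 2
        # Newton step: solve the 2x2 system  I * (dk, dm) = (s_k, s_m).
        dk = (i_mm * s_k - i_km * s_m) / det
        dm = (i_kk * s_m - i_km * s_k) / det
        k, m = k + dk, m + dm
        if abs(dk) + abs(dm) < tol:
            break
    return k, m
```

Calling `fit_coale_trussell(births, woman_years)` with two six-element lists (one entry per age group) returns (k, m), and M can be recovered as exp(k). The sketch has no safeguards for very sparse samples, where the likelihood may have no finite maximum; a production analysis would more likely rely on a tested GLM routine.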
3.2 Brazilian Census Data

In all of our examples we use data from public use samples of Brazil's 1991 demographic census, which collected current fertility information exclusively in DLB form. Our focus is on subnational estimates. We analyze fertility in 723 small areas, called municipalities (municípios in Portuguese), from the state of Minas Gerais. Municipalities are roughly equivalent to U.S. counties, and these 723 administrative units cover the state completely, with no overlap. We selected the 1991 Minas Gerais data as a test case because of earlier work by colleagues [2], who applied a very different set of statistical methods - Bayesian spatial smoothing of the standard BLY data - to estimate municipal-level fertility control.

Table 3 presents information on the 1991 census sample for Minas Gerais. All data in this table refer to unweighted samples of women 20-49, regardless of marital status, on the census date. The overall sample is very large, with approximately 392,000 women. Municipal-level sample sizes vary widely, however. Column (1) provides information on the number of woman-years available from the year before the census; by construction, this equals the number of women surveyed. Many municipalities have extremely small sample sizes: there is information for fewer than 100 women in 49 of the 723 municipalities, and for fewer than 200 women in 206 (49+157) municipalities. The smallest municipal-level sample contains information for only 30 women aged 20-49, and the median size for the municipal-level samples is 311 women.

Column (3) displays data on the cross-municipality distribution of births in the year prior to the census (i.e., BLY birth data). Last-year births are in single digits (0-9) for 62 of the municipalities, and the majority of municipalities (556 of 723) have fewer than 50 last-year births to interviewed women.
The small sample sizes for many municipalities clearly create severe challenges for estimating sensible local-level fertility indices, and for analyzing inter-municipality differences. With such small samples of women and last-year births, estimated fertility indicators may vary widely across municipalities merely because of coincidental sampling noise, not because of any real features of the fertility regime. Variability in small samples is likely to be a particularly bad problem for the Coale-Trussell m parameter, which typically has high standard errors and wide confidence intervals even in large samples ([3], Table 3).

As an extreme example of sampling variability, consider the municipality with the smallest number of women interviewed, Serra da Saudade, in central-western Minas Gerais. The 1991 census sample for Serra da Saudade includes only 30 women: four each in the 20-24, 30-34, and 40-44 age groups, eight each in the 25-29 and 35-39 groups, and two women 45-49. (Readers can view and manipulate the entire census sample for this municipality in the Addendum's spreadsheet, Serra da Saudade.xls; a PDF version is also available.) Only two women, one 25-29 and one 35-39, reported births in the year before the census. A demographer who heroically (and naively) estimated the Coale-Trussell m parameter from these data would arrive at a value of -1.43. In contrast, estimated m values for the four (more populous) municipalities that border Serra da Saudade are 1.01, 2.00, 0.71, and 1.35. Serra da Saudade appears, then, to be an anomalous island in a sea of fairly high fertility control. This is nonsense, of course.
Differences in m between Serra da Saudade and its neighbors are caused almost entirely by the coincidental fact that half of the reported births for 1991 (1 of 2) were in the 35-39 age group, and by the higher sample weight for the older of the two mothers. As one might expect, sampling noise, rather than real fertility differences, is the main cause of the local variation in m.

Researchers can ameliorate the problem of small sample sizes when fertility data are collected in DLB form (as they are in the 1991 Brazilian census) by using information from woman-years that occurred more than one year before a survey. Columns (2) and (4) of Table 3 show how expansion of the Minas Gerais sample back to T=5 years before 1991 increases sample sizes. The numbers of observed births and woman-years in each municipality are approximately quadrupled by this procedure, and the distribution of municipal-level sample sizes shifts dramatically. With DLB data, the majority of municipalities have over 1000 woman-years and 100 births from which to estimate fertility. In contrast, only the very largest municipalities had equivalent sample sizes with the last-year-only data. DLB sample sizes are still fairly small, but it is far more plausible that one can extract meaningful fertility information from the DLB than from the BLY samples. Roughly speaking, sample sizes quadruple, which should halve the standard errors of estimators, because standard errors shrink approximately with the square root of sample size. This represents a significant improvement in accuracy, and by reducing the level of noise in the data researchers can often "hear the signal" (i.e., identify systematic patterns of interest) much better.

3.3 Simulated Small-Sample Properties of BLY and DLB Estimators

As demonstrated in [8], under the strong, idealized assumptions of many formal demographic models (constant age schedules and complete homogeneity within age groups), DLB data produce consistent parameter estimators that have lower variance than BLY estimators.
Theoretical tests and empirical simulations in [8] also demonstrated that DLB estimators outperformed BLY under more realistic conditions, when age schedules change and fertility rates vary within age groups. However, Schmertmann [8] compared DLB and BLY estimators only in models without parametric restrictions on the set of age-specific rates {λ_{15-19},...,λ_{45-49}}. The Coale-Trussell model imposes parametric restrictions, and it is possible that the comparative performance of BLY and DLB estimators therefore differs. Most importantly, when fertility falls rapidly before the census date, as it did in Minas Gerais over the 1980s, DLB estimates of m for the census date may be biased downward, because the DLB data include earlier years in which fertility control was lower. Adding these woman-years to the DLB sample may therefore "contaminate" the estimate of current m. The earlier simulations with changing rates in [8] suggest that any such bias is likely to be small. However, before calculating (M,m) estimates for hundreds of municipalities, it is instructive to compare small-sample properties of Coale-Trussell estimators based on actual BLY and DLB data from Minas Gerais.

We investigated these properties by drawing large numbers of subsamples of different sizes from the Minas Gerais 1991 public use sample. For each subsample we calculated Poisson regression estimates of (M,m) from both BLY and DLB versions of the data. We focus here on the second parameter m; results are nearly identical for M or k=ln M. The distribution of m estimates over many subsamples allows us to assess (1) the magnitude of DLB biases caused by including woman-years from earlier periods of (presumably) lower fertility control, and (2) the reduction in sampling variability achieved by including these additional woman-years in the DLB sample.

Census public use files contain DLB data for approximately 265,000 married women 20-49 in Minas Gerais in 1991.
Table 4 contains the weighted counts of these women by (a,u,d) cell, using T=5 as the maximum sampling period. For simulation purposes we assume that a population of women is distributed across (a,u,d) cells with exactly these proportions. BLY estimates from the Minas Gerais sample, which use data from the last year only, are
We assume that these are the "true" population parameters to be estimated from small samples. DLB estimates from Table 4, which add potentially contaminating fertility information from 1-5 years before the 1991 census, are
The two estimation methods produce similar parameter estimates in the large sample in Table 4, but we wish to investigate their comparative performance in small samples. In particular, we wish to learn which method is more likely to produce estimates of m for 1991 that are close to the "true" value m*=1.036, and to learn how performance of BLY and DLB estimators varies with sample size.
For each of several sample sizes N ∈ {100, 200, 500, 1000, 2000, 5000} we conducted a Monte Carlo study by repeating the following procedure 200 times:
Table 5 displays summary results from these studies. [Figure 1] displays the distribution of m estimates for the N=200 case, representing a typical municipal-level sample size in our example data.
The results in Table 5 illustrate that, in this particular case, DLB estimators produce markedly better results - indicated by lower mean absolute errors - at all sample sizes up to N=5000. As expected, falling fertility in Minas Gerais prior to 1991 leads to a tendency to underestimate fertility control m in DLB samples, as illustrated by the negative biases in the DLB column. This "contamination effect" is small, however. As one switches from BLY to DLB data, gains from decreased sampling variance overwhelm disadvantages of bias from "contaminated" samples that include fertility information from earlier years. The net gain is especially large when N=100 or 200, because in this range of sample sizes there is evidence that BLY estimators have positive small-sample biases, as well as high variance. (An asymmetry in the data causes the right-skewed distribution in the estimates: small samples in which births are far below population averages are more likely than small samples in which births are far above.)

The most important column of Table 5 is the rightmost, which displays the percentage of Monte Carlo samples in which the DLB estimate of m was closer than the standard BLY estimate to m*=1.036. The simulations show that despite small negative biases, the DLB estimate is approximately three times more likely to win this contest in any single sample of size N ≤ 2000, and approximately twice as likely to be closer to m* when N=5000.

In sum, simulation results with census data from Minas Gerais 1986-1991 illustrate that replacing the standard, truncated BLY form of open-interval data with DLB information produces superior estimators of Coale-Trussell parameters in small to moderately sized samples such as those for the 1991 municipalities. In this particular case, as in the examples in the earlier paper [8], Monte Carlo evidence strongly suggests that DLB estimators yield better results.
If the researcher's objective is to arrive at a sample estimate that is close to the population parameter, the benefits of decreased sampling variance with DLB data greatly exceed the small costs of increased bias. DLB is a far better bet to produce a good guess from a small sample.
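The three summary measures discussed above (bias, mean absolute error, and the share of samples in which the DLB estimate is closer to m*) are simple to compute once paired estimates from each Monte Carlo replication are in hand. The following is a minimal sketch of that bookkeeping only; the function name `summarize_estimates` and its dictionary keys are hypothetical, and it does not reproduce the subsampling procedure or the figures in Table 5.

```python
def summarize_estimates(bly_estimates, dlb_estimates, m_true):
    """Summarize paired Monte Carlo estimates of m against a reference value."""
    if len(bly_estimates) != len(dlb_estimates) or not bly_estimates:
        raise ValueError("need two non-empty, equally long lists of estimates")
    n = len(bly_estimates)

    def bias(est):
        # Mean estimate minus the reference value.
        return sum(est) / n - m_true

    def mean_abs_error(est):
        # Average absolute distance from the reference value.
        return sum(abs(e - m_true) for e in est) / n

    # Count replications in which the DLB estimate is strictly closer to m_true.
    dlb_wins = sum(
        abs(d - m_true) < abs(b - m_true)
        for b, d in zip(bly_estimates, dlb_estimates)
    )
    return {
        "bias_bly": bias(bly_estimates),
        "bias_dlb": bias(dlb_estimates),
        "mae_bly": mean_abs_error(bly_estimates),
        "mae_dlb": mean_abs_error(dlb_estimates),
        "share_dlb_closer": dlb_wins / n,
    }
```

Each list holds one estimate of m per replication, in the same order, so that element i of both lists comes from the same subsample; `m_true` plays the role of m*.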
Estimating Parametric Fertility Models with Open Birth Interval Data Carl P. Schmertmann André Junqueira Caetano © 1999 Max-Planck-Gesellschaft ISSN 1435-9871 http://www.demographic-research.org/Volumes/Vol1/5