3 Empirical Examples: Methods and Data
3.1 Poisson Regression for Coale-Trussell Parameters

We now apply these results to a well-known fertility model, the Coale-Trussell model, using open-interval data from public use samples of Brazil's 1991 census. The simplest version of the Coale-Trussell model schedule for marital fertility [4] assumes that fertility levels for five-year age groups are related to one another by the parametric specification

{11}

or, defining a new, mathematically more convenient parameter k=ln(M),

{12}

where G=6, the age groups are 20-24, 25-29, ..., 45-49, and the Ng* and vg* values are known constants ([4], p. 188).

The results for piecewise-constant models in the previous section therefore imply that with aggregate DLB data generated by a Coale-Trussell fertility schedule with parameters (k,m), the researcher can estimate (k,m) by maximizing the sample likelihood under the assumption that:

{13}

Most modern statistical software packages can estimate (k,m) from {Bg,Yg} using the generalized linear modeling approach. Estimation is based on the relationship

{14}
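
The displayed equations {11}-{14} are not reproduced in this text. A reconstruction consistent with the surrounding definitions and with the code in Table 2 might read as follows. The symbol \lambda_g for the fertility rate in age group g, the placement of the asterisks, and the exact form of {13} are assumptions rather than the authors' original typography.

```latex
\begin{align}
\lambda_g &= M \, N_g^{*} \exp\!\left( m \, v_g^{*} \right), \qquad g = 1, \dots, G \tag{11} \\
\lambda_g &= N_g^{*} \exp\!\left( k + m \, v_g^{*} \right), \qquad k = \ln M \tag{12} \\
B_g &\sim \mathrm{Poisson}\!\left( Y_g \, N_g^{*} \exp\!\left( k + m \, v_g^{*} \right) \right) \tag{13} \\
\ln E[B_g] &= \ln\!\left( Y_g N_g^{*} \right) + k + m \, v_g^{*} \tag{14}
\end{align}
```

In this form {14} is a log-linear model with intercept k, slope m on v_g*, and a known offset ln(Y_g N_g*), which matches the offset and model statements used in Table 2.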

Broström [3] provided an example program for GLIM software [6]. Table 2 gives additional examples for the SAS and S-PLUS software systems.

Table 2

Code for Estimating Poisson Model

In both examples below, we assume that the researcher has prepared a data set called DLB. DLB must have 6 observations (one per age group 20-24...45-49), and must contain variables called N and V (the Coale-Trussell constants) and B and Y (the aggregate totals of births and years, respectively, from the last-birth data).

Both examples produce an estimated intercept, k=ln(M), and an estimated slope (m).

SAS
     data DLB ;
       set DLB ;
       off = log(Y*N) ;    /* offset: log of exposure times Coale-Trussell N */
     run ;

     proc genmod data = DLB ;
       model B = V / dist=poisson offset=off ;
     run ;

S-PLUS
     DLB$off <- log(DLB$Y * DLB$N)    # offset: log of exposure times Coale-Trussell N
     glm(B ~ V + offset(off), data=DLB, family=poisson)
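
For readers without access to SAS or S-PLUS, the same estimation can be sketched in plain Python with a hand-written Newton-Raphson step. This is a minimal sketch, not the authors' code: the function name fit_coale_trussell is hypothetical, the N and V values are illustrative placeholders that should be replaced by the constants in [4], and the births B are synthetic expected counts generated from assumed parameters so that the fit can be checked.

```python
import math

# ASSUMPTION: illustrative constants for ages 20-24 ... 45-49.
# Replace with the N and V values from [4], p. 188 before real use.
N = [0.460, 0.431, 0.395, 0.322, 0.167, 0.024]
V = [0.000, -0.279, -0.667, -1.042, -1.414, -1.671]


def fit_coale_trussell(B, Y, N, V, tol=1e-10, max_iter=100):
    """Poisson ML estimates of (k, m) in log E[B_g] = log(Y_g*N_g) + k + m*V_g."""
    offset = [math.log(y * n) for y, n in zip(Y, N)]
    # Start from the intercept-only fit (m = 0).
    k = math.log(sum(B) / sum(math.exp(o) for o in offset))
    m = 0.0
    for _ in range(max_iter):
        mu = [math.exp(o + k + m * v) for o, v in zip(offset, V)]
        # Score vector of the Poisson log-likelihood.
        s0 = sum(b - u for b, u in zip(B, mu))
        s1 = sum(v * (b - u) for b, u, v in zip(B, mu, V))
        # Fisher information (2 x 2, symmetric).
        i00 = sum(mu)
        i01 = sum(v * u for u, v in zip(mu, V))
        i11 = sum(v * v * u for u, v in zip(mu, V))
        det = i00 * i11 - i01 * i01
        # Newton step: inverse information times score.
        dk = (i11 * s0 - i01 * s1) / det
        dm = (i00 * s1 - i01 * s0) / det
        k, m = k + dk, m + dm
        if abs(dk) + abs(dm) < tol:
            break
    return k, m


# Synthetic check: 1000 woman-years per age group (hypothetical) and births
# set to their expected values under assumed parameters k = -0.449, m = 1.036.
Y = [1000.0] * 6
B = [y * n * math.exp(-0.449 + 1.036 * v) for y, n, v in zip(Y, N, V)]
k_hat, m_hat = fit_coale_trussell(B, Y, N, V)
print(round(k_hat, 3), round(m_hat, 3), round(math.exp(k_hat), 3))
```

Because the Poisson model with a log link has a concave log-likelihood, this iteration is the same calculation that the generalized linear modeling routines above carry out by iteratively reweighted least squares.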

3.2 Brazilian Census Data

In all of our examples we use data from public use samples of Brazil's 1991 demographic census, which collected current fertility information exclusively in DLB form. Our focus is on subnational estimates. We analyze fertility in 723 small areas, called municipalities (municípios in Portuguese) from the state of Minas Gerais. Municipalities are roughly equivalent to U.S. counties, and these 723 administrative units cover the state completely, with no overlap. We selected the 1991 Minas Gerais data as a test case because of earlier work by colleagues [2], who applied a very different set of statistical methods - Bayesian spatial smoothing of the standard BLY data - to estimate municipal-level fertility control.

Table 3 presents information on the 1991 census sample for Minas Gerais. All data in this table refer to unweighted samples of women 20-49, regardless of marital status, on the census date. The overall sample is very large, with approximately 392,000 women. Municipal-level sample sizes vary widely, however. Column (1) provides information on the number of woman-years available from the year before the census; by construction, this equals the number of women surveyed. Many municipalities have extremely small sample sizes: there is information for fewer than 100 women in 49 of the 723 municipalities, and for fewer than 200 women in 206 (49+157) municipalities. The smallest municipal-level sample contains information for only 30 women aged 20-49, and the median size for the municipal-level samples is 311 women. Column (3) displays data on the cross-municipality distribution of births in the year prior to the census (i.e., BLY birth data). Last-year births are in single digits (0-9) for 62 of the municipalities, and the majority of municipalities (556 of 723) have fewer than 50 last-year births to interviewed women.

TABLE 3
Distribution of Unweighted Sample Sizes across 723 Municipalities in Minas Gerais,
Brazil 1991 Public Use Census Samples

 

                 # of municipalities                       # of municipalities
                 in sample size range                      in sample size range
                   (1)          (2)                          (3)          (4)
WOMAN-YEARS      Past year    Five years     BIRTHS        Past year    Five years
                 (BLY)*       (DLB)                        (BLY)        (DLB)

0-99                 49            0         0-9               62            0
100-199             157            3         10-19            157           12
200-499             361          102         20-49            337           89
500-999              98          198         50-99            105          195
1000+                58          420         100+              62          427
Total               723          723         Total            723          723

Minimum Size         30          125         Minimum Size       0           13
Median              311        1,162         Median            30          119
Maximum          47,865      192,869         Maximum        3,285       13,526

* Woman-years over the past year equals the number of women interviewed

The small sample sizes for many municipalities clearly create severe challenges for estimating sensible local-level fertility indices, and for analyzing inter-municipality differences. With such small samples of women and last-year births, estimated fertility indicators may vary widely across municipalities merely because of coincidental sampling noise, not because of any real features of the fertility regime. Variability in small samples is likely to be a particularly bad problem for the Coale-Trussell m parameter, which typically has high standard errors and wide confidence intervals even in large samples ([3], Table 3).

As an extreme example of sampling variability, consider the municipality with the smallest number of women interviewed, Serra da Saudade, in central-western Minas Gerais. The 1991 census sample for Serra da Saudade includes only 30 women - four each in the 20-24, 30-34, and 40-44 age groups, eight each in the 25-29 and 35-39 groups, and two women 45-49. (Readers can view and manipulate the entire census sample for this municipality in the Addendum's spreadsheet, Serra da Saudade.xls. Readers who are unable to open the Excel file or load it into another software package can view a PDF version instead.) Only two women, one 25-29 and one 35-39, reported births in the year before the census. A demographer who heroically (and naively) estimated the Coale-Trussell m parameter from these data would arrive at a value of -1.43. In contrast, estimated m values for the four (more populous) municipalities that border Serra da Saudade are 1.01, 2.00, 0.71, and 1.35. Serra da Saudade appears, then, to be an anomalous island in a sea of fairly high fertility control. This is nonsense, of course. Differences in m between Serra da Saudade and its neighbors are caused almost entirely by two coincidences: half of the reported births for 1991 (1 of 2) occurred in the 35-39 age group, and the older of the two mothers has the higher sample weight. As one might expect, sampling noise, rather than real fertility differences, is the main cause of the local variation in m.

Researchers can ameliorate the problem of small sample sizes when fertility data are collected in DLB form (as they are in the 1991 Brazilian census) by using information from woman-years that occurred more than one year before a survey. Columns (2) and (4) of Table 3 show how expansion of the Minas Gerais sample back to T=5 years before 1991 increases sample sizes. The numbers of observed births and woman-years in each municipality are approximately quadrupled by this procedure, and the distribution of municipal-level sample sizes shifts dramatically. With DLB data, the majority of municipalities have over 1000 woman-years and 100 births from which to estimate fertility. In contrast, only the very largest municipalities had equivalent sample sizes with the last-year-only data.

DLB sample sizes are still fairly small, but it is far more plausible that one can extract meaningful fertility information from the DLB than from the BLY samples. Roughly speaking, sample sizes quadruple, which should halve the standard errors of estimators. This represents a significant improvement in accuracy, and by reducing the level of noise in the data researchers can often "hear the signal" (i.e., identify systematic patterns of interest) much better.
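
The arithmetic behind this rule of thumb can be sketched as follows, assuming (as is standard for estimators of this kind, though not stated explicitly in the text) that standard errors scale with the inverse square root of sample size n. The median woman-year counts 311 and 1,162 come from columns (1) and (2) of Table 3; the labels n_BLY and n_DLB are introduced here for illustration.

```latex
\mathrm{SE}(\hat{\theta}) \propto \frac{1}{\sqrt{n}}
\quad \Longrightarrow \quad
\frac{\mathrm{SE}_{\mathrm{DLB}}}{\mathrm{SE}_{\mathrm{BLY}}}
\approx \sqrt{\frac{n_{\mathrm{BLY}}}{n_{\mathrm{DLB}}}}
= \sqrt{\frac{1}{4}} = \frac{1}{2},
\qquad
\text{e.g. at the median municipality: } \sqrt{\frac{311}{1162}} \approx 0.52 .
```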

3.3 Simulated Small-Sample Properties of BLY and DLB Estimators

As demonstrated in [8], under the strong, idealized assumptions of many formal demographic models (constant age schedules and complete homogeneity within age groups), DLB data produce consistent parameter estimators that have lower variance than BLY estimators. Theoretical tests and empirical simulations in [8] also demonstrated that DLB estimators outperformed BLY under more realistic conditions, when age schedules change and fertility rates vary within age groups.

However, Schmertmann [8] compared DLB and BLY estimators only in models without parametric restrictions on the set of age-specific rates {λ15-19,...,λ45-49}. The Coale-Trussell model imposes parametric restrictions, and it is possible that the comparative performance of BLY and DLB estimators therefore differs. Most importantly, when fertility falls rapidly before the census date, as it did in Minas Gerais over the 1980s, DLB estimates of m for the census date may be biased downward, because the DLB data include earlier years in which fertility control was lower. Adding these woman-years to the DLB sample may therefore "contaminate" the estimate of current m.

The earlier simulations with changing rates in [8] suggest that any such bias is likely to be small. However, before calculating (M,m) estimates for hundreds of municipalities, it is instructive to compare small-sample properties of Coale-Trussell estimators based on actual BLY and DLB data from Minas Gerais.

We investigated these properties by drawing large numbers of subsamples of different sizes from the Minas Gerais 1991 public use sample. For each subsample we calculated Poisson regression estimates of (M,m) from both BLY and DLB versions of the data. We focus here on the second parameter m; results are nearly identical for M or k=ln M.

The distribution of m estimates over many subsamples allows us to assess (1) the magnitude of DLB biases caused by including woman-years from earlier periods of (presumably) lower fertility control, and (2) the reduction in sampling variability achieved by including these additional woman-years in the DLB sample.

Census public use files contain DLB data for approximately 265,000 married women 20-49 in Minas Gerais in 1991. Table 4 contains the weighted counts of these women by (a,u,d) cell, using T=5 as the maximum sampling period. For simulation purposes we assume that a population of women is distributed across (a,u,d) cells with exactly these proportions.

BLY estimates from the Minas Gerais sample, which use data from the last year only, are

k*= -0.449 M*=0.638 m*=1.036 [full sample BLY].

We assume that these are the "true" population parameters to be estimated from small samples. DLB estimates from Table 4, which add potentially contaminating fertility information from 1-5 years before the 1991 census, are

k= -0.433 M=0.649 m=1.001 [full sample DLB, T=5].

The two estimation methods produce similar parameter estimates in the large sample in Table 4, but we wish to investigate their comparative performance in small samples. In particular, we wish to learn which method is more likely to produce estimates of m for 1991 that are close to the "true" value m*=1.036, and to learn how performance of BLY and DLB estimators varies with sample size.

Table 4
Time Since Last Birth for Currently Married Women in Minas Gerais, 1991
Weighted Totals from Public Use Sample
   

                    Years Since Last Live Birth
AGE 0-1 1-2 2-3 3-4 4-5 5+/never TOTAL
20 13,808 9,962 5,378 2,028 693 12,934 44,804
21 15,441 11,407 6,989 3,745 1,724 14,132 53,438
22 17,898 13,356 9,596 5,681 2,510 15,888 64,928
23 18,160 14,961 10,918 6,612 3,772 17,407 71,829
24 18,194 15,365 12,558 8,008 4,823 17,644 76,592
25 18,195 16,326 13,270 8,992 6,303 20,046 83,133
26 17,621 16,612 14,005 10,044 7,093 22,148 87,523
27 16,686 15,518 14,165 11,518 8,483 25,445 91,813
28 16,341 15,419 14,118 11,122 8,685 27,764 93,449
29 14,203 13,965 13,585 11,411 8,978 30,813 92,954
30 13,413 12,994 12,134 10,637 9,070 32,502 90,751
31 11,159 11,151 11,187 10,316 9,096 36,002 88,910
32 10,336 10,470 10,390 9,547 9,375 41,639 91,758
33 8,487 9,674 9,779 8,971 8,877 44,627 90,415
34 7,707 8,120 8,190 8,008 8,030 46,806 86,861
35 6,362 7,091 6,705 7,275 7,487 47,965 82,885
36 5,701 6,269 6,626 6,250 6,571 51,869 83,287
37 5,148 5,503 5,831 5,214 5,737 52,947 80,380
38 3,881 4,564 4,980 4,649 5,569 52,261 75,904
39 3,387 3,821 4,769 4,124 4,692 52,046 72,839
40 2,805 3,794 3,899 3,629 4,079 52,416 70,622
41 2,162 2,377 3,390 3,159 3,428 49,106 63,622
42 1,856 2,179 2,644 2,532 2,721 48,779 60,711
43 1,344 1,685 2,314 2,425 2,608 50,428 60,804
44 917 1,262 1,786 1,952 2,295 47,061 55,274
45 634 834 1,456 1,543 2,015 46,482 52,965
46 515 613 1,303 1,317 1,727 45,853 51,328
47 268 348 646 1,039 1,339 42,750 46,391
48 234 386 437 664 950 42,124 44,795
49 200 144 374 449 788 41,749 43,705
               
TOTAL 253,066 236,169 213,423 172,859 149,519 1,129,632 2,154,669

For each of several sample sizes N ∈ {100, 200, 500, 1000, 2000, 5000} we conducted a Monte Carlo study by repeating the following procedure 200 times:

  • draw a pseudo-random sample of N women from the distribution of (a,u,d) in Table 4
  • construct BLY and DLB values for Bg and Yg, g=20-24,...,45-49
  • estimate Coale-Trussell parameters k and m by Poisson regression and record their values
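
The first two steps can be sketched in Python for the BLY case only. This is a simplified illustration under stated assumptions, not the authors' simulation code: the function name draw_bly_sample is hypothetical, women are grouped by age at the census, a BLY birth is taken to mean a last birth 0-1 years ago, and sampling is with replacement. The two count lists are five-year sums of the 0-1 and TOTAL columns of Table 4, computed here rather than printed in the source. The DLB tabulation depends on the construction given in the earlier section and is not reproduced.

```python
import random

AGE_GROUPS = ["20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]

# Five-year sums of Table 4 (weighted counts of married women).
BIRTH_LAST_YEAR = [83501, 83046, 51102, 24479, 9084, 1851]   # column 0-1
ALL_WOMEN = [311591, 448872, 448695, 395295, 311033, 239184]  # column TOTAL


def draw_bly_sample(n_women, rng):
    """Draw n_women from the Table 4 proportions; return BLY births and woman-years."""
    cells, weights = [], []
    for g in range(len(AGE_GROUPS)):
        cells.append((g, True))    # last birth 0-1 years ago
        weights.append(BIRTH_LAST_YEAR[g])
        cells.append((g, False))   # no birth in the past year
        weights.append(ALL_WOMEN[g] - BIRTH_LAST_YEAR[g])
    # ASSUMPTION: sampling with replacement, proportional to weighted counts.
    draws = rng.choices(cells, weights=weights, k=n_women)
    B = [0] * len(AGE_GROUPS)
    Y = [0] * len(AGE_GROUPS)
    for g, had_birth in draws:
        Y[g] += 1                  # each sampled woman contributes one woman-year
        B[g] += int(had_birth)
    return B, Y


rng = random.Random(1991)          # fixed seed so the sketch is reproducible
B, Y = draw_bly_sample(200, rng)
for label, b, y in zip(AGE_GROUPS, B, Y):
    print(label, b, y)
```

Each pair (B, Y) produced this way would then be passed to a Poisson regression of the kind shown in Table 2, and the procedure repeated 200 times per sample size.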

Table 5 displays summary results from these studies. [Figure 1] displays the distribution of m estimates for the N=200 case, representing a typical municipal-level sample size in our example data.

Table 5
Summary measures for m estimates
over 200 Monte Carlo Samples at each Sample Size N

                        BLY                          DLB              % of samples in which
N              mean(a)  bias(b)  MAE(c)     mean(a)  bias(b)  MAE(c)  DLB estimate is closer to m*

100             1.17     0.13    0.69        1.01    -0.03    0.27        72
200             1.10     0.06    0.43        1.02    -0.01    0.20        72
500             1.04     0.00    0.26        0.97    -0.06    0.13        76
1,000           1.05     0.01    0.17        1.01    -0.03    0.09        74
2,000           1.03    -0.01    0.13        1.00    -0.04    0.08        73
5,000           1.04     0.01    0.08        1.00    -0.04    0.05        67
...                                                                      ...
Population(d)   1.036    0       0           1.001   -0.036   0.036        0
                   
 
 
(a) mean ≡ [Σ_s m_s] / 200, where s=1...200 indexes Monte Carlo samples
(b) bias ≡ mean - 1.036
(c) MAE ≡ [Σ_s |m_s - 1.036|] / 200
(d) Values on this row represent a single calculation from the full sample in Table 4, rather than Monte Carlo simulations. Under sampling without replacement, all possible samples of this size are identical.
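
The summary measures defined in the table notes can be computed as in the following Python sketch. The function names summarize and pct_dlb_closer are hypothetical, and the three-element input lists are made-up values used only to show the calculation; they are not results from the simulations.

```python
def summarize(estimates, m_star):
    """Return (mean, bias, MAE) of a list of m estimates relative to m_star."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - m_star
    mae = sum(abs(m - m_star) for m in estimates) / n
    return mean, bias, mae


def pct_dlb_closer(m_bly, m_dlb, m_star):
    """Percentage of paired samples in which the DLB estimate is closer to m_star."""
    wins = sum(abs(d - m_star) < abs(b - m_star) for b, d in zip(m_bly, m_dlb))
    return 100.0 * wins / len(m_bly)


# Made-up estimates from three paired samples (illustration only).
m_star = 1.036
m_bly = [1.60, 0.40, 1.10]
m_dlb = [1.05, 0.95, 0.90]
print(summarize(m_bly, m_star))
print(summarize(m_dlb, m_star))
print(pct_dlb_closer(m_bly, m_dlb, m_star))
```

Pairing the BLY and DLB estimates sample by sample matters for the last column, since both estimates in a pair are computed from the same drawn women.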

The results in Table 5 illustrate that, in this particular case, DLB estimators produce markedly better results - indicated by lower mean absolute errors - at all sample sizes up to N=5000. As expected, falling fertility in Minas Gerais prior to 1991 leads to a tendency to underestimate fertility control m in DLB samples, as illustrated by the negative biases in the DLB column. This "contamination effect" is small, however. As one switches from BLY to DLB data, gains from decreased sampling variance overwhelm disadvantages of bias from "contaminated" samples that include fertility information from earlier years. The net gain is especially large when N=100 or 200, because in this range of sample sizes there is evidence that BLY estimators have positive small-sample biases, as well as high variance. (An asymmetry in the data causes the right-skewed distribution in the estimates: small samples in which births are far below population averages are more likely than small samples in which births are far above.)

The most important column of Table 5 is the rightmost, which displays the percentage of Monte Carlo samples in which the DLB estimate of m was closer than the standard BLY estimate to m*=1.036. The simulations show that despite small negative biases, the DLB estimate is approximately three times more likely to win this contest in any single sample of size N≤2000 (72-76% of samples, versus 24-28% for BLY), and approximately twice as likely to be closer to m* when N=5000 (67% versus 33%).

In sum, simulation results with census data from Minas Gerais 1986-1991 illustrate that replacing the standard, truncated BLY form of open-interval data with DLB information produces superior estimators of Coale-Trussell parameters in small to moderately sized samples such as those for the 1991 municipalities. In this particular case, as in the examples in the earlier paper [8], Monte Carlo evidence strongly suggests that DLB estimators yield better results. If the researcher's objective is to arrive at a sample estimate that is close to the population parameter, the benefits of decreased sampling variance with DLB data greatly exceed the small costs of increased bias. DLB is a far better bet to produce a good guess from a small sample.


Estimating Parametric Fertility Models
with Open Birth Interval Data
Carl P. Schmertmann
André Junqueira Caetano
© 1999 Max-Planck-Gesellschaft ISSN 1435-9871
http://www.demographic-research.org/Volumes/Vol1/5