2 Statistical Background
2.1 DLB Data
To fix ideas, consider the hypothetical sample in Table 1. Suppose that a survey is taken at time t, and that each of six women reports her age (a) and the number of months since her last live birth. These data appear on the left-hand side of the table. In many data sets (including the public use samples of the Brazilian census that we use later in the paper), DLB data are available only in integer-truncated form; this version appears in the table in the "Years" column. From this point forward we treat the number of years, rather than the number of months, as the DLB data observed by the researcher.
Fertility information from too far in the past may be unrepresentative of current patterns, so it is desirable to restrict the analysis to the relatively recent past. The researcher can do this by considering only woman-years lived within T years of the survey, where T is chosen by weighing the benefit of a larger sample against the cost of possible biases (see [8]). In Table 1 we use T=5.
The column labeled (u) displays the number of complete years since last birth, truncated at the upper limit of T=5. This variable appears in many subsequent calculations. By construction u ∈ {0,1,...,T}, and the researcher observes MIN(T,u+1) woman-years from each individual sampled. The set of columns under the heading "Five-Year History" illustrates the available fertility histories from this sample. Each of the five right-hand columns corresponds to a one-year period. Cells for which histories are known contain the woman's age at the end of the year; other cells are blank. Woman-years that include births are emphasized with bold face and brackets. The column labeled (d) contains a dummy indicator, equal to one if the woman had a birth within the five-year period, and equal to zero otherwise.
All women in the sample contribute one person-year of exposure to the rightmost column, corresponding to the period (t-1,t]. Each woman's age at the end of this period is simply a, her age on the survey date. All women with u ≥ 1 also contribute information about fertility in the period (t-2,t-1], all women with u ≥ 2 contribute information about (t-3,t-2], and so forth.
Standard methods for deriving age-specific fertility rates from a sample like that in Table 1 use only the rightmost column. For each age group, the researcher sums the births in the past year only, and divides by the number of women in that age group on the survey date. For Table 1 these calculations are trivial. Because only one birth is recorded in the year before the survey (to woman #6), all estimated fertility rates would be zero, except f_{30-34}=0.50. This is of course an unrealistic age schedule for fertility, mainly because the sample is so small. As Table 1 makes clear, however, there is considerably more fertility information embedded in the (a,u,d) data than the last year alone reveals.
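The construction of u, d, and the woman-year history described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the four records are hypothetical (they are not the data of Table 1), the names `woman_years` and `tally` are invented for the example, and the sketch assumes that d=1 exactly when the reported interval is shorter than T years and that any birth falls in the earliest observed woman-year, at age a-u.

```python
from collections import Counter

T = 5  # length of the retrospective window in years, as in the text

# Hypothetical records, NOT the data of Table 1:
# (age at survey, complete years since last birth, or None if no birth reported)
sample = [(27, 0), (31, 3), (24, 7), (36, None)]


def woman_years(age, years_since_birth, window=T):
    """Return (u, d, history) for one woman.

    history lists (age at end of year, birth flag), most recent year first.
    """
    # u: complete years since last birth, truncated at the window length
    u = window if years_since_birth is None else min(window, years_since_birth)
    # d: 1 if the last birth falls inside the window (assumption: u < window)
    d = 1 if u < window else 0
    n_years = min(window, u + 1)  # observed woman-years, MIN(T, u+1)
    history = []
    for k in range(n_years):
        # the birth, if any, belongs to the earliest observed year (k == u)
        birth = 1 if (d == 1 and k == u) else 0
        history.append((age - k, birth))
    return u, d, history


def tally(records, last_year_only=False):
    """Count births and woman-years by five-year age group."""
    births, exposure = Counter(), Counter()
    for age, years in records:
        _, _, history = woman_years(age, years)
        if last_year_only:
            history = history[:1]  # keep only the period (t-1, t]
        for age_end, birth in history:
            group = 5 * (age_end // 5)  # lower bound of the age group
            exposure[group] += 1
            births[group] += birth
    return births, exposure


for flag in (True, False):
    b, y = tally(sample, last_year_only=flag)
    label = "last year only" if flag else "full window"
    print(label, "births:", sum(b.values()), "woman-years:", sum(y.values()))
```

Setting `last_year_only=True` reproduces the standard approach that uses only the most recent year, so the two printed lines show how much exposure the retrospective window adds for the same respondents.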
In Table 1, the rightmost column, corresponding to the year before the survey, contains 6 woman-years and 1 birth. In contrast, the available information from the same women for the last T=5 years contains 20 woman-years and 4 births.
2.2 Estimating a Simple Fertility Schedule with DLB Data
The previous paper [8] showed that using the expanded sample of woman-years from a DLB data set can substantially reduce the sampling variance of estimators for small samples. Furthermore, if the researcher restricts the sampling period to five or fewer years before the survey (as in Table 1 above), potential biases caused by unobserved heterogeneity and time trends in fertility rates appear to be small.
For one simple fertility schedule - a piecewise-constant function with no restrictions on the pattern of fertility across different age groups - estimation with DLB data is extremely simple. Specifically, for the fertility schedule
f(a) = λ_{g} if age a belongs to age group g,  g = 1...G    {2}
the previous paper [8] demonstrated that maximum likelihood estimates for parameters λ_{1}...λ_{G} from DLB data are
\hat{λ}_{g} = B_{g} / Y_{g},  g = 1...G    {3}
where B_{g} and Y_{g} are the counts of births and woman-years, respectively, for age group g. As an example, in the five-year DLB information in Table 1, B_{25-29}=2 births (women #1 and #2 each had a birth in this age group) and Y_{25-29}=5 (1 year each from women #1 and #5, and 3 years from woman #2). The estimate of λ_{25-29} is therefore 2/5 = 0.40, compared with the estimate of zero from the last-year-only data.
The estimators in {3} are simple and familiar. However, they are somewhat counterintuitive, because much of the measured exposure (Y) occurs after, rather than before, the measured events (B). Despite this inversion of the usual order of time, the maximum likelihood estimators are still familiar-looking event/exposure ratios. Allison [1] showed that similar counterintuitive results hold for backward recurrence times (times since last event) in many stochastic models.
2.3 Summary Indices for Piecewise-Constant Models
One very important aspect of {3} is that maximum likelihood estimation requires only a handful of summary indices (B_{1}...B_{G},Y_{1}...Y_{G}), rather than the full set of individual-level DLB data. This section demonstrates that this is a property of any fertility model in which the age schedule is piecewise-constant across age groups, a fact that simplifies the estimation of many parametric models, including Coale-Trussell.
For any model with piecewise-constant rates, fertility at age a may be written as:
f(a) = Σ_{g=1}^{G} λ_{g} I_{g}(a)    {4}
where λ_{g} is the model's fertility level for age group g, and I_{g}(a) is an indicator function equal to 1 if age a belongs to group g, and equal to zero otherwise. The age-group rates λ_{1}...λ_{G} may be unrestricted, as in {2}, or they may be required to conform to some parameterized schedule λ_{g}=λ_{g}(θ). The important point for the exposition is that fertility levels are identical at all ages within each group.
From [8], the log likelihood for an individual observation (a_{i},u_{i},d_{i}) is
ln L_{i} = d_{i} ln f(a_{i}-u_{i}) - ∫_{a_{i}-u_{i}}^{a_{i}} f(x) dx    {5}
The first term in this equation corresponds to the probability of a birth at age a_{i}-u_{i} (if d_{i}=1 and a birth was reported), and the second term corresponds to the probability of surviving without a birth over the age interval (a_{i}-u_{i},a_{i}). When the model fertility schedule is piecewise-constant, this may be rewritten in terms of age groups as
ln L_{i} = Σ_{g} [ d_{i} I_{g}(a_{i}-u_{i}) ln λ_{g} - λ_{g} ∫_{a_{i}-u_{i}}^{a_{i}} I_{g}(x) dx ]    {6}
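Summing over observations i yields the sample log likelihood
ln L = Σ_{g} [ ( Σ_{i} d_{i} I_{g}(a_{i}-u_{i}) ) ln λ_{g} - λ_{g} ( Σ_{i} ∫_{a_{i}-u_{i}}^{a_{i}} I_{g}(x) dx ) ]    {7}
or, more intuitively,
ln L = Σ_{g} [ B_{g} ln λ_{g} - λ_{g} Y_{g} ]    {8}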
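A small numerical check may help here. The sketch below assumes the summary form of the log likelihood, the sum over age groups of B_{g} ln λ_{g} - λ_{g} Y_{g}, and confirms that the event/exposure ratios B_{g}/Y_{g} score higher than nearby alternatives. The 25-29 counts (2 births, 5 woman-years) are the ones quoted in the text; the other two groups and the function name `log_lik` are hypothetical.

```python
import math


def log_lik(lam, births, exposure):
    """Summary-form log likelihood: sum over g of B_g ln(lam_g) - lam_g Y_g."""
    total = 0.0
    for g in births:
        # a group with no births contributes only its exposure term
        if births[g] > 0:
            total += births[g] * math.log(lam[g])
        total -= lam[g] * exposure[g]
    return total


# 25-29 uses the counts quoted in the text; the other groups are hypothetical
B = {"20-24": 3, "25-29": 2, "30-34": 1}
Y = {"20-24": 12, "25-29": 5, "30-34": 8}

# Event/exposure ratios, the candidate maximum likelihood estimates
lam_hat = {g: B[g] / Y[g] for g in B}
best = log_lik(lam_hat, B, Y)

# Scaling any single rate up or down should lower the log likelihood
for g in B:
    for factor in (0.5, 0.9, 1.1, 2.0):
        trial = dict(lam_hat)
        trial[g] = lam_hat[g] * factor
        assert log_lik(trial, B, Y) < best

print(lam_hat)
```

Only the group totals enter `log_lik`, so the same function serves any piecewise-constant model: a parametric schedule simply supplies the `lam` values from its own parameters.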
The derivation of {8} shows that the indices (B_{1}...B_{G},Y_{1}...Y_{G}) contain all the information necessary for maximum likelihood estimation of any piecewise-constant fertility model from last-birth data. Event and exposure totals for each age group are sufficient summaries of the observable fertility histories.
2.4 Poisson Estimation
Equation {8} is closely related to the Poisson distribution. The natural logarithm of the probability that a Poisson process with rate λ generates B events in Y years is
ln P(B | λ,Y) = B ln λ - λY + C    {9}
where C = B ln Y - ln B!. Except for the C terms (which do not vary with λ), the log likelihood in {8} is a sum of the logs of G Poisson probabilities, one per age group. Thus, for any individual-level model with piecewise-constant fertility levels, one can calculate maximum likelihood estimators for parameters by treating the aggregate-level DLB data as if they had the distributions
B_{g} ~ Poisson(λ_{g} Y_{g}),  g = 1...G    {10}
This estimation procedure, derived here for open-interval DLB data, is identical to that used by Broström [3] for standard fertility data.
2.5 Discussion
The distributional result in {10} leads us to one of this paper's main points. A researcher using DLB data may use standard methods for estimating rates or fertility parameters. Estimation procedures for open-birth interval (DLB) and last-year (BLY) data differ only in the manner in which the data sets are assembled, not in the quantitative methods used. DLB data require no special statistical techniques, despite the unusual sampling scheme that generates the DLB versions of Y_{g}, B_{g}, or other data summaries.
This conclusion held for the simple model presented in [8] (Equations {2} and {3}), and the exposition here shows that it is equally true for any fertility model in which rates are a function of age group. Furthermore, because age groups may be arbitrarily narrow, we expect (although we have not formally proven it here) that the main result - i.e., that appropriate estimation methods are identical for BLY and DLB data - also applies to models in which f(a) is a continuous function of exact age.
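The Poisson equivalence behind this conclusion can be checked numerically. The sketch below is illustrative only, with invented function names. It compares the log probability of B events for a Poisson variable with mean λY against the event/exposure term B ln λ - λY, and confirms that the gap equals C = B ln Y - ln B! whatever value of λ is used. The counts are those quoted in the text for ages 25-29, and the three trial rates are arbitrary.

```python
import math


def poisson_log_pmf(b, mean):
    """ln P(B = b) for a Poisson variable with the given positive mean."""
    return b * math.log(mean) - mean - math.lgamma(b + 1)


def event_exposure_term(b, y, lam):
    """Contribution of one age group to the summary log likelihood."""
    return b * math.log(lam) - lam * y


def constant_term(b, y):
    """C = B ln Y - ln B!, which does not involve the rate."""
    return b * math.log(y) - math.lgamma(b + 1)


b, y = 2, 5  # births and woman-years quoted in the text for ages 25-29

gaps = []
for lam in (0.1, 0.4, 0.9):  # arbitrary trial rates
    gap = poisson_log_pmf(b, lam * y) - event_exposure_term(b, y, lam)
    gaps.append(gap)
    print(lam, gap, constant_term(b, y))
```

Because the gap does not move with the rate, maximizing a Poisson likelihood on the group totals and maximizing the summary log likelihood select the same parameter values. In practice this is what allows standard Poisson regression software, with log exposure as an offset, to be applied to DLB totals.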
Estimating Parametric Fertility Models with Open Birth Interval Data
Carl P. Schmertmann, André Junqueira Caetano
© 1999 - 2000 Max-Planck-Gesellschaft
ISSN 1435-9871
http://www.demographic-research.org/Volumes/Vol1/5