Measuring US fertility using administrative data from the Census Bureau

BACKGROUND Longitudinal data available for studying fertility in the United States are not representative at the state level, limiting analyses of subnational variation in US fertility. The US Census Bureau makes available restricted data that may be used for measuring fertility, but the data have not previously been described for a scholarly audience or used for fertility research.
OBJECTIVE This paper describes and analyzes restricted-use administrative birth data available through the Census Numident for nearly all US births for more than the last century. Within these data, most births since 1997 are linked to parents through the Census Household Composition Key (CHCK). These analyses are designed to illustrate the scope and limitations of these data for the study of US fertility.
METHODS We describe the creation and content of the Census Numident and CHCK data sets and compare the data to published US vital statistics. We also analyze the geographic coverage of both data sets and compare the demographic composition of the new data sources to national demographic composition. We further illustrate how these novel data sources may be used by comparing them to survey responses at the individual level.
CONTRIBUTION This paper describes an underutilized source of national US data for studying fertility, shows the quality of these data by performing analyses, and explains how scholars can access these data for research.


Introduction
Since the family-building model came to prominence in the 1970s (Menken 1974; Trussell and Menken 1978), demographers have agreed that individuals' prior fertility is key to understanding their reproductive lives, but longitudinal data on individuals' childbearing in the United States are representative only at the national level. This is consequential, as efforts to examine fertility in the context of fertility delay and decline are increasingly focused on parity, or the number of births women have had (Hartnett and Gemmill 2020; Beaujouan and Berghammer 2019; Zeman et al. 2018). As state-level policies and conditions continue to hold substantial demographic salience, the absence of data facilitating comparisons across subnational geographies limits demographic research (Riley et al. 2021; Montez et al. 2020; Chetty et al. 2014). In this paper we describe data sources that might be used to fill this gap in US data on parents and children. The data we describe may be anonymously linked at the individual level using a Census Bureau-assigned key that links to most Census Bureau-administered surveys, increasing their utility for demographic research.
Restricted census, survey, and administrative data are available from the Census Bureau through the Federal Statistical Research Data Centers (FSRDCs) to researchers on approved projects. The data holdings change over time and these changes can be particularly substantial among data derived from administrative records. Administrative data in general often lack comprehensive documentation because their primary purpose is not academic research. Data of this type held by the Census Bureau are no exception, and the absence of documentation can present a barrier to researchers' knowledge of and ability to use the Census Bureau's substantial data holdings. This paper describes the Census Numerical Identification (Numident) and the Census Household Composition Key (CHCK) data files. The Census Numident and CHCK files are derived from the Social Security Administration (SSA) Numident, and they provide birth information and links between children and birth parents as reported on Social Security Number (SSN) applications. In addition, we analyze these data files to assess their quality and comparability to vital statistics data and survey data. We present results from these analyses so researchers understand the available data. We conclude by discussing the use of these restricted data for fertility research.

Data description

SSA Numident
The SSA uses the Numident to maintain records of Social Security Number (SSN) holders in the United States. While SSNs were created and issued starting in 1936, electronic tracking of SSN information in the SSA Numident began in 1972. All existing SSN information has been digitized and is included in the electronic SSA Numident file (Puckett 2009). The SSA Numident contains all recorded interactions individuals have with the SSA related to SSNs. Thus, it includes information on SSN applications, claim records, death reporting, and requested changes to SSN information. The SSA Numident now contains more than one billion transactions for approximately 518 million SSN holders (Finlay and Genadek 2021).
Prior to 1989, individuals or individuals' parents filled out the SSA application for a Social Security Card, Form SS-5, which included date of birth, place of birth, gender, race, citizenship status, parents' names, and parents' SSNs. Starting in 1989, the SSA entered into agreements with each state to enumerate individuals at birth. Now, when infants are born in hospitals and birthing centers, the parents are asked if they would like the birth certificate data to be transmitted to the SSA to create an SSN for the individual at birth. That information is given to the state's vital statistics office, and the vital statistics office sends the information from the birth certificate to the SSA to create a record for the infant and issue an SSN. Selected information from the birth certificate, including name, date of birth, place of birth, mother's name, mother's SSN, father's name, and father's SSN, is shared with the SSA. SSA publications suggest that more than 95% of births in the United States are assigned an SSN through this enumeration at birth (Puckett 2009). If parents do not elect to have their child enumerated at birth by the SSA, they can apply for an SSN through an SSA application office. Moreover, adoptive parents can apply for new SSNs for adopted children through the SSA prior to or following adoption, and these applications include the adoptive parents' information rather than the birth parents' information. 3
The Census Bureau obtains the SSA Numident data in quarterly updates from the SSA for the purposes of improving Census Bureau survey and decennial census data, performing record linkage, and using the data for research and statistical projects. While most information from the SSA Numident is included in this transfer, the Census Bureau does not receive the parents' SSN information from an individual's SSN application, although it does receive parents' names.
The Census Bureau creates two research files useful for measuring fertility by capturing birth information using the SSA Numident file. The first is the Census Numident file and the second is the CHCK.

Census Numident
The Census Bureau creates the Census Numident by processing quarterly updates from the SSA transaction-level data to create a person-level research file that includes the history of individual-level interactions with the SSA Numident. Like the SSA Numident, the Census Numident is a cumulative file. In the Census Numident the SSN is replaced with a Census Bureau Protected Identification Key (PIK), a unique anonymous identifier. Some other Personally Identifying Information (PII), including name, is removed from the Census Numident file. The resulting data file, with the PIK, is then made available to Census Bureau staff and external researchers for approved Census Bureau production and research projects.
The Census Numident includes one record per person who has received an SSN in the United States. The scope of information in an individual's Census Numident record varies based on when the individual received an SSN, how the individual applied for an SSN, and whether the individual has interacted with the SSA, such as for a name change. In general, most records include complete date of birth, place of birth, and sex. The universe for this file is all individuals receiving an SSN, so unlike the birth records from birth certificates in the United States, it includes people born outside of the United States who apply for an SSN. However, place of birth is obtained for all SSN applicants, so records can be limited to individuals born in the United States.

Census Household Composition Key (CHCK)
In addition to the Census Numident, the Census Bureau creates the CHCK files. These files are crosswalks of individuals aged 0-19 with a PIK linked to their mother's and father's PIKs. The file also includes the child's exact birth date as reported to the SSA. This is not the same file as the SSA KIDLINK database (or Internal Revenue Service (IRS) research file DM-2) which uses parents' SSNs on the child's SSN application to directly link parents and children. 4 Without the SSNs of parents, the Census Bureau assigns PIKs to the parents in the child's Numident record using the Person Identification Validation System (PVS), which probabilistically assigns PIKs to respondents in surveys generally by matching information in the survey to a composite reference file with PIKs (Wagner and Layne 2014). In this case, PVS is used to assign PIKs to the parents of the children in the Census Numident based on the parents' reported names (Luque and Wagner 2015). In addition to using the names, the child and parent pair in the Census Numident must be confirmed at the same address within the PVS reference file or the decennial census. This coresidential requirement is necessary to limit and refine the linkages based on names alone. The PVS reference file addresses are extracted from trusted federal administrative records, which have been previously processed through the PVS system at the Census Bureau. Detailed information on the creation of the CHCK file is documented in Luque and Wagner (2015), which describes the creation of a preliminary version of the CHCK file (at the time called Census Kidlink) using the 2007 Census Numident.
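The two-step logic described above, a name match that must then be confirmed by a shared address, can be illustrated with a toy sketch. This is not the PVS algorithm, which is documented in Wagner and Layne (2014) and Luque and Wagner (2015). The reference records, the field names, the string-similarity measure, and the 0.85 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference file: one row per person with an anonymous key,
# a name, and an address identifier. All values are invented.
REFERENCE = [
    {"pik": "P001", "name": "maria lopez", "address_id": "A10"},
    {"pik": "P002", "name": "maria lopes", "address_id": "A77"},
    {"pik": "P003", "name": "john smith", "address_id": "A10"},
]


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def link_parent(reported_name, child_address_id, reference=REFERENCE, threshold=0.85):
    """Return the key of the best name match at the child's address, else None.

    The threshold is an assumed value, not one used by the Census Bureau.
    """
    best_pik, best_score = None, threshold
    for person in reference:
        if person["address_id"] != child_address_id:
            continue  # coresidence requirement: skip people at other addresses
        score = name_similarity(reported_name, person["name"])
        if score >= best_score:
            best_pik, best_score = person["pik"], score
    return best_pik


print(link_parent("Maria Lopez", "A10"))  # exact name at the child's address
print(link_parent("Maria Lopez", "A99"))  # no candidate at this address
```

In this sketch the address filter does the work that the coresidence requirement does in the text: two similar names at different addresses are never confused, at the cost of leaving a child unlinked when no candidate shares an address.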
The CHCK file is not cumulative. Instead, yearly versions of the CHCK are created based on vintages of the Census Numident. For each vintage year of the Numident, the corresponding CHCK file includes parent links for observations aged 0-19. The first CHCK file is available for Census Numident vintage 2016, and thus the births start in 1997. The CHCK file for 2019 includes birth counts complete through 2018. 5

Birth counts compared to vital statistics
To assess the quality and completeness of the birth records obtained through the Census Numident we compare the United States-born individuals in the Census Numident to the births occurring in the United States published by the Centers for Disease Control and Prevention's (CDC) National Vital Statistics System (NVSS). 6 Table 1 shows yearly birth counts starting in 1910 based on birthdates for all individuals in the Census Numident (Column 1) and yearly counts for those born within the United States (Column 2). 7 In the most recent years, nearly all births recorded in the Census Numident occur in the United States. Also included in Table 1 is the total count of yearly births occurring in the United States obtained from the CDC's NVSS. 8 The number of yearly births in the Census Numident is very close to the counts reported by the NVSS, which is especially expected starting in 1989 because enumeration at birth is closely tied to birth certificates. However, even prior to 1989, the Numident captures just slightly more births than published through the NVSS back to 1970. This is shown clearly in Column 4, which shows the proportion of US births in the Census Numident compared to the NVSS. The slight difference, with more births found in the Census Numident than the vital statistics, is potentially the result of a number of factors, including inaccurate place of birth information reported to SSA, and some US births outside of hospitals without birth certificates being excluded from the vital statistics counts. Prior to birth year 1969, the Numident generally contains fewer births than reported by vital statistics, though the ratio is near or above 0.90 prior to 1920. 9 At the national level, the birth data in the Census Numident look complete and comparable to the birth reports from the NVSS.
To further understand the coverage of the Numident birth data, we count births by state of occurrence between 2009 and 2018 using the place of birth information in the Census Numident and compare them to the published births by state of occurrence from the CDC NVSS. 10 Table 2 shows the state-level coverage of the Census Numident birth information and includes counts of births for all US territories combined. 11 There is minimal variation in state-level coverage of births by the Census Numident: the ratio of Census Numident births to CDC NVSS births ranges from 0.994 in Wisconsin to 1.052 in Maryland, with 26 states between 0.999 and 1.001. While we present results for state-level births, detailed place of birth is also included in the Census Numident.
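The coverage measure used in these comparisons is a simple ratio of Census Numident births to NVSS births. A minimal sketch of the calculation follows; the state labels and counts are invented for illustration and are not taken from Table 1 or Table 2.

```python
def coverage_ratio(numident_births, nvss_births):
    """Ratio of Census Numident births to NVSS births for one state or year."""
    if nvss_births <= 0:
        raise ValueError("NVSS birth count must be positive")
    return numident_births / nvss_births


# Invented counts for illustration only.
counts = {
    "State A": {"numident": 99_400, "nvss": 100_000},
    "State B": {"numident": 52_600, "nvss": 50_000},
    "State C": {"numident": 75_030, "nvss": 75_000},
}

ratios = {
    state: round(coverage_ratio(c["numident"], c["nvss"]), 3)
    for state, c in counts.items()
}

# States whose ratio falls within 0.001 of exact agreement.
near_parity = [state for state, r in ratios.items() if 0.999 <= r <= 1.001]

print(ratios)
print(near_parity)
```

A ratio above 1 means the Census Numident records more births for that place than the vital statistics do, and a ratio below 1 means it records fewer.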

Analyses of children linked to parents
We combine four CHCK files by starting with the 2016 version and adding any additional births that appear in each successive file through the 2019 version. We keep one child-mother link and/or one child-father link if a child is linked to different mothers and fathers across years. 12 Table 3 shows the birth counts and parental linkages for each birth occurring in at least one of the CHCK files, covering birth cohorts of 1997-2019. Parental linkages improve as time progresses over the first few years after a birth because the parent-child pair must be confirmed at an address in the PVS reference file or the decennial census, a requirement which is difficult to meet immediately after a birth because there is often a delay in the infant appearing in the administrative records. As shown in Table 3, only 80% of the births in 2018 and about 88% of births in 2017 are linked to any parent. Thus, the linkage rates of future CHCK versions will increase for children born in 2017 and 2018, though linkage rates in the most recent birth years will always be slightly lower than earlier years. In all of the birth cohorts prior to 2017, an average of 94.5% of all births are linked to at least one parent. The parental linkage rates for the birth cohorts of 1997-2016 are slightly higher for those born within the United States, 95.6%. Table 3 also shows the percentage of children in the CHCK linked to a mother, linked to a father, or linked to both, by birth year. In most years, about 15% of children are linked to only a mother, while about 2.5% are linked to just a father, and the remaining 82.5% are linked to two parents. 13 These parent linkages are based on the names on the SSN application and documented coresidence with a parent.
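The combination of yearly CHCK files and the resulting linkage rates can be sketched as follows. The record layout and key names are hypothetical. The text does not say which link is retained when a child has different parent links across vintages, so this sketch assumes the first non-missing link seen is kept and that a missing link may be filled from a later vintage.

```python
def combine_vintages(vintages):
    """Merge child-level records across vintages, earliest vintage first.

    `vintages` maps a vintage year to a list of dicts with the hypothetical
    keys "child", "mother", and "father" (None when no link exists).
    Assumption: the first non-missing mother or father link is kept.
    """
    combined = {}
    for year in sorted(vintages):
        for rec in vintages[year]:
            kept = combined.setdefault(rec["child"], {"mother": None, "father": None})
            for parent in ("mother", "father"):
                if kept[parent] is None:
                    kept[parent] = rec[parent]
    return combined


def linkage_shares(combined):
    """Share of children linked to at least one parent and to both parents."""
    n = len(combined)
    any_parent = sum(1 for r in combined.values() if r["mother"] or r["father"])
    both = sum(1 for r in combined.values() if r["mother"] and r["father"])
    return {"births": n, "any_parent": any_parent / n, "both": both / n}


# Invented records for illustration only.
example = {
    2016: [
        {"child": "c1", "mother": "m1", "father": "f1"},
        {"child": "c2", "mother": "m2", "father": None},
    ],
    2017: [
        {"child": "c2", "mother": "m9", "father": "f2"},  # conflicting mother link
        {"child": "c3", "mother": None, "father": None},
    ],
    2018: [
        {"child": "c3", "mother": "m3", "father": None},
        {"child": "c4", "mother": None, "father": None},
    ],
}

combined = combine_vintages(example)
print(combined["c2"])
print(linkage_shares(combined))
```

Under a different retention rule, such as keeping the most recent link, only the inner assignment would change; the linkage shares are computed the same way.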
While some of the two-parent linkages are missing due to issues with the probabilistic name matching and coresidence with a parent, SSN applications do not always include information for both parents, as fathers' names are often not included on birth certificates. 14 The children missing links to their parents in the CHCK file are not expected to be random. The linkage of children to parents in the CHCK file is first limited to parents that have been assigned a PIK. If a child is born in the United States to a parent that has not been assigned a PIK (they do not have an SSN or an Individual Taxpayer Identification Number (ITIN)), it will not be possible to link them together. Linkages will also not be made when the parents' names in the SSN application are inaccurate or the probabilistically matched parent-child pair could not be confirmed at a location in the PVS reference file. Finally, the children may not be coresiding with the parent whose name is listed on the birth certificate or given to the SSA. Thus, we anticipate biases in the CHCK data when compared to the overall national population. Table 4 shows basic demographic characteristics (sex, race/ethnicity, birthplace) for those born between 1997 and 2018 linked to at least one parent in the CHCK, in the full Census Numident, and in the weighted 2019 1-year American Community Survey (ACS) Public Use Microdata (PUMS) (Ruggles et al. 2021). The weighted ACS PUMS is nationally representative, and thus provides the national comparison.
The three data sets have similar proportions of men and women, but the race/ethnicity breakdown is slightly different. For those with parent links in the CHCK, 54.61% are White non-Hispanic, while 22.69% are Hispanic. The full Census Numident is similar, with 53.65% of the respondents being White non-Hispanic and 23.89% being Hispanic; however, when we look at the weighted 2019 ACS, which is the nationally representative estimate, 50.79% of these birth cohorts are White non-Hispanic and 24.84% are Hispanic. There are smaller yet similar differences in most of the other non-White groups (Asian, Black, and Other), where the ACS has a larger percentage of the weighted total than the CHCK or the Numident.
In addition to demographic variation in the linkage of children to parents in the CHCK, there is also geographic variation. Figure 1 shows a map of the United States with state-level parent-child linkage rates from the CHCK data. The darkest areas on the map are states where the proportion of births linked to parents is between 0.935 and 0.97, while the lightest states are between 0.83 and 0.865. Similar to PIK rates in general (Rastogi et al. 2012), states in the southwest have the lowest linkages between children and parents. This is likely due to fewer parents in these states having SSNs and ITINs than in other states.
13 The total births in Table 3 are not identical to the total births in the Census Numident reported in Table 1; this is due to the variation in vintages of the Census Numident used to create each CHCK file. We use the 2020Q3 vintage of the Census Numident in Table 1, which is also more recent than the 2016-2019 CHCK files.
14 Legal parents of the same sex can have both names on an SSN application (https://www.ssa.gov/people/same-sexcouples/), but the 2016-2019 CHCK files limit the mother and father links by sex.

Comparing administrative fertility data to survey data
The Census Numident provides an administrative record of births based on birth certificates since 1989, and the CHCK files include probabilistic links between children and parents listed on birth certificates since 1997. This rich data source on births is a near complete record of all births occurring in the United States, and parent links are made to around 90% of the births since 1997. In order to further analyze the birth and parent links contained in the CHCK file, we linked all respondents born after 1997 and under age 19 at the time of the 2005 through 2019 1-year ACS surveys to the CHCK file using the Census Bureau-assigned PIK. Table 5 shows the total number of children meeting the age and birth year criteria by year of the ACS. Of those in the universe, it also shows the total number and percentage that were assigned a PIK. Approximately 85%-92% of the children in the ACS were assigned a PIK and were thus eligible to be linked to the CHCK. Panel A of Table 5 contains estimates of children linked to mothers. Column 4 shows the total number of children that were linked to a mother in the CHCK. Nearly 95% of children with a PIK had a mother indicated in the CHCK. Column 6 shows the number of these children that reside in the ACS household with the mother, as indicated in the CHCK. Approximately 80%-85% of children with PIKs in the ACS reside with the mother assigned to them in the CHCK. When we look at those linked to a mother in the CHCK, about 85%-90% are living with the mother indicated in the CHCK at the time of the ACS. While this suggests there may be error in the assignment of mothers to children in the CHCK, this result also shows the universe of the CHCK file, in which the mother's information is coming from the SSA Numident via a birth certificate and children may not always reside with that mother. 15 Panel B of Table 5 shows the same estimates for fathers. 
A smaller percentage of children in the CHCK are linked to a father than to a mother, and the percentage of linked children who are residing with the father indicated in the CHCK is about 10% less than for mothers. This is expected, as many children are born to mothers without a father present, and children are more likely to reside with the mother if the parents do not live together (Smock and Schwartz 2020).
15 In a small number of cases, parental information comes from the SSA's application for a Social Security Card (Form SS-5).
The CHCK file is organized at the child level, yet it is possible to use the data with the parents as the primary unit of analysis. Specifically, the data can be reshaped to focus on mothers, with their children and children's birthdates from the CHCK indicating births to the woman. Using the data in this way allows for the study of birth parity. We focus on mothers with births in the previous year in the CHCK and then link these mothers to the ACS. The ACS asks women between the ages of 15 and 50 if they gave birth in the past twelve months. In addition to using the ACS to look at children linked to their parents, we analyze women of reproductive age in the ACS and their response to this fertility question, the children residing in their household, and their link to children in the CHCK. Table 6 presents the results of these comparisons. Column 1 shows the total number of ACS respondents in the universe for the fertility question, or women of reproductive age (ages 15-50) in each year of the ACS since 2005. The number of those indicating they gave birth in the last year is shown in Column 2, and is generally around 5%, as shown in Column 3. About 74%-81% of the women indicating they gave birth in the last year in the fertility questions also had a child under age 1 living with them in their ACS household at the time of the survey (these percentages are shown in Column 4). Although not shown in Table 6, an additional 7.5%-9.0% of women indicating they gave birth in the previous year in the ACS have a child of age 1 living with them but do not have a child under age 1 living with them.
Thus, close to 90% of the women who gave birth in the last year are living with a baby at the time of the ACS, and the other 10% of women are either not living with the infant they birthed in the past year or there is misreporting in their fertility status or the age of their children. 16 The next panel of Table 6 shows similar information, but these columns present the percentage of reproductive-age women in the ACS that gave birth in the past year according to the CHCK, rather than the ACS fertility question. The percentage of women with a CHCK birth in the last year (Column 6) is slightly lower than the ACS fertility question, ranging between 3.55% and 4.28%, though a larger percentage of these women are residing in the household with the CHCK-linked child at the time of the ACS than women indicating they gave birth in the last year in the survey question (Column 7).
In the third panel of Table 6 we limit the sample to women who indicated they had a birth in the last year in the ACS and had a birth according to the CHCK. We find that between 3.23%-3.86% of reproductive-age women in the ACS in a given sample year gave birth according to both data sources. As shown in the final column, between 63.41%-71.55% of those that reported a birth in the ACS also gave birth based on linkages in the CHCK.
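The mother-level use of the CHCK described in this section, reshaping child records into births per mother and then comparing them with the survey's fertility question, can be sketched as follows. All record layouts, keys, and values are hypothetical, and a single interview date is assumed for every respondent, whereas the analysis above bases the 12-month window on the date each ACS questionnaire was completed.

```python
from collections import defaultdict
from datetime import date


def births_by_mother(child_records):
    """Group child birth dates by mother key, sorted so order gives birth order."""
    births = defaultdict(list)
    for rec in child_records:
        if rec["mother"] is not None:
            births[rec["mother"]].append(rec["birth_date"])
    return {mother: sorted(dates) for mother, dates in births.items()}


def parity_on(births, mother, as_of):
    """Number of linked births to `mother` on or before `as_of`."""
    return sum(1 for d in births.get(mother, []) if d <= as_of)


def one_year_before(d):
    """Same calendar day one year earlier (29 February maps to 28 February)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def birth_in_prior_12_months(births, mother, interview):
    """True if any linked birth falls in the 12 months ending at the interview."""
    start = one_year_before(interview)
    return any(start < d <= interview for d in births.get(mother, []))


def share_confirmed(births, survey_reports, interview):
    """Among women reporting a birth in the survey, share with a linked birth."""
    reporters = [m for m, said_yes in survey_reports.items() if said_yes]
    if not reporters:
        return None
    confirmed = sum(birth_in_prior_12_months(births, m, interview) for m in reporters)
    return confirmed / len(reporters)


# Invented child-level records and survey answers for illustration only.
children = [
    {"child": "c1", "mother": "m1", "birth_date": date(2010, 3, 1)},
    {"child": "c2", "mother": "m1", "birth_date": date(2018, 11, 15)},
    {"child": "c3", "mother": "m2", "birth_date": date(2015, 6, 30)},
    {"child": "c4", "mother": None, "birth_date": date(2018, 1, 1)},
]
survey_reports = {"m1": True, "m2": True, "m3": True, "m4": False}
interview = date(2019, 6, 1)

births = births_by_mother(children)
print(parity_on(births, "m1", interview))
print(share_confirmed(births, survey_reports, interview))
```

Parity computed this way counts only births that are linked to the mother in the child-level file, so births before the first covered cohort or births without a mother link are not counted.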
Comparing the CHCK birth information to that in the ACS provides insight into what the data capture. A limitation of many administrative records is their inability to measure US residents who do not have SSNs and ITINs, and this issue is present in these data. There are residents in the United States captured by surveys like the ACS who are not captured in our administrative records. However, for those captured by the Census Numident and CHCK, our ability to observe most of the linked children and parents residing together in the survey data demonstrates that the assignment of children to parents is of high quality.

Using Census Bureau data for fertility research
We have shown that the counts of births in the restricted-use Census Numident are similar to those from vital statistics. While the Census Numident includes all births assigned SSNs in the United States and the vital statistics include all births occurring in the United States, when we limit the Census Numident to births occurring in the United States the counts are very similar, even at the state level. The CHCK data, which provide linkages between children and their parents at the time of birth from 1997 onward, make the birth records in the Census Numident more useful for research. We find that over 90% of births are linked to at least one parent.
16 The ACS fertility question states, "Has this person given birth to any children in the past 12 months?" It also includes the following in the instructions: "Mark the 'Yes' box if the person has given birth to at least one child born alive in the past 12 months, even if the child died or no longer lives with the mother. Do not consider miscarriages, or stillborn children, or any adopted, foster, or stepchildren." We calculate the previous 12 months based on the date the ACS questionnaire was completed. Future work will investigate the 10% of women with reported births who do not coreside with an infant in the ACS.
The Census Numident and CHCK data are an excellent resource for research on US fertility. These restricted-use data are available through the Census Bureau and the FSRDC network, providing an opportunity for detailed analyses of fertility and the family in the United States. The FSRDC network currently includes 31 physical research centers at universities and research institutions, and many researchers are now accessing the data through the network virtually. 17 All research is performed within the restricted environment, and all results are reviewed before release to ensure the confidentiality of respondents. Researchers from any institution can apply to use the Census Numident and CHCK data through the standard Census Bureau FSRDC application procedures, starting by reviewing the research proposal process documentation and by contacting the closest physical FSRDC. 18 Within the Census Bureau's Data Linkage Infrastructure, the research possibilities grow when Census Numident and CHCK data are linked at the individual level to other data held at the Census Bureau. Administrative and survey data with detailed household location information, combined with the detailed place of birth information in the Census Numident, allow for substate analyses of births not possible with most fertility data. It is also possible for researchers to measure parity and estimate fertility by parents' characteristics for nearly all births occurring in the United States using the CHCK linked to survey and administrative data. We link the CHCK data with ACS microdata, finding that substantial numbers of linked parent-child pairs are living together shortly after the child's birth.
While these rich data present robust opportunities for research, our linkage between CHCK and the ACS illustrates, in a small way, how using linked administrative and survey data can generate analytic challenges, since not all women who reported giving birth in the previous year in the ACS were assigned a birth in the previous year in the CHCK. However, with careful research design, these data can provide a new source of longitudinal, nearly full-count data on fertility in the United States.

Author Manuscript
Genadek et al.