author: Matt Nelson
title: The decline of patrilineal kin propinquity in the United States, 1790–1940 


**********DATA ACCESS OVERVIEW**********


**To Download the Data**
	To request restricted versions of the IPUMS data, interested users will need their institution to submit a data license application. Interested users should contact ipumsres@umn.edu for more information. The data are currently free of charge, but this may change in the future.

	The license requires users to submit a data security plan on protecting the data, a research project application, and a signed researcher agreement form.

	County boundaries can be downloaded from NHGIS. The 2000 TIGER/Line boundary files for 1790-1940 (excluding 1890) were downloaded and loaded into ArcGIS. These data were joined in ArcGIS using the variable COUNTYNHG (when available) in the IPUMS-USA data and GISJOIN2 in the NHGIS data.
	
	NOTE: The data are continually updated. Data downloaded after May 2020 may not reproduce the results exactly because of updates made by IPUMS. For approved restricted data users, IPUMS will preserve the original data used in this analysis so that the results can be recreated as necessary.

**Programs and versions used for analysis**
	Stata 16.0 (syntax attached)
	ArcMap 10.4.1 (used to create Figures 8-10)
	Excel 16.0.4993.1002 (used to create remaining Tables and Figures, attached)
	
**********FOLDER STRUCTURE**********


dem_res
	data
		1790-Data for 1790 is stored here
		1800-Data for 1800 is stored here
		1810-Data for 1810 is stored here
		1820-Data for 1820 is stored here
		1830-Data for 1830 is stored here
		1840-Data for 1840 is stored here
		1850-Data for 1850 is stored here
		1860-Data for 1860 is stored here
		1870-Data for 1870 is stored here
		1880-Data for 1880 is stored here
		1900-Data for 1900 is stored here
		1910-Data for 1910 is stored here
		1920-Data for 1920 is stored here
		1930-Data for 1930 is stored here
		1940-Data for 1940 is stored here
	logs-Any logs written out are stored here
	shapefiles
		1790-Shapefile for 1790 is stored here
		1800-Shapefile for 1800 is stored here
		1810-Shapefile for 1810 is stored here
		1820-Shapefile for 1820 is stored here
		1830-Shapefile for 1830 is stored here
		1840-Shapefile for 1840 is stored here
		1850-Shapefile for 1850 is stored here
		1860-Shapefile for 1860 is stored here
		1870-Shapefile for 1870 is stored here
		1880-Shapefile for 1880 is stored here
		1900-Shapefile for 1900 is stored here
		1910-Shapefile for 1910 is stored here
		1920-Shapefile for 1920 is stored here
		1930-Shapefile for 1930 is stored here
		1940-Shapefile for 1940 is stored here
	syntax
		1-read_data-Syntax files to read in data stored here
		2-create_surname_links-Syntax files to create kin links stored here
		3-nhgisjoin-Syntax files to create NHGIS join code stored here
		4-living_arrangements-Syntax file to create living arrangements for elderly persons stored here
		5-pop_density-Syntax file to create county population density stored here
		6-standardize-Syntax file to standardize all files stored here
		7-imp_files-Syntax files to create impute files stored here
		8-adj_rates-Syntax files to adjust kin propinquity rates stored here
			states-This folder stores state specific syntax files that are read in from the file 8-adj_rates_cs.do
		9-analysis-Syntax files to analyze data to create tables and figures stored here
	tables-Excel file and ArcGIS file for creating tables/figure stored here
	temp_data-Temporary storage for logs, data, etc.
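
The layout above can be recreated with a short script before running any syntax files. This is only a convenience sketch in Python, not part of the author's Stata syntax; the names YEARS, SYNTAX_DIRS, project_dirs, and create_project_dirs are hypothetical, and the folder names are taken from the listing above.

```python
from pathlib import Path

# Census years covered by the project; 1890 is excluded.
YEARS = [y for y in range(1790, 1950, 10) if y != 1890]

# Syntax subfolders as named in the listing above.
SYNTAX_DIRS = [
    "1-read_data", "2-create_surname_links", "3-nhgisjoin",
    "4-living_arrangements", "5-pop_density", "6-standardize",
    "7-imp_files", "8-adj_rates", "8-adj_rates/states", "9-analysis",
]

def project_dirs(root="dem_res"):
    """Return every directory in the layout as a list of Path objects."""
    base = Path(root)
    dirs = [base / "data" / str(y) for y in YEARS]
    dirs += [base / "shapefiles" / str(y) for y in YEARS]
    dirs += [base / "syntax" / name for name in SYNTAX_DIRS]
    dirs += [base / "logs", base / "tables", base / "temp_data"]
    return dirs

def create_project_dirs(root="dem_res"):
    """Create the directories on disk; existing ones are left untouched."""
    for d in project_dirs(root):
        d.mkdir(parents=True, exist_ok=True)
```

Calling create_project_dirs() from the directory that should contain dem_res builds the empty tree; the data and shapefiles still have to be obtained as described under DATA ACCESS OVERVIEW.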


**********DATA**********	


**Samples**
	1790-419,317 households
	1800-542,070 households
	1810-829,388 households
	1820-1,238,140 households
	1830-1,824,028 households
	1840-2,585,889 households
	1850-19,443,785 persons
	1860-26,895,337 persons
	1870-37,643,496 persons
	1880-49,020,953 persons
	1900-73,779,794 persons
	1910-89,299,714 persons
	1920-101,337,837 persons
	1930-118,750,570 persons
	1940-128,250,989 persons

**Universe**
	All non-GQ households

**Variable Documentation**
	Full documentation for the following variables is available on the IPUMS-USA website. These are the variables required to create the kin propinquity measure.

	1790-1840
		imageid-Image identifier, used to sort the data, only available in the restricted version
		ycord-Y-coordinate, used to sort the data, only available in the restricted version
		pid-Ancestry identifier, used to sort the data, only available in the restricted version
		fullstate-State string, only available in the restricted version
		county-County string, only available in the restricted version
		township-Township string, only available in the restricted version
		city-City string, only available in the restricted version
		surname-Surname of head of household, only available in the restricted version

	1850-1940
		serial-Serial number for each household. While the serial number for each household is unique by year, it can change between versions of the data. 
		pernum-Person number within household 
		namelast-Surname of individuals. This variable is only available in the restricted IPUMS data. 
		race-Race of person 
		stateicp-State ICP Codes. Can also use STATEFIP, but the syntax in this file uses STATEICP. 
		countyicp-County ICP Codes. Can also use COUNTYFIP, but the syntax in this file uses COUNTYICP.
		enumdist, source variables-Identifies the enumeration district (1880-1940) or the minor civil division (1850-1870). The source variables are only available in the restricted IPUMS data. Source variables are unique to each year. For example, US1850C_0043 is the township string, but US1860C_0043 is not. ENUMDIST is used for 1880-1940, and the source variables are used for 1850-1870. 
		age-Age
		sursim-Surname Similarity, indicates individuals who share the same surname within a household. 

	These variables (and variables above) were used in the analysis of the patterns of kin propinquity.
		urban-Urban status. 
		bpl-Birthplace
		fbpl-Father's Birthplace
		mbpl-Mother's Birthplace 
		occ1950-1950 Occupational Code 
		countynhg-identifier used to merge the Census data to the county-level GIS files.
		histid-unique identifier, can be used to identify cases should the data ever change.
		relate-Relationship to Head 
		momloc-Mother's location within household
		poploc-Father's location within household

	The following variables were created by the author (based on the standardized version of the data in file 6_year.dta):
		ed_group-created from stateicp, county, ed 
		ed_name_group_count_f_a-Count of persons within enumeration district that had the same surname by race 
		ed_totpop_f_a-Count of persons within enumeration district by race 
		kinship_serial_1a-Distance above household between matching surnames for all individuals 
		kinship_serial_2a-Distance below household between matching surnames for all individuals
		kinship_serial_1b-Distance above household between matching surnames for all heads of household
		kinship_serial_2b-Distance below household between matching surnames for all heads of household 
		ed_name_group_count_f_b-Count of heads of household within enumeration district that had the same surname by race
		ed_totpop_f_b-Count of heads of household within enumeration district by race 
		namelast_universe-Variable indicating whether a surname is illegible and should not have a probability calculated 
		sursim_head_pernum-Variable indicating the reference person for a family
		sursim_head_age-Variable indicating the age of the reference person for a family 
		from_begin-Number of households a family is from the beginning of the enumeration district 
		from_end-Number of households a family is from the end of the enumeration district 
		age_link_1a-Age of matching reference person above household
		age_link_2a-Age of matching reference person below household 
		pop_density-County population density 
		urbtype-Recoding of urban, including extra code for farm families living in rural areas
		intergeneration-An indicator for elderly persons living with adult children
		generations-Variable indicating the number of generations in the household an elderly person lives in 
		extend-Variable indicating elderly persons living with extended kin 
		urban_rate-Percentage of county population living within a 1930 Census Urban Area
		C_a_ed_group-The number of other same surname families an average family had within an enumeration district based on Smith, 1989.
		common_rate_a_ed_group-% of families within an enumeration district that have the most common surname within said enumeration district
		distribution_a_ed_group-The distribution of surnames for a particular family (Frs/Trs)
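
As an illustration of the kind of calculation behind common_rate_a_ed_group, the share of families carrying the most common surname in one enumeration district can be sketched as below. This is a Python sketch under stated assumptions, not the author's Stata code: the function name common_surname_rate is hypothetical, the input is assumed to be one surname per family for a single enumeration district, and the split by race and the exclusion of illegible surnames (namelast_universe) are ignored.

```python
from collections import Counter

def common_surname_rate(family_surnames):
    """Percent of families in one enumeration district that carry the
    district's most common surname (one list entry per family)."""
    if not family_surnames:
        raise ValueError("the enumeration district has no families")
    counts = Counter(family_surnames)
    # most_common(1) returns [(surname, count)] for the top surname.
    top_count = counts.most_common(1)[0][1]
    return 100.0 * top_count / len(family_surnames)
```

For a district with families named Smith, Smith, Jones, and Brown, the function returns 50.0.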

**********SYNTAX FILES**********


Because of the size of the files, the author strongly recommends analyzing and running data one year at a time.
A "-" indicates notes on what the syntax file does
A "*" indicates data files that are created from the particular syntax file
run_all.do will run all of the syntax files. This program can take several weeks to run everything from the beginning.


Stata Syntax Files
1-Read files-Read in files using IPUMS syntax, select universes, and create variables
	-Reads in data from IPUMS. Each file is year specific and titled 1-read_data_year.do
	-Reformats data
	-Selects in-universe persons
	-Creates ED-level surname and population counts
	-Creates 5 year age groups
	-Creates urban and family farm indicator
	-Creates foreign-born generation indicator
	-For 1790-1840, uses data created by author (nhgisjoin.dta and year_imp.dta) to correctly map ICPSR geography codes to the data
	
	*Creates data file 1_year.dta
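
One of the recodes listed above, the 5 year age groups, can be sketched as follows. This is a Python illustration only: the function name is hypothetical, each group is labeled by its lower bound, and any top-coding or handling of missing ages in the author's Stata syntax is not reproduced here.

```python
def five_year_age_group(age):
    """Return the lower bound of the 5-year age group containing `age`
    (0 for ages 0-4, 5 for ages 5-9, and so on)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # Integer division drops the remainder, so 67 // 5 * 5 gives 65.
    return (age // 5) * 5
```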

2-Create surname links-Create Kin Propinquity measure
	-Creates surname links within enumeration district using 1_year data. To speed up processing, the program selects a state, then selects an enumeration district. This program can take anywhere from less than an hour to a couple of days, depending on the year being run and the computing environment in which the syntax is run.
	-The two files in this folder run 1790-1840 and 1850-1940 separately.
	
	*Creates data file 2_year.dta
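
The idea behind the surname links (see kinship_serial_1b and kinship_serial_2b above) can be sketched for a single enumeration district as below. This is a simplified Python illustration, not the author's Stata syntax: the function name nearest_same_surname is hypothetical, the input is assumed to be the heads' surnames in enumeration order, distance is counted in households, and the split by race and the treatment of illegible surnames are ignored.

```python
def nearest_same_surname(surnames):
    """For each household, in enumeration order, return (above, below):
    the number of households to the nearest earlier and the nearest later
    household with the same surname, or None when there is no match."""
    n = len(surnames)
    above = [None] * n
    below = [None] * n

    # Forward pass: remember the last position at which each surname appeared.
    last_seen = {}
    for i, name in enumerate(surnames):
        if name in last_seen:
            above[i] = i - last_seen[name]
        last_seen[name] = i

    # Backward pass: remember the next position at which each surname appears.
    next_seen = {}
    for i in range(n - 1, -1, -1):
        name = surnames[i]
        if name in next_seen:
            below[i] = next_seen[name] - i
        next_seen[name] = i

    return list(zip(above, below))
```

Each pass visits every household once, so the work grows linearly with the size of the enumeration district. Processing one district at a time, as the program above does, also keeps the surname lookup tables small.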
	
3-Create NHGIS-Create join code for Census data to NHGIS (specifically 1790-1840, 1900 & 1940, COUNTYNHG can be used for other years) using 2_year data.
	-3-gisjoin2.do creates GISJOIN2 codes for 1850-1940
	-3-icp_codes_1790-1840.do creates GISJOIN2 codes for 1790-1840
	
	*Creates data file 3_year.dta
	
4-Living Arrangements-Create indicators for complex families (three generation, extended, and with adult children) using 3_year data.
	
	*Creates data file 4_year.dta
	
5-Pop Density-Creates the county population densities for each year 
	-Uses NHGIS data from ArcGIS (year_area.xls) to merge county areas to the microdata using 4_year data.

	*Creates data file 5_year.dta
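
The density calculation amounts to matching each county's population to its area and dividing. A minimal Python sketch follows, assuming both inputs are keyed by the same county identifier (for example GISJOIN2) and that areas are in whatever unit year_area.xls provides. The function name and the choice to return None for counties with no matching or zero area are assumptions, not the author's Stata code.

```python
def county_density(county_pop, county_area):
    """Return population per unit area for each county in `county_pop`.
    Counties with a missing or zero area get None instead of a density."""
    density = {}
    for county, pop in county_pop.items():
        area = county_area.get(county)
        density[county] = pop / area if area else None
    return density
```

Returning None for unmatched counties makes failed joins easy to spot before the densities are merged back onto the microdata.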
	
6-Standardize-Standardize 1790-1840 & 1850-1940 files for analysis using 4_year data for 1850-1940 and 3_year data for 1790-1840.
	-For 1850-1940, this is the data file users should use to analyze kin propinquity
	-For 1790-1840, data undergoes additional steps
	
	*Creates data file 6_year.dta.
	
7-Impute files-Create files used to adjust bad data for 1790-1840 using 6_year data.
	-Creates files for 1790-1840 and for 1850, which is used to impute data for the early years in some cases. Each file is year specific and named 7-imp_files_year.do
	
	*Creates data file 7_year.dta
	
8-Adj Rates-Adjust rates for bad data for 1790-1840 using 7_year data.
	-The file 8-adj_rates_ss.do adjusts rates at the state-level
	-The file 8-adj_rates_cs.do adjusts rates at the county-level.
	-The file 8-adj_rates_ss.do MUST be run first
	
	*8-adj_rates_ss.do creates the file kinship_ss.dta in the dem_res folder
	*8-adj_rates_cs.do creates the file kinship_cs.dta in the dem_res folder
	
9-Analysis-syntax for recreating tables and figures using 6_year data for 1850-1940 and 8_year data for 1790-1840.
	-Each .do file creates the figure/table/numbers for the particular table or figure.
	
	*figure_8-10.do creates data files 9_year.xlsx to merge into ArcGIS for Figures 8-10


**********TABLES**********


Excel File
This excel file contains the results used to create the tables and figures in the paper. 
Each tab is named for the specific table/figure in the paper. 
Figures 8-10 are not included as they were produced in ArcGIS using the GUI.

To produce Figures 8-10-
1. Load NHGIS shapefiles for 1790, 1820, 1850, 1880, 1910, and 1940 into GIS software.
2. Join by GISJOIN2 in the shapefiles and GISJOIN2 in the Excel county files for each year.
3. Create the figure in Symbology, selecting "Quantities-Graduated Colors". The value field should be kinrate, with 5 classes.
4. Of the five bins, the first is limited to 0; the remaining bins have maximums of 0.199999, 0.399999, 0.599999, and 1, respectively.
5. The colors were based on colorbrewer2.org, a sequential 4-class OrRd scheme.
6. Each figure is exported (with a title and legend) as a .pdf and combined to create Figures 8-10.
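
For users working outside ArcGIS, the class breaks in step 4 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: kinrate is taken to be a proportion between 0 and 1, each class includes its upper bound, and the names UPPER_BOUNDS and kinrate_class are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the five classes described in step 4.
UPPER_BOUNDS = [0.0, 0.199999, 0.399999, 0.599999, 1.0]

def kinrate_class(kinrate):
    """Return the 0-based class index for a kinrate (0 = the zero-only bin)."""
    if not 0.0 <= kinrate <= 1.0:
        raise ValueError("kinrate is expected to lie between 0 and 1")
    # bisect_left finds the first upper bound that is >= kinrate.
    return bisect_left(UPPER_BOUNDS, kinrate)
```

A county with a kinrate of exactly 0 falls in class 0, a value of 0.1 falls in class 1, and a value of 0.6 or above falls in class 4.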