**********************************************************

Which definition of migration better fits Facebook expats? A response using Mexican census data

**********************************************************

**********************************************************
Data: 
For this research, we used 2020 Mexican Census data and Facebook user data extracted from the Facebook API by the project "Using internet-based data to quantify and sample international migrants. Applications to examine recent immigration to Uruguay", funded by the Agencia Nacional de Investigación e Innovación (Uruguay) and the Max Planck Institute for Demographic Research (Germany).

Full documentation and the Census data are publicly available on the INEGI website (https://www.inegi.org.mx/programas/ccpv/2020/#documentacion). The "Documentation" section there lists the variable codes, including the country codes used here.

Data from both the 2020 Mexican Census and the Facebook extracts are consolidated in the file "base_eng_mod.csv", which we make available for replication.

*********************************************************


**********************************************************
PART I.
Do file 1 in Stata: "1_immigration_definition.do"

*********************************************************

Objective:
This code generates key variables and summary tables from the 2020 Mexican Census data, so that they can be compared with the Facebook expat tagging information extracted in March 2020.

Data sources:
We worked with two datasets: the "Individuals (Personas)" and "Dwelling (Vivienda)" datasets extracted from the extended questionnaire of the 2020 Mexican Census microdata. These data are accessible at the INEGI website: https://www.inegi.org.mx/programas/ccpv/2020/#microdatos

Sample size for individuals:
Within our census data, the sample size for individuals is 15,015,683.

Our Step-by-Step Process:

Importing and Saving Individuals’ Data:
First, we import the "Personas00.CSV" dataset and save it in ".dta" format. Next, we rename certain variables to enhance clarity and uniformity.

Selecting Variables of Interest:
Next, we select the variables of interest from the individuals’ database and save the result as "variables_personas.dta".

Importing and Saving Dwelling-level Data:
We import the dwelling dataset "Viviendas00.CSV", rename the required variables, keep only the variables of interest, and save the result as "internet.dta".

Merging Databases:
Next, we merge the individuals’ and dwelling databases on the "ID_VIV" identifier.
We then drop the "_merge" variable and save the merged database as "personas_internet.dta".
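
The do-file performs this merge in Stata. As a rough illustration of the same person-to-dwelling join, here is a minimal sketch in base R. Only "ID_VIV" is taken from this README; the toy data and every other column name are hypothetical.

```r
# Toy stand-ins for the person-level and dwelling-level files.
# Only ID_VIV comes from the README; all other names are hypothetical.
personas <- data.frame(
  ID_VIV    = c(101, 101, 102, 103),
  person_id = 1:4,
  age       = c(34, 8, 51, 27)
)
viviendas <- data.frame(
  ID_VIV       = c(101, 102, 103),
  has_internet = c(1, 0, 1)
)

# Many-to-one merge: each person inherits the attributes of their dwelling.
# all.x = TRUE keeps every person even if a dwelling record were missing.
personas_internet <- merge(personas, viviendas, by = "ID_VIV", all.x = TRUE)

# Sanity check: the merge must not create or drop person records.
stopifnot(nrow(personas_internet) == nrow(personas))
print(personas_internet)
```

Checking the row count after the merge is a cheap way to catch duplicated dwelling identifiers, which would silently multiply person records.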

Descriptive Statistical Analysis:
We run a basic descriptive statistical analysis on the merged database using six different definitions of immigration:
(1) Immigrants by country of birth
(2) Recent immigrants by country of residence
(3) Recent immigrants by country of birth
(4) Recent immigrants who in 2015 were residing in their birth country
(5) Recent immigrants residing in 2015 in a country other than where they were born
(6) Recent Mexican returnees
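
To make the logic of such indicators concrete, the sketch below codes one possible reading of definitions 1, 4, 5 and 6 in base R. It is an assumption-laden illustration, not the do-file's actual coding: the column names, the string country codes (the census uses its own codes) and the exact conditions are all hypothetical, and definitions 2 and 3 are left out because their wording here does not pin down a single rule.

```r
# Hypothetical columns: country of birth and country of residence in 2015.
# "MEX", "USA", "GTM" are illustrative codes, not the INEGI country codes.
people <- data.frame(
  birth_country   = c("USA", "USA", "GTM", "MEX", "MEX"),
  res2015_country = c("USA", "MEX", "USA", "USA", "MEX"),
  stringsAsFactors = FALSE
)

foreign_born <- people$birth_country != "MEX"
abroad_2015  <- people$res2015_country != "MEX"

# Definition 1 (assumed): born outside Mexico.
people$def1 <- foreign_born
# Definition 4 (assumed): foreign-born and living in the birth country in 2015.
people$def4 <- foreign_born & people$res2015_country == people$birth_country
# Definition 5 (assumed): foreign-born and living in 2015 in a foreign
# country other than the birth country.
people$def5 <- foreign_born & abroad_2015 &
  people$res2015_country != people$birth_country
# Definition 6 (assumed): born in Mexico and living abroad in 2015.
people$def6 <- !foreign_born & abroad_2015

print(people)
```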

Group by Age, Sex, and Internet access at the dwelling:
Next, we create subgroups defined by combinations of sex, age group, and internet access at the dwelling.
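
A minimal base R sketch of this kind of grouping follows. The age bands, column names and toy values are assumptions chosen for illustration; the do-file's actual cut points are not stated in this README.

```r
# Hypothetical person-level data.
people <- data.frame(
  sex          = c("F", "M", "F", "M"),
  age          = c(19, 34, 47, 62),
  has_internet = c(1, 0, 1, 1)
)

# Illustrative age bands (left-closed intervals): [15,30), [30,45), [45,60), [60,Inf).
people$age_group <- cut(
  people$age,
  breaks = c(15, 30, 45, 60, Inf),
  labels = c("15-29", "30-44", "45-59", "60+"),
  right  = FALSE
)

# One label per combination of sex, age group and internet access.
# drop = TRUE discards combinations that do not occur in the data.
people$group <- interaction(
  people$sex, people$age_group, people$has_internet,
  sep = "_", drop = TRUE
)

print(people)
```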

Export Summary Tables:
Finally, we export weighted summary tables for each migrant definition, by sex, age and country of origin, to Excel. We use the "tabout" command, with weighted counts computed from the variable FACTOR.
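
The do-file builds these tables with "tabout" in Stata. The sketch below shows the underlying idea of a weighted count in base R: instead of counting rows, sum the weight variable within each cell. FACTOR is the weight variable named above; the toy values and the other column names are hypothetical.

```r
# Hypothetical sample records, each carrying a weight (FACTOR).
people <- data.frame(
  sex       = c("F", "M", "F", "M", "F"),
  age_group = c("15-29", "15-29", "30-44", "30-44", "15-29"),
  FACTOR    = c(120, 80, 200, 150, 30)
)

# Weighted cross-tabulation: each cell is the sum of FACTOR, not a row count.
weighted_counts <- xtabs(FACTOR ~ sex + age_group, data = people)
print(weighted_counts)

# Long format, convenient for writing to a file with write.csv().
weighted_long <- as.data.frame(weighted_counts, responseName = "weighted_n")
print(weighted_long)
```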


**********************************************************
PART II
Script 1 in R, version 4.2.2: "01_models_eng.R"

*********************************************************

R Script for Immigration Analysis and Model Comparison
Our goal is to understand the association between the different census-based migration definitions and Facebook's expat tagging, considering sex, age, and country of origin.
The script covers basic data preprocessing, logarithmic transformations where needed, and the estimation of linear regression models for six different immigration definitions.

Packages and Data Loading:
We load the R packages needed for data manipulation, visualization, and model estimation, then load the primary dataset, the consolidated file "base_eng_mod.csv".
This dataset contains information from both the 2020 Mexican Census and the Facebook API, focusing on key variables of interest.

Logarithmic transformations of variables:
We apply the natural logarithm to ease the interpretation of our results and to account for non-linear relationships between the variables.
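
As a small illustration of why logs help with interpretation: when both sides of a regression are logged, the slope reads as an elasticity, that is, the approximate percentage change in the outcome associated with a 1% change in the predictor. The sketch below uses made-up, noise-free numbers and hypothetical variable names; it does not reproduce the variables or the specification of "01_models_eng.R".

```r
# Made-up counts following an exact power law: facebook = 3 * census^0.8.
census_count   <- c(10, 50, 200, 1000, 5000)
facebook_count <- 3 * census_count^0.8

# On the log scale the power law becomes a straight line:
# log(facebook) = log(3) + 0.8 * log(census).
fit <- lm(log(facebook_count) ~ log(census_count))
print(coef(fit))

# Caveat: log(0) is -Inf, so zero counts need handling before transforming.
# How the script treats zeros is not stated in this README.
```

The fitted slope recovers the exponent 0.8, which is the elasticity in this toy example.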

Defining and Estimating Regression Models:
We estimate linear regression models for six different migration definitions. For each definition, we add covariates sequentially and estimate a series of nested models. The covariates include demographic variables (sex, age, and country of origin) as well as internet penetration rates.
We use the AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion) to compare models and assess their relative goodness of fit.
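
The nested-model comparison can be sketched in base R as follows. The data are simulated, and the variable names, the choice of dependent variable and the order in which covariates enter are assumptions made for illustration only.

```r
set.seed(42)
n <- 200

# Simulated data; all names and effect sizes are hypothetical.
d <- data.frame(
  log_fb    = rnorm(n, mean = 8, sd = 1),
  sex       = factor(sample(c("F", "M"), n, replace = TRUE)),
  age_group = factor(sample(c("15-29", "30-44", "45+"), n, replace = TRUE))
)
d$log_census <- 0.5 + 0.9 * d$log_fb + 0.3 * (d$sex == "M") + rnorm(n, sd = 0.2)

# Nested models: each one adds a covariate to the previous specification.
m1 <- lm(log_census ~ log_fb, data = d)
m2 <- update(m1, . ~ . + sex)
m3 <- update(m2, . ~ . + age_group)

# Lower AIC/BIC indicates a better trade-off between fit and complexity.
comparison <- data.frame(
  model = c("m1", "m2", "m3"),
  AIC   = c(AIC(m1), AIC(m2), AIC(m3)),
  BIC   = c(BIC(m1), BIC(m2), BIC(m3))
)
print(comparison)
```

Both criteria penalize extra parameters, and the BIC does so more heavily than the AIC in all but very small samples, so the two can disagree about whether an added covariate is worth keeping.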

Generating Summary Tables:
Finally, we generate a summary table (Table 2 in the manuscript) using the 'stargazer' package. This table presents regression results for all six migration definitions and includes AIC and BIC values for model comparison.

**********************************************************
PART III
Script 2 in R version 4.2.2: "02_graphs_eng.R"

*********************************************************

Generate Graphs to Compare Models

In this section, we create graphs to visualize our results. We also compare different models to understand the relationships and trends within our data.

Data Import and Preparation:
We begin by importing our data from the "bases" directory using the read_csv function.

Figure 1: Internet Penetration Rate for Recent Migrants (Definition 2):
We plot the internet penetration rate for recent migrants under Definition 2. The plot shows how the proportion of migrants with internet access varies by country of origin and age group, with color-coded points for men and women.
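
For readers who want to see how a rate of this kind can be computed before plotting, here is a base R sketch. It assumes the rate is the weighted share of people whose dwelling has internet access, which is an assumption about the calculation rather than a description of the script; the column names and values are hypothetical.

```r
# Hypothetical weighted person records.
people <- data.frame(
  origin       = c("USA", "USA", "USA", "USA", "GTM", "GTM"),
  sex          = c("F", "F", "M", "M", "F", "F"),
  has_internet = c(1, 0, 1, 0, 1, 0),
  FACTOR       = c(100, 100, 50, 150, 80, 20)
)

# Weighted number of people with internet access, per record.
people$weighted_internet <- people$FACTOR * people$has_internet

# Sum the numerator and denominator within each origin-by-sex group.
rates <- aggregate(
  cbind(weighted_internet, FACTOR) ~ origin + sex,
  data = people, FUN = sum
)

# Penetration rate = weighted people with internet / weighted people.
rates$penetration <- rates$weighted_internet / rates$FACTOR
print(rates)
```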

Package Setup:
We load the dotwhisker package, essential for creating coefficient plots.

Model Estimation:
We re-estimate our six linear regression models with different definitions of migration.

Figure 2: Coefficient Plots for Model Comparison:
We generate coefficient plots using dwplot to compare models. Each plot includes the estimated coefficients with confidence intervals. 
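
A coefficient plot displays, for each model and term, a point estimate and its confidence interval. The base R sketch below assembles those quantities for two simulated models. It deliberately avoids the dotwhisker package so that it has no dependencies; the data, variable names and model formulas are hypothetical.

```r
set.seed(7)
n <- 100

# Simulated data for two hypothetical migration definitions.
d <- data.frame(log_fb = rnorm(n, mean = 8, sd = 1))
d$log_def1 <- 1.0 + 0.9 * d$log_fb + rnorm(n, sd = 0.3)
d$log_def2 <- 0.2 + 0.6 * d$log_fb + rnorm(n, sd = 0.3)

models <- list(
  def1 = lm(log_def1 ~ log_fb, data = d),
  def2 = lm(log_def2 ~ log_fb, data = d)
)

# One row per model and term: estimate plus 95% confidence bounds.
coef_table <- do.call(rbind, lapply(names(models), function(nm) {
  fit <- models[[nm]]
  ci  <- confint(fit, level = 0.95)
  data.frame(
    model     = nm,
    term      = names(coef(fit)),
    estimate  = unname(coef(fit)),
    conf_low  = unname(ci[, 1]),
    conf_high = unname(ci[, 2])
  )
}))
print(coef_table)
```

Stacking all models into one long table makes it straightforward to compare the same term across definitions, which is the comparison a coefficient plot is meant to support.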

**********************************************************
End
*********************************************************