Data Gathering

Data Gathering is the initial step carrying out the data science project. This section covers what data sources, its description and methodology. This project involves 5 data sources which are US Census Bureau, World Bank, OECD, U.S. Citizenship and Immigration Services, and Migration Policy Institute. The collected data are saved into csv files for later use.

US Census Bureau

Using Census Microdata API, the data is collected from PUMS(Public Use Microdata Sample) of American Community Survey (ACS). This is the main dataset for this project, gathering the indicators of immigrants and native-born Americans. The dataset is 2021 ACS 1-Year Estimates and each row represents one individual or household. Extracted dataset has 12 variables which are following.

CIT: Citizenship status
NATIVITY: Nativity
AGEP: Age
POBP: Place of birth
DECADE: Decade of entry
ENG: Ability to speack English
MAR: Marital status
RAC1P: Recorded detailed race code
SEX: Sex
WAGP: Wages or salary income past 12 months
ESR: Employment status recode
SCHL: Educational attainment

Among those variables, WAGP and SCHL will be mainly used for deciding a person is successful. AGEP variable is subsetted in API query to collect only those who are 20 years old or older because those who don’t are out of scope of this project. For more information about the variables such as a range or values, you can go here.

Code

import requests
import json
import pandas as pd

# Get datatset using API query
acs_raw = requests.get("https://api.census.gov/data/2021/acs/acs1/pums?get=CIT,NATIVITY,POBP,DECADE,ENG,MAR,RAC1P,SEX,ESR,WAGP,SCHL&AGEP=20:99")
acs_raw = acs_raw.json()
acs_raw = pd.DataFrame(acs_raw)

# Make first row as a column header
acs_raw.columns = acs_raw.iloc[0]
acs_raw = acs_raw[1:]
print("The dataset has",acs_raw.shape[0],"rows and",acs_raw.shape[1],"columns.")
acs_raw.head()

# acs_raw.to_csv("./data/acs_raw.csv", index=False)

The dataset has 2533139 rows and 12 columns.

	CIT	NATIVITY	POBP	MAR	RAC1P	SEX	ESR	SCHL	AGEP
1	1	1	004	1	1	1	6	11	36
2	1	1	039	5	1	1	6	22	57
3	1	1	046	5	5	1	6	14	29
4	1	1	006	5	1	1	6	1	26
5	1	1	006	2	1	2	6	21	80

World Bank & OECD

As immigrants come from many countries around the world, it is important to have a standard how they are doing compared to their national average. World Bank provides data API for World Development Indicators including economic, environmental, or social indicators on countries. This API is used to collect national employment to population ratio SL.EMP.WORK.ZS. The collected dataset includes 217 countries in a 30-year span between 1991-2021.

National average wage and education attainment are another important indicator for quantifying success. The datasets for those indicators are manually collected from OECD. Annual average wage dataset includes 38 countries’s annual avarage wage in USD. Adult education level dataset provides education attainment ratio among 25-64 year-olds. There are three subjects which are “Below upper secondary”(BUPPSRY), “Tertiary”(TRY), and “Upper secondar”(UPPSRY) featuring 48 countries. The span of these two are same as employment rate data.

Employment rate

Code

library(wbstats)
library(tidyr)
# Employment to poluation ratio
employment = wb_data(indicator = "SL.EMP.TOTL.SP.ZS", start_date = 1991, end_date = 2021, return_wide = FALSE)
employment = employment[, -c(1,2,3,8,9,10,11)]
employment = pivot_wider(employment, id_cols = c(iso3c,country), names_from = date, values_from = value)
dim(employment)
head(employment)

# write.csv(employment, file = "./data/employment_raw.csv", row.names = FALSE)

A tibble: 6 × 33
iso3c	country	2021	2020	2019	2018	2017	2016	2015	2014	...	2000	1999	1998	1997	1996	1995	1994	1993	1992	1991
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
AFG	Afghanistan	NA	36.711	42.223	42.128	42.021	42.475	42.930	43.374	...	42.992	42.977	42.971	42.984	42.985	42.993	43.109	43.152	43.168	43.207
ALB	Albania	51.367	49.417	53.392	52.006	50.152	48.473	45.960	43.778	...	49.013	48.335	49.594	52.167	53.263	53.477	50.446	48.194	45.746	58.067
DZA	Algeria	35.350	34.530	37.369	37.415	37.496	37.599	37.034	36.580	...	30.789	31.625	32.557	33.295	32.111	30.576	32.360	32.808	33.356	34.904
ASM	American Samoa	NA	NA	NA	NA	NA	NA	NA	NA	...	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
AND	Andorra	NA	NA	NA	NA	NA	NA	NA	NA	...	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
AGO	Angola	68.558	68.545	69.874	69.883	69.906	69.888	69.902	69.920	...	74.080	74.082	74.089	74.088	74.088	74.060	74.012	73.916	73.950	73.936

Annual average wage

Code

wage = read.csv("./data/DP_LIVE_09102023230706933.csv")
wage = wage[,c(1,6,7)]
colnames(wage) = c('iso3c','TIME','Value')
wage = pivot_wider(wage, id_cols = "iso3c", names_from = TIME, values_from = Value)
wage = subset(wage, !(iso3c %in% c("OECD")))
dim(wage)
head(wage)

# write.csv(wage, file = "./data/wage_raw.csv", row.names = FALSE)

A tibble: 6 × 32
iso3c	1991	1992	1993	1994	1995	1996	1997	1998	1999	...	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
AUS	42309.38	43173.77	43578.24	43873.86	43715	44976.28	46355.56	46999.64	48064.87	...	57752	57579	58102	57744	57885	57619	57843	58620	60377	60681.50
AUT	52697.40	53759.06	54197.37	54743.42	55184	54819.62	54346.95	56101.14	57354.90	...	62515	62568	62801	63231	63860	63856	64101	64623	64648	65402.32
BEL	53018.64	54718.73	55868.83	56952.78	56759	57470.56	58203.38	58238.28	61484.41	...	64461	65099	65461	65017	65157	64700	65083	65700	63677	65520.82
CAN	42426.30	43045.90	42952.72	42464.18	42369	42745.09	44053.77	44793.21	45157.23	...	53717	54286	54995	55400	54350	55122	56083	56370	59160	59568.78
CZE	NA	NA	NA	NA	16612	18320.99	18606.94	18603.83	19344.55	...	28934	28754	29360	30224	31291	32891	34633	36071	35552	36269.71
DNK	47454.31	49053.26	49596.11	47871.63	48894	49781.44	50577.45	51462.34	52066.95	...	61121	61548	62723	63828	63894	64281	64765	65222	66203	66503.05

Adult education attainment rate

Code

education = read.csv("./data/DP_LIVE_09102023230710370.csv")
education = education[,c(1,3,6,7)]
colnames(education) = c('iso3c','SUBJECT','TIME','Value')
education = pivot_wider(education, id_cols = c(iso3c,SUBJECT), names_from = TIME, values_from = Value)
education = subset(education, !(iso3c %in% c("G20","OAVG")))
dim(education)
head(education)

# write.csv(education, file = "./data/education_raw.csv", row.names = FALSE)

A tibble: 6 × 33
iso3c	SUBJECT	1991	1993	1994	1995	1997	1998	1999	2000	...	2014	2015	2016	2017	2018	2019	2020	2021	1992	1996
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
AUS	BUPPSRY	44.12706	47.15905	49.80203	44.93585	46.69527	43.95596	42.61008	41.21231	...	22.89584	20.98090	20.06735	19.01266	18.10864	17.13266	16.24498	15.46888	NA	NA
AUS	TRY	31.15881	22.46919	23.07237	24.31545	24.30127	25.42071	26.69340	27.47575	...	41.90185	42.88876	43.74390	45.35567	45.72748	47.12998	49.33745	49.76787	NA	NA
AUS	UPPSRY	24.71413	30.37176	27.12561	30.74870	29.00345	30.62333	30.69652	31.31195	...	35.20231	36.13035	36.18875	35.63167	36.16388	35.73737	34.41757	34.76325	NA	NA
AUT	BUPPSRY	NA	NA	NA	NA	NA	NA	NA	NA	...	16.14400	15.35130	15.47159	15.03758	14.70253	14.43586	14.34133	14.06036	NA	NA
AUT	TRY	NA	NA	NA	NA	NA	NA	NA	NA	...	29.90490	30.55073	31.38396	32.39439	32.71143	33.77378	34.20589	34.60450	NA	NA
AUT	UPPSRY	NA	NA	NA	NA	NA	NA	NA	NA	...	53.95110	54.09797	53.14444	52.56804	52.58605	51.79036	51.45277	51.33514	NA	NA

USCIS

Being immigrants is a hard process as they go over many processes, including a submitting an application form. This application form is an initial filter that the U.S. uses to determine who can be in the U.S. Therefore it is important to look at the form whether it is reflecting well on applicants characteristics in terms of success.

N-400: Application for Naturalization

Migration Policy Institute

On top of those recored data, this project will engage with text data. Migration Policy Institute provides a overall report on immigrants from specific countries, going over some key points such as English Proficiency, Age, Education, Employment, Income and Poverty, Immigration Pathways and Naturalization. This report will be a good resource to show the gerneral trends of those immigrants. Among many immigrants, this project is looking at four specific immigrants from Korea, Canada, Mexico, and Colombia.

Code

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import nltk

# Define the user agent to mimic a web browser
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'

# Set headers for the HTTP request
headers = {'User-Agent': user_agent}

# Define URLs for different articles
mex_url = "https://www.migrationpolicy.org/article/mexican-immigrants-united-states"
kor_url = "https://www.migrationpolicy.org/article/korean-immigrants-united-states"
col_url = "https://www.migrationpolicy.org/article/colombian-immigrants-united-states"
can_url = "https://www.migrationpolicy.org/article/canadian-immigrants-united-states"
url_list = [mex_url,col_url,can_url,kor_url]

# Define IDs for subsections in the articles
id1 = 'english'
id2 = 'employment'
id3 = 'poverty'
id4 = 'pathways'
id5 = 'unauthorized'
id6 = 'health'
id7 = 'diaspora'
id_list = [id1,id2,id3,id4,id5,id6,id7]

# Function to extract text between two specified elements
def get_subtext(id1, id2):
    start_element = soup.find_all("a",id=re.compile(id1))[0]
    end_element = soup.find_all("a",id=re.compile(id2))[0]
    # Initialize an empty string to store the extracted text
    extracted_text = ""
    current_element = start_element.find_next()
    # Iterate through the siblings between the two elements
    while current_element.get_text() != end_element.find_previous().get_text():
        extracted_text = extracted_text + current_element.get_text() + " "
        current_element = current_element.find_next_sibling()
        if current_element.name == "div":
            current_element = current_element.find_next_sibling()      
    return extracted_text

# Loop through each URL and extract relevant text
txts = []
for i in url_list:
    response = requests.get(i, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    page_txt = ""
    for j in range(len(id_list)-1):
        try:
            sub_text = get_subtext(id_list[j],id_list[j+1])
            page_txt += sub_text
        except:
            continue
    txts.append(page_txt)

labels = ["Mexico", "Colombia", "Canada", "Korea"]
df_txt = pd.DataFrame()

# Loop through each article, tokenize sentences, and create a DataFrame
for i in range(4):
    corpus=[]  # list of strings (input variables X)
    targets=[] # list of targets (labels or response variables Y)

    sentences=nltk.tokenize.sent_tokenize(txts[i])
    counter=0
    min_sentence_length=20
    text_chunk=''

    for sentence in sentences:
        # Remove any douvle spaces
        text_chunk=' '.join(sentence.split()).strip()
        corpus.append(text_chunk)

    tmp=[]
    for j in range(0,len(corpus)):
        tmp.append(corpus[j])
    df=pd.DataFrame(tmp)
    df=df.rename(columns={0: "text"})
    df["label"] = labels[i]
    df_txt = pd.concat([df_txt, df], axis=0)
    
df_txt.head()

# df_txt.to_csv("./data/MPI_raw.csv", index=False)

	text	label
0	Mexican immigrants are less likely to be profi...	Mexico
1	In 2021, about 65 percent of Mexicans ages 5 a...	Mexico
2	Approximately 6 percent of Mexican immigrants ...	Mexico
3	Note: Limited English Proficient (LEP) status ...	Mexico
4	Their median age was about 46 years old, compa...	Mexico