Data gathering is the first step in carrying out a data science project. This section describes the data sources, what each one contains, and how the data were collected. The project draws on five sources: the US Census Bureau, the World Bank, the OECD, U.S. Citizenship and Immigration Services, and the Migration Policy Institute. The collected data are saved as CSV files for later use.
US Census Bureau
Using the Census Microdata API, the data are collected from the Public Use Microdata Sample (PUMS) of the American Community Survey (ACS). This is the main dataset for the project, providing indicators for both immigrants and native-born Americans. The dataset is the 2021 ACS 1-Year Estimates, and each row represents one individual or household. The extracted dataset has the following 12 variables.
CIT: Citizenship status
NATIVITY: Nativity
AGEP: Age
POBP: Place of birth
DECADE: Decade of entry
ENG: Ability to speak English
MAR: Marital status
RAC1P: Recoded detailed race code
SEX: Sex
WAGP: Wages or salary income past 12 months
ESR: Employment status recode
SCHL: Educational attainment
Among these variables, WAGP and SCHL will be the main ones used to decide whether a person is successful. AGEP is restricted in the API query so that only people aged 20 or older are collected, because younger people are outside the scope of this project. For more information about the variables, such as their ranges or values, see the Census Bureau's PUMS variable documentation.
Code
import requests
import json
import pandas as pd

# Get dataset using API query
acs_raw = requests.get("https://api.census.gov/data/2021/acs/acs1/pums?get=CIT,NATIVITY,POBP,DECADE,ENG,MAR,RAC1P,SEX,ESR,WAGP,SCHL&AGEP=20:99")
acs_raw = acs_raw.json()
acs_raw = pd.DataFrame(acs_raw)

# Make first row as a column header
acs_raw.columns = acs_raw.iloc[0]
acs_raw = acs_raw[1:]

print("The dataset has", acs_raw.shape[0], "rows and", acs_raw.shape[1], "columns.")
acs_raw.head()
# acs_raw.to_csv("./data/acs_raw.csv", index=False)
The dataset has 2533139 rows and 12 columns.
   CIT  NATIVITY  POBP  DECADE  ENG  MAR  RAC1P  SEX  ESR  WAGP  SCHL  AGEP
1  1    1         004   0       0    1    1      1    6    0     11    36
2  1    1         039   0       0    5    1      1    6    0     22    57
3  1    1         046   0       0    5    5      1    6    0     14    29
4  1    1         006   0       0    5    1      1    6    0     1     26
5  1    1         006   0       0    2    1      2    6    0     21    80
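The values in the response appear to arrive as strings (note the zero-padded POBP codes in the rows shown above), so numeric variables such as WAGP, SCHL, and AGEP would need to be cast before any comparison or arithmetic. The following is a minimal sketch of that step using the five rows shown above. The helper name to_typed_records and the choice of which columns to cast are assumptions made for illustration, not part of the original pipeline.

```python
# Sample response in the shape produced by the query above:
# a list of lists whose first element is the header row.
sample_response = [
    ["CIT", "NATIVITY", "POBP", "DECADE", "ENG", "MAR",
     "RAC1P", "SEX", "ESR", "WAGP", "SCHL", "AGEP"],
    ["1", "1", "004", "0", "0", "1", "1", "1", "6", "0", "11", "36"],
    ["1", "1", "039", "0", "0", "5", "1", "1", "6", "0", "22", "57"],
    ["1", "1", "046", "0", "0", "5", "5", "1", "6", "0", "14", "29"],
    ["1", "1", "006", "0", "0", "5", "1", "1", "6", "0", "1", "26"],
    ["1", "1", "006", "0", "0", "2", "1", "2", "6", "0", "21", "80"],
]

# Columns cast to int here (an assumption for illustration);
# the remaining columns are categorical codes and stay as strings.
NUMERIC_COLUMNS = ("WAGP", "SCHL", "AGEP")


def to_typed_records(response, numeric_columns=NUMERIC_COLUMNS):
    """Convert a header-first list of lists into a list of dicts,
    casting the chosen columns to int."""
    header, *rows = response
    records = []
    for row in rows:
        record = dict(zip(header, row))
        for name in numeric_columns:
            record[name] = int(record[name])
        records.append(record)
    return records


records = to_typed_records(sample_response)
print(records[0]["AGEP"], records[0]["POBP"])  # prints: 36 004
```

On the DataFrame built in the code above, the same cast could be written as acs_raw[["WAGP", "SCHL", "AGEP"]].astype(int). Keeping POBP as a string preserves its leading zeros.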
World Bank & OECD
Because immigrants come from many countries around the world, it is important to have a benchmark for how they are doing compared to the national average of their country of origin. The World Bank provides a data API for the World Development Indicators, which include economic, environmental, and social indicators for countries. This API is used to collect the national employment-to-population ratio, SL.EMP.WORK.ZS. The collected dataset covers 217 countries over a 30-year span from 1991 to 2021.
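For illustration, a request of this kind could be sketched as follows. The sketch assumes the World Bank v2 endpoint layout (/v2/country/all/indicator/ followed by the indicator code) and its two-element JSON response, with paging metadata first and the records second. The function names and the tiny sample payload with placeholder values are hypothetical, and this is not necessarily the query the project used.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.worldbank.org/v2/country/all/indicator/"


def build_url(indicator, start_year, end_year, per_page=20000):
    """Build a World Bank API URL for one indicator and a year range."""
    query = urlencode(
        {"format": "json", "date": f"{start_year}:{end_year}", "per_page": per_page},
        safe=":",  # keep the colon in the date range readable
    )
    return f"{BASE_URL}{indicator}?{query}"


def parse_records(payload):
    """Flatten the [metadata, records] payload into simple dicts,
    skipping entries that have no reported value."""
    _metadata, records = payload
    rows = []
    for rec in records or []:
        if rec["value"] is None:
            continue
        rows.append({
            "country": rec["country"]["value"],
            "iso3": rec["countryiso3code"],
            "year": int(rec["date"]),
            "value": rec["value"],
        })
    return rows


def fetch_indicator(indicator, start_year, end_year):
    """Download and parse one indicator (requires network access)."""
    with urlopen(build_url(indicator, start_year, end_year)) as resp:
        return parse_records(json.load(resp))


# Offline demonstration with a made-up payload (placeholder values, not real data).
sample_payload = [
    {"page": 1, "pages": 1, "per_page": 20000, "total": 2},
    [
        {"country": {"id": "XX", "value": "Exampleland"},
         "countryiso3code": "XXX", "date": "2021", "value": 12.5},
        {"country": {"id": "XX", "value": "Exampleland"},
         "countryiso3code": "XXX", "date": "2020", "value": None},
    ],
]
rows = parse_records(sample_payload)
print(build_url("SL.EMP.WORK.ZS", 1991, 2021))
print(rows)
```

With network access, fetch_indicator("SL.EMP.WORK.ZS", 1991, 2021) would return the same row structure for the real data, which can then be passed to pd.DataFrame and saved as CSV.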
National average wage and educational attainment are other important indicators for quantifying success. The datasets for these indicators are collected manually from the OECD. The annual average wage dataset includes the annual average wage in USD for 38 countries. The adult education level dataset provides educational attainment ratios among 25-64 year-olds for 48 countries, split into three subjects: “Below upper secondary” (BUPPSRY), “Tertiary” (TRY), and “Upper secondary” (UPPSRY). Both datasets cover the same time span as the employment rate data.
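Because the education dataset is organized by subject, it may help to reshape it so that each subject becomes its own field per country and year. The following is a minimal sketch, assuming the downloaded CSV uses a long format with LOCATION, SUBJECT, TIME, and Value columns. Those column names, the helper name pivot_by_subject, and the sample rows (placeholder values, not real data) are assumptions rather than details taken from the project's files.

```python
import csv
import io

# Placeholder rows in the assumed long format (values are not real data).
sample_csv = """LOCATION,SUBJECT,TIME,Value
AAA,BUPPSRY,2021,10.0
AAA,UPPSRY,2021,40.0
AAA,TRY,2021,50.0
BBB,TRY,2021,30.0
"""


def pivot_by_subject(csv_file):
    """Return {(location, year): {subject: value}} from long-format rows."""
    wide = {}
    for row in csv.DictReader(csv_file):
        key = (row["LOCATION"], int(row["TIME"]))
        wide.setdefault(key, {})[row["SUBJECT"]] = float(row["Value"])
    return wide


wide = pivot_by_subject(io.StringIO(sample_csv))
print(wide[("AAA", 2021)])
```

For a real file, the function can be given an open file handle, for example open(path, newline=""), in place of the in-memory string.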
U.S. Citizenship and Immigration Services
Becoming an immigrant is a demanding process with many steps, including submitting an application form. This form is the initial filter the U.S. uses to determine who can be in the country. It is therefore important to examine whether the form captures the applicant characteristics that relate to success.
Migration Policy Institute
In addition to the recorded data above, this project also works with text data. The Migration Policy Institute publishes overview reports on immigrants from specific countries, covering key points such as English proficiency, age, education, employment, income and poverty, and immigration pathways and naturalization. These reports are a good resource for showing the general trends among those immigrants. Among the many immigrant groups, this project looks at four: immigrants from Korea, Canada, Mexico, and Colombia.
Code
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import nltk

# Define the user agent to mimic a web browser
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'

# Set headers for the HTTP request
headers = {'User-Agent': user_agent}

# Define URLs for different articles
mex_url = "https://www.migrationpolicy.org/article/mexican-immigrants-united-states"
kor_url = "https://www.migrationpolicy.org/article/korean-immigrants-united-states"
col_url = "https://www.migrationpolicy.org/article/colombian-immigrants-united-states"
can_url = "https://www.migrationpolicy.org/article/canadian-immigrants-united-states"
url_list = [mex_url, col_url, can_url, kor_url]

# Define IDs for subsections in the articles
id1 = 'english'
id2 = 'employment'
id3 = 'poverty'
id4 = 'pathways'
id5 = 'unauthorized'
id6 = 'health'
id7 = 'diaspora'
id_list = [id1, id2, id3, id4, id5, id6, id7]

# Function to extract text between two specified elements
# (reads the module-level `soup` set in the loop below)
def get_subtext(id1, id2):
    start_element = soup.find_all("a", id=re.compile(id1))[0]
    end_element = soup.find_all("a", id=re.compile(id2))[0]
    # Initialize an empty string to store the extracted text
    extracted_text = ""
    current_element = start_element.find_next()
    # Iterate through the siblings between the two elements
    while current_element.get_text() != end_element.find_previous().get_text():
        extracted_text = extracted_text + current_element.get_text() + " "
        current_element = current_element.find_next_sibling()
        if current_element.name == "div":
            current_element = current_element.find_next_sibling()
    return extracted_text

# Loop through each URL and extract relevant text
txts = []
for i in url_list:
    response = requests.get(i, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    page_txt = ""
    for j in range(len(id_list) - 1):
        try:
            sub_text = get_subtext(id_list[j], id_list[j + 1])
            page_txt += sub_text
        except Exception:
            # Skip subsections that are missing from this article
            continue
    txts.append(page_txt)

labels = ["Mexico", "Colombia", "Canada", "Korea"]
df_txt = pd.DataFrame()

# Loop through each article, tokenize sentences, and create a DataFrame
for i in range(4):
    corpus = []   # list of strings (input variables X)
    targets = []  # list of targets (labels or response variables Y)
    sentences = nltk.tokenize.sent_tokenize(txts[i])
    counter = 0
    min_sentence_length = 20
    text_chunk = ''
    for sentence in sentences:
        # Remove any double spaces
        text_chunk = ' '.join(sentence.split()).strip()
        corpus.append(text_chunk)
    tmp = []
    for j in range(0, len(corpus)):
        tmp.append(corpus[j])
    df = pd.DataFrame(tmp)
    df = df.rename(columns={0: "text"})
    df["label"] = labels[i]
    df_txt = pd.concat([df_txt, df], axis=0)

df_txt.head()
# df_txt.to_csv("./data/MPI_raw.csv", index=False)