Data Gathering

Data Gathering is the initial step carrying out the data science project. This section covers what data sources, its description and methodology. This project involves 5 data sources which are US Census Bureau, World Bank, OECD, U.S. Citizenship and Immigration Services, and Migration Policy Institute. The collected data are saved into csv files for later use.


US Census Bureau

Using Census Microdata API, the data is collected from PUMS(Public Use Microdata Sample) of American Community Survey (ACS). This is the main dataset for this project, gathering the indicators of immigrants and native-born Americans. The dataset is 2021 ACS 1-Year Estimates and each row represents one individual or household. Extracted dataset has 12 variables which are following.

CIT: Citizenship status
NATIVITY: Nativity
AGEP: Age
POBP: Place of birth
DECADE: Decade of entry
ENG: Ability to speack English
MAR: Marital status
RAC1P: Recorded detailed race code
SEX: Sex
WAGP: Wages or salary income past 12 months
ESR: Employment status recode
SCHL: Educational attainment

Among those variables, WAGP and SCHL will be mainly used for deciding a person is successful. AGEP variable is subsetted in API query to collect only those who are 20 years old or older because those who don’t are out of scope of this project. For more information about the variables such as a range or values, you can go here.

Code
import requests
import json
import pandas as pd

# Get datatset using API query
acs_raw = requests.get("https://api.census.gov/data/2021/acs/acs1/pums?get=CIT,NATIVITY,POBP,DECADE,ENG,MAR,RAC1P,SEX,ESR,WAGP,SCHL&AGEP=20:99")
acs_raw = acs_raw.json()
acs_raw = pd.DataFrame(acs_raw)

# Make first row as a column header
acs_raw.columns = acs_raw.iloc[0]
acs_raw = acs_raw[1:]
print("The dataset has",acs_raw.shape[0],"rows and",acs_raw.shape[1],"columns.")
acs_raw.head()

# acs_raw.to_csv("./data/acs_raw.csv", index=False)
The dataset has 2533139 rows and 12 columns.
CIT NATIVITY POBP DECADE ENG MAR RAC1P SEX ESR WAGP SCHL AGEP
1 1 1 004 0 0 1 1 1 6 0 11 36
2 1 1 039 0 0 5 1 1 6 0 22 57
3 1 1 046 0 0 5 5 1 6 0 14 29
4 1 1 006 0 0 5 1 1 6 0 1 26
5 1 1 006 0 0 2 1 2 6 0 21 80

World Bank & OECD

As immigrants come from many countries around the world, it is important to have a standard how they are doing compared to their national average. World Bank provides data API for World Development Indicators including economic, environmental, or social indicators on countries. This API is used to collect national employment to population ratio SL.EMP.WORK.ZS. The collected dataset includes 217 countries in a 30-year span between 1991-2021.

National average wage and education attainment are another important indicator for quantifying success. The datasets for those indicators are manually collected from OECD. Annual average wage dataset includes 38 countries’s annual avarage wage in USD. Adult education level dataset provides education attainment ratio among 25-64 year-olds. There are three subjects which are “Below upper secondary”(BUPPSRY), “Tertiary”(TRY), and “Upper secondar”(UPPSRY) featuring 48 countries. The span of these two are same as employment rate data.

Employment rate

Code
library(wbstats)
library(tidyr)
# Employment to poluation ratio
employment = wb_data(indicator = "SL.EMP.TOTL.SP.ZS", start_date = 1991, end_date = 2021, return_wide = FALSE)
employment = employment[, -c(1,2,3,8,9,10,11)]
employment = pivot_wider(employment, id_cols = c(iso3c,country), names_from = date, values_from = value)
dim(employment)
head(employment)

# write.csv(employment, file = "./data/employment_raw.csv", row.names = FALSE)
  1. 217
  2. 33
A tibble: 6 × 33
iso3c country 2021 2020 2019 2018 2017 2016 2015 2014 ... 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
AFG Afghanistan NA 36.711 42.223 42.128 42.021 42.475 42.930 43.374 ... 42.992 42.977 42.971 42.984 42.985 42.993 43.109 43.152 43.168 43.207
ALB Albania 51.367 49.417 53.392 52.006 50.152 48.473 45.960 43.778 ... 49.013 48.335 49.594 52.167 53.263 53.477 50.446 48.194 45.746 58.067
DZA Algeria 35.350 34.530 37.369 37.415 37.496 37.599 37.034 36.580 ... 30.789 31.625 32.557 33.295 32.111 30.576 32.360 32.808 33.356 34.904
ASM American Samoa NA NA NA NA NA NA NA NA ... NA NA NA NA NA NA NA NA NA NA
AND Andorra NA NA NA NA NA NA NA NA ... NA NA NA NA NA NA NA NA NA NA
AGO Angola 68.558 68.545 69.874 69.883 69.906 69.888 69.902 69.920 ... 74.080 74.082 74.089 74.088 74.088 74.060 74.012 73.916 73.950 73.936

Annual average wage

Code
wage = read.csv("./data/DP_LIVE_09102023230706933.csv")
wage = wage[,c(1,6,7)]
colnames(wage) = c('iso3c','TIME','Value')
wage = pivot_wider(wage, id_cols = "iso3c", names_from = TIME, values_from = Value)
wage = subset(wage, !(iso3c %in% c("OECD")))
dim(wage)
head(wage)

# write.csv(wage, file = "./data/wage_raw.csv", row.names = FALSE)
  1. 38
  2. 32
A tibble: 6 × 32
iso3c 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
AUS 42309.38 43173.77 43578.24 43873.86 43715 44976.28 46355.56 46999.64 48064.87 ... 57752 57579 58102 57744 57885 57619 57843 58620 60377 60681.50
AUT 52697.40 53759.06 54197.37 54743.42 55184 54819.62 54346.95 56101.14 57354.90 ... 62515 62568 62801 63231 63860 63856 64101 64623 64648 65402.32
BEL 53018.64 54718.73 55868.83 56952.78 56759 57470.56 58203.38 58238.28 61484.41 ... 64461 65099 65461 65017 65157 64700 65083 65700 63677 65520.82
CAN 42426.30 43045.90 42952.72 42464.18 42369 42745.09 44053.77 44793.21 45157.23 ... 53717 54286 54995 55400 54350 55122 56083 56370 59160 59568.78
CZE NA NA NA NA 16612 18320.99 18606.94 18603.83 19344.55 ... 28934 28754 29360 30224 31291 32891 34633 36071 35552 36269.71
DNK 47454.31 49053.26 49596.11 47871.63 48894 49781.44 50577.45 51462.34 52066.95 ... 61121 61548 62723 63828 63894 64281 64765 65222 66203 66503.05

Adult education attainment rate

Code
education = read.csv("./data/DP_LIVE_09102023230710370.csv")
education = education[,c(1,3,6,7)]
colnames(education) = c('iso3c','SUBJECT','TIME','Value')
education = pivot_wider(education, id_cols = c(iso3c,SUBJECT), names_from = TIME, values_from = Value)
education = subset(education, !(iso3c %in% c("G20","OAVG")))
dim(education)
head(education)

# write.csv(education, file = "./data/education_raw.csv", row.names = FALSE)
  1. 142
  2. 33
A tibble: 6 × 33
iso3c SUBJECT 1991 1993 1994 1995 1997 1998 1999 2000 ... 2014 2015 2016 2017 2018 2019 2020 2021 1992 1996
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
AUS BUPPSRY 44.12706 47.15905 49.80203 44.93585 46.69527 43.95596 42.61008 41.21231 ... 22.89584 20.98090 20.06735 19.01266 18.10864 17.13266 16.24498 15.46888 NA NA
AUS TRY 31.15881 22.46919 23.07237 24.31545 24.30127 25.42071 26.69340 27.47575 ... 41.90185 42.88876 43.74390 45.35567 45.72748 47.12998 49.33745 49.76787 NA NA
AUS UPPSRY 24.71413 30.37176 27.12561 30.74870 29.00345 30.62333 30.69652 31.31195 ... 35.20231 36.13035 36.18875 35.63167 36.16388 35.73737 34.41757 34.76325 NA NA
AUT BUPPSRY NA NA NA NA NA NA NA NA ... 16.14400 15.35130 15.47159 15.03758 14.70253 14.43586 14.34133 14.06036 NA NA
AUT TRY NA NA NA NA NA NA NA NA ... 29.90490 30.55073 31.38396 32.39439 32.71143 33.77378 34.20589 34.60450 NA NA
AUT UPPSRY NA NA NA NA NA NA NA NA ... 53.95110 54.09797 53.14444 52.56804 52.58605 51.79036 51.45277 51.33514 NA NA

USCIS

Being immigrants is a hard process as they go over many processes, including a submitting an application form. This application form is an initial filter that the U.S. uses to determine who can be in the U.S. Therefore it is important to look at the form whether it is reflecting well on applicants characteristics in terms of success.

N-400: Application for Naturalization

Migration Policy Institute

On top of those recored data, this project will engage with text data. Migration Policy Institute provides a overall report on immigrants from specific countries, going over some key points such as English Proficiency, Age, Education, Employment, Income and Poverty, Immigration Pathways and Naturalization. This report will be a good resource to show the gerneral trends of those immigrants. Among many immigrants, this project is looking at four specific immigrants from Korea, Canada, Mexico, and Colombia.

Code
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import nltk

# Define the user agent to mimic a web browser
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'

# Set headers for the HTTP request
headers = {'User-Agent': user_agent}

# Define URLs for different articles
mex_url = "https://www.migrationpolicy.org/article/mexican-immigrants-united-states"
kor_url = "https://www.migrationpolicy.org/article/korean-immigrants-united-states"
col_url = "https://www.migrationpolicy.org/article/colombian-immigrants-united-states"
can_url = "https://www.migrationpolicy.org/article/canadian-immigrants-united-states"
url_list = [mex_url,col_url,can_url,kor_url]

# Define IDs for subsections in the articles
id1 = 'english'
id2 = 'employment'
id3 = 'poverty'
id4 = 'pathways'
id5 = 'unauthorized'
id6 = 'health'
id7 = 'diaspora'
id_list = [id1,id2,id3,id4,id5,id6,id7]

# Function to extract text between two specified elements
def get_subtext(id1, id2):
    start_element = soup.find_all("a",id=re.compile(id1))[0]
    end_element = soup.find_all("a",id=re.compile(id2))[0]
    # Initialize an empty string to store the extracted text
    extracted_text = ""
    current_element = start_element.find_next()
    # Iterate through the siblings between the two elements
    while current_element.get_text() != end_element.find_previous().get_text():
        extracted_text = extracted_text + current_element.get_text() + " "
        current_element = current_element.find_next_sibling()
        if current_element.name == "div":
            current_element = current_element.find_next_sibling()      
    return extracted_text

# Loop through each URL and extract relevant text
txts = []
for i in url_list:
    response = requests.get(i, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    page_txt = ""
    for j in range(len(id_list)-1):
        try:
            sub_text = get_subtext(id_list[j],id_list[j+1])
            page_txt += sub_text
        except:
            continue
    txts.append(page_txt)

labels = ["Mexico", "Colombia", "Canada", "Korea"]
df_txt = pd.DataFrame()

# Loop through each article, tokenize sentences, and create a DataFrame
for i in range(4):
    corpus=[]  # list of strings (input variables X)
    targets=[] # list of targets (labels or response variables Y)

    sentences=nltk.tokenize.sent_tokenize(txts[i])
    counter=0
    min_sentence_length=20
    text_chunk=''

    for sentence in sentences:
        # Remove any douvle spaces
        text_chunk=' '.join(sentence.split()).strip()
        corpus.append(text_chunk)

    tmp=[]
    for j in range(0,len(corpus)):
        tmp.append(corpus[j])
    df=pd.DataFrame(tmp)
    df=df.rename(columns={0: "text"})
    df["label"] = labels[i]
    df_txt = pd.concat([df_txt, df], axis=0)
    
df_txt.head()

# df_txt.to_csv("./data/MPI_raw.csv", index=False)
text label
0 Mexican immigrants are less likely to be profi... Mexico
1 In 2021, about 65 percent of Mexicans ages 5 a... Mexico
2 Approximately 6 percent of Mexican immigrants ... Mexico
3 Note: Limited English Proficient (LEP) status ... Mexico
4 Their median age was about 46 years old, compa... Mexico
Back to top