Data Exploration

Exploratory Data Analysis (EDA) is an important step for seeing general trends and gaining insight into a dataset. Visualization provides evidence of possible patterns or outliers that need to be dealt with. US Census, World Bank, and OECD data will be used to examine the differences between immigrants and native-borns. USCIS and MPI data will be visualized as word clouds to show how frequently each word is used.

General Trends

Data Preparation

# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings

sns.set_theme(palette="Set2")
warnings.filterwarnings("ignore")

# Read data
wage = pd.read_csv("./data/wage_cleaned.csv")
employment = pd.read_csv("./data/employment_cleaned.csv")
education = pd.read_csv("./data/education_cleaned.csv")
acs = pd.read_csv("./data/acs_cleaned.csv")

# Set necessary categorical variables
categorical_cols = ["NATIVITY", "POBP", "DECADE", "ENG", "MAR",
                    "RAC1P", "SEX", "ESR", "SCHL", "SUCCESS"]
for col in categorical_cols:
    acs[col] = acs[col].astype("category")

Code

fig, axs = plt.subplots(5,2, figsize=(16,16))
sns.histplot(x="ENG", data=acs[acs["NATIVITY"]==1], stat="percent", label="Native-borns", ax=axs[0,0]).set(xlabel='English Proficiency', title='Native-borns')
sns.histplot(x="ENG", data=acs[acs["NATIVITY"]==2], hue="NATIVITY", label="Immigrants", stat="percent",ax=axs[0,1]).set(xlabel='English Proficiency', title='Immigrants')
sns.histplot(x="RAC1P", data=acs[acs["NATIVITY"]==1], stat="percent",label="Native-borns",ax=axs[1,0]).set(xlabel="Racial Categories")
sns.histplot(x="RAC1P", data=acs[acs["NATIVITY"]==2], hue="NATIVITY", label="Immigrants",stat="percent",ax=axs[1,1],legend=False).set(xlabel="Racial Categories")
sns.histplot(x="SCHL", data=acs[acs["NATIVITY"]==1], stat="percent",label="Native-borns",ax=axs[2,0]).set(xlabel="Education Attainment")
sns.histplot(x="SCHL", data=acs[acs["NATIVITY"]==2], hue="NATIVITY", label="Immigrants", stat="percent",ax=axs[2,1],legend=False).set(xlabel="Education Attainment")
sns.histplot(x="MAR", data=acs[acs["NATIVITY"]==1], stat="percent",label="Native-borns",ax=axs[3,0]).set(xlabel="Marital Status")
sns.histplot(x="MAR", data=acs[acs["NATIVITY"]==2], hue="NATIVITY", label="Immigrants", stat="percent",ax=axs[3,1],legend=False).set(xlabel="Marital Status")
sns.histplot(x="ESR", data=acs[acs["NATIVITY"]==1], stat="percent",label="Native-borns",ax=axs[4,0]).set(xlabel="Employment Status")
sns.histplot(x="ESR", data=acs[acs["NATIVITY"]==2], hue="NATIVITY", label="Immigrants", stat="percent",ax=axs[4,1],legend=False).set(xlabel="Employment Status")
axs[0,0].legend()
axs[0,1].legend()
plt.suptitle("Feature Comparison between Native-borns and Immigrants")
plt.tight_layout()
plt.show()

English Proficiency
More than 90% of native-borns speak only English (0). Among immigrants, only 40% responded that they speak English very well (1).
Racial Category
Among native-borns, “White alone” (1) is the majority. Among immigrants, “Asian alone” (6) is the largest group, followed by “White alone” (1) and “Two or More Races” (9).
Education Attainment
A larger share of immigrants than native-borns hold a Bachelor’s degree (21). The opposite holds for “Regular high school diploma” (16).
Marital Status
Immigrants and native-borns have similar distributions of marital status. “Married” (1) is the most common status in both groups, although the share of “Never married” (5) is slightly higher among native-borns.
Employment Status
Lastly, 50% of both native-borns and immigrants are “Civilian employed, at work” (1), followed by “Not in Labor Force” (6).
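
The percentages read off the histograms can also be tabulated directly. The following is a minimal sketch on invented toy data (the `toy` frame is an assumption, not the ACS extract) that uses a row-normalized cross-tabulation, so that each nativity group sums to 100 percent.

```python
import pandas as pd

# Toy stand-in for the ACS extract; the values are invented for illustration.
toy = pd.DataFrame({
    "NATIVITY": [1, 1, 1, 1, 2, 2, 2, 2],
    "ENG":      [0, 0, 0, 1, 1, 1, 2, 3],
})

# Row-normalized cross-tabulation: each NATIVITY row sums to 100 percent.
share = pd.crosstab(toy["NATIVITY"], toy["ENG"], normalize="index") * 100
print(share.round(1))
```

The same call with `acs["NATIVITY"]` and any of the categorical columns would give the numbers behind each pair of panels.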

Wage

Code

sns.boxplot(x="WAGP",y="NATIVITY",hue="NATIVITY",data=acs)
plt.xlabel("Wage($)")
plt.tight_layout()
plt.show()

As the plot shows, the wage column has many extreme values. A log transformation is applied here to compress the data into a narrower range, so that the bulk of the distribution is easier to compare.
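
To illustrate why the transformation helps, here is a small sketch on invented wage values. Because `np.log(0)` is `-inf`, zeros have to be handled first; the analysis code that follows replaces them with 2, and `np.log1p` is shown only as a common alternative, not as what this report uses.

```python
import numpy as np

# Invented wage values spanning several orders of magnitude.
wages = np.array([0.0, 500.0, 20_000.0, 45_000.0, 120_000.0, 600_000.0])

# Replace zeros before taking the log, mirroring the floor of 2 used in the analysis.
floored = np.where(wages == 0, 2.0, wages)
log_wages = np.log(floored)

# Alternative (an assumption, not the report's method): log(1 + x) keeps zeros at 0.
log1p_wages = np.log1p(wages)

print("raw range:", wages.max() - wages.min())
print("log range:", log_wages.max() - log_wages.min())
```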

Code

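# Replace zero wages with 2 so that the logarithm is defined (log(0) is -inf)
acs.loc[acs['WAGP'] == 0, 'WAGP'] = 2
# Log-transform the wage column
acs["NORM_WAGP"] = np.log(acs["WAGP"])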

print("The median wage difference between native-borns and immigrants is",np.exp(acs[acs["NATIVITY"]==2]["NORM_WAGP"].median())-np.exp(acs[acs["NATIVITY"]==1]["NORM_WAGP"].median()),"in US dollars.")
sns.boxplot(x="NORM_WAGP",y="NATIVITY",hue="NATIVITY",data=acs)
plt.xlabel("Normalized Wage")
plt.tight_layout()
plt.show()

The median wage difference between native-borns and immigrants is 4100.0 in US dollars.

The difference in median wage between native-borns and immigrants is now easier to see than before. However, the difference is not large.
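
Since the logarithm is monotonic, the median of the log wages maps back to the median of the raw wages, which is why the dollar gap can be recovered with `np.exp`. The sketch below checks this on invented data with an odd number of rows per group; with an even number, the median averages the two middle values and the two results can differ slightly.

```python
import numpy as np
import pandas as pd

# Invented wages, three rows per nativity group.
toy = pd.DataFrame({
    "NATIVITY": [1, 1, 1, 2, 2, 2],
    "WAGP": [20_000, 40_000, 90_000, 25_000, 45_000, 300_000],
})
toy["NORM_WAGP"] = np.log(toy["WAGP"])

# Median taken directly, and median taken on the log scale then mapped back.
median_raw = toy.groupby("NATIVITY")["WAGP"].median()
median_back = np.exp(toy.groupby("NATIVITY")["NORM_WAGP"].median())

# Immigrants (2) minus native-borns (1), in dollars.
gap = median_raw.loc[2] - median_raw.loc[1]
print(gap)
```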

Wage vs Other indicators

Code

fig, axs = plt.subplots(3,2, figsize=(16, 16))
sns.pointplot(data=acs, x="SEX", y="NORM_WAGP", hue="NATIVITY", ax=axs[0,0]).set(xlabel="Sex", ylabel = "Normalized Wage")
sns.pointplot(data=acs, x="ENG", y="NORM_WAGP", hue="NATIVITY", ax=axs[1,0]).set(xlabel="English Proficiency", ylabel = "Normalized Wage")
sns.pointplot(data=acs, x="MAR", y="NORM_WAGP", hue="NATIVITY", ax=axs[2,0]).set(xlabel="Marital Status", ylabel = "Normalized Wage")
sns.pointplot(data=acs, x="RAC1P", y="NORM_WAGP", hue="NATIVITY", ax=axs[0,1]).set(xlabel="Racial Categories", ylabel = "Normalized Wage")
sns.pointplot(data=acs, x="SCHL", y="NORM_WAGP", hue="NATIVITY", ax=axs[1,1]).set(xlabel="Education Attainment", ylabel = "Normalized Wage")
sns.pointplot(data=acs, x="ESR", y="NORM_WAGP", hue="NATIVITY", ax=axs[2,1]).set(xlabel="Employment Status", ylabel = "Normalized Wage")
plt.suptitle("Normalized Wage vs Other Indicators")
plt.tight_layout()
plt.show()

Sex
Among females (2), wages are similar for native-borns and immigrants, whereas among males (1), immigrants earn higher wages.
Racial Category
Among “White alone” (1), “Asian alone” (6), and “Two or More Races” (9), native-borns earn more than immigrants. Among immigrants, the “Alaska Native alone” (4) group has a much higher wage than any other group, followed by “Black or African American alone” (2).
English Proficiency
Immigrants earn more than native-borns when they speak English well enough (0-2). When they do not speak it well (3-4), native-borns earn more.
Education Attainment
In almost every education category, immigrants earn more than native-borns. Two things are worth noting. First, wages are quite high among people who only finished “Nursery school, preschool” (2). Second, “Bachelor’s degree” (21) and “Associate’s degree” (20) are the only categories in which native-borns' wages are comparable to those of immigrants.
Marital Status
As with education attainment, immigrants have higher wages in every marital status. Being “Widowed” (2) is associated with a markedly lower wage, and being “Never married” (5) with a markedly higher wage.
Employment Status
There is no large difference in wage between native-borns and immigrants by employment status. Those employed in the “Armed Forces” (4, 5) have the highest wages, followed by “Civilian employed, at work” (1).
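
By default, `sns.pointplot` draws the mean of each group, so an unusually high point can come from a handful of respondents. A rough way to check is to tabulate the mean together with the group size, as sketched below on invented data; the `small_cell` threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Invented rows; one cell (SCHL 2, immigrants) has a single respondent.
toy = pd.DataFrame({
    "NATIVITY":  [1, 1, 1, 1, 2, 2, 2],
    "SCHL":      [16, 16, 21, 21, 16, 21, 2],
    "NORM_WAGP": np.log([30_000, 34_000, 60_000, 70_000, 28_000, 80_000, 90_000]),
})

# Mean log wage and number of respondents per (education, nativity) cell.
summary = (
    toy.groupby(["SCHL", "NATIVITY"])["NORM_WAGP"]
       .agg(mean="mean", n="count")
       .reset_index()
)

# Flag cells with too few respondents to trust; the cutoff of 2 is arbitrary.
summary["small_cell"] = summary["n"] < 2
print(summary)
```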

Country Comparison

Since immigrants come from different places, comparing them with their countries of origin brings a different perspective on success. In this section, we look at four countries, Mexico, Colombia, Canada, and Korea, and their national averages of wage, education attainment, and employment rate.

Wage

Code

acs["SCHL"] = acs["SCHL"].astype(int)
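# Copy each subset so that columns added later do not modify a view of acs
kor = acs[acs["POBP"]=="Korea"].copy()
mex = acs[acs["POBP"]=="Mexico"].copy()
col = acs[acs["POBP"]=="Colombia"].copy()
can = acs[acs["POBP"]=="Canada"].copy()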
fig, axs = plt.subplots(2,2, figsize=(16, 9))
sns.boxplot(x="NORM_WAGP", y= "NATIVITY", hue='NATIVITY', data=kor, ax = axs[0,0]).set(xlabel="Normalized Wage", title = "Korea")
sns.boxplot(x="NORM_WAGP", y= "NATIVITY", hue='NATIVITY', data=mex, ax = axs[1,0]).set(xlabel="Normalized Wage", title = "Mexico")
sns.boxplot(x="NORM_WAGP", y= "NATIVITY", hue='NATIVITY', data=col, ax = axs[0,1]).set(xlabel="Normalized Wage", title = "Colombia")
sns.boxplot(x="NORM_WAGP", y= "NATIVITY", hue='NATIVITY', data=can, ax = axs[1,1]).set(xlabel="Normalized Wage", title = "Canada")
axs[0,0].axvline(x=np.log(wage[wage.country == "Korea"].iloc[0,31]), color='red', linestyle='--',label = "National Average")
axs[1,0].axvline(x=np.log(wage[wage.country == "Mexico"].iloc[0,31]), color='red', linestyle='--', label = "National Average")
axs[0,1].axvline(x=np.log(wage[wage.country == "Colombia"].iloc[0,31]), color='red', linestyle='--', label = "National Average")
axs[1,1].axvline(x=np.log(wage[wage.country == "Canada"].iloc[0,31]), color='red', linestyle='--', label = "National Average")
axs[0,0].legend()
axs[1,0].legend()
axs[0,1].legend()
axs[1,1].legend()
plt.suptitle("Origin Country Comparison on Normalized Wage")
plt.tight_layout()
plt.show()

The red lines represent the national average of each country. In all four groups, both native-borns and immigrants have a lower median wage than residents of the origin country. As addressed in the prior research in the introduction, immigrants from Korea and Canada, which are relatively more developed than Mexico and Colombia, have a lower median wage than native-borns and even than people from Colombia and Mexico.
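
The reference lines are read with positional indexing (`iloc[0,31]`), which silently picks a different value if the column order changes. A possible alternative is to look the value up by column name, as in the sketch below. The table layout, the year column names, and the helper `national_log_wage` are all hypothetical, since the real column names of the wage file are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: one row per country, one column per year.
wage_toy = pd.DataFrame({
    "country": ["Korea", "Mexico"],
    "2020": [41_000.0, 16_000.0],
    "2021": [42_000.0, 16_500.0],
})

def national_log_wage(table, country, year_col):
    """Return the log of a country's value in the given year column."""
    row = table.loc[table["country"] == country, year_col]
    if row.empty:
        raise KeyError(f"no row for {country!r}")
    return float(np.log(row.iloc[0]))

ref = national_log_wage(wage_toy, "Korea", "2021")
print(ref)
```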

Education Attainment

Code

# Use a string default so that the choices and the default share one dtype
kor["edu"] = np.select([kor["SCHL"] < 19, kor["SCHL"] >= 19], ["Below", "Upper"], default="Unknown")
mex["edu"] = np.select([mex["SCHL"] < 19, mex["SCHL"] >= 19], ["Below", "Upper"], default="Unknown")
col["edu"] = np.select([col["SCHL"] < 19, col["SCHL"] >= 19], ["Below", "Upper"], default="Unknown")
can["edu"] = np.select([can["SCHL"] < 19, can["SCHL"] >= 19], ["Below", "Upper"], default="Unknown")
fig, axs = plt.subplots(2,2, figsize=(16, 9))
sns.histplot(x="edu", stat='percent', hue="NATIVITY", data=kor[kor["NATIVITY"]==2],ax=axs[0,0]).set(xlabel="Education Attainment", title = "Korea")
sns.histplot(x="edu", stat='percent', hue="NATIVITY", data=mex[mex["NATIVITY"]==2],ax=axs[1,0]).set(xlabel="Education Attainment", title = "Mexico")
sns.histplot(x="edu", stat='percent', hue="NATIVITY", data=col[col["NATIVITY"]==2],ax=axs[0,1]).set(xlabel="Education Attainment", title = "Colombia")
sns.histplot(x="edu", stat='percent', hue="NATIVITY", data=can[can["NATIVITY"]==2],ax=axs[1,1]).set(xlabel="Education Attainment", title = "Canada")
axs[0,0].axhline(y=(education[education["country"]=="Korea"].iloc[1,32]), color='red', linestyle='--',label="Below tertiary")
axs[0,0].axhline(y=(education[education["country"]=="Korea"].iloc[2,32]), color='green', linestyle='--',label="Upper tertiary")
axs[1,0].axhline(y=(education[education["country"]=="Mexico"].iloc[1,32]), color='red', linestyle='--',label="Below tertiary")
axs[1,0].axhline(y=(education[education["country"]=="Mexico"].iloc[2,32]), color='green', linestyle='--',label="Upper tertiary")
axs[0,1].axhline(y=(education[education["country"]=="Colombia"].iloc[1,32]), color='red', linestyle='--',label="Below tertiary")
axs[0,1].axhline(y=(education[education["country"]=="Colombia"].iloc[2,32]), color='green', linestyle='--',label="Upper tertiary")
axs[1,1].axhline(y=(education[education["country"]=="Canada"].iloc[1,32]), color='red', linestyle='--',label="Below tertiary")
axs[1,1].axhline(y=(education[education["country"]=="Canada"].iloc[2,32]), color='green', linestyle='--',label="Upper tertiary")
axs[0,0].legend()
axs[1,0].legend()
axs[0,1].legend()
axs[1,1].legend()
plt.suptitle("Origin Country Comparison on Education Attainment (Only immigrants)")
plt.tight_layout()
plt.show()

Korea and Canada have similar trends in education attainment. A higher percentage of immigrants from both countries attained upper tertiary education than below tertiary education. The same holds for their national averages, yet the immigrants' share of upper tertiary education is higher than the national average. On the other hand, Mexican and Colombian immigrants have higher education attainment than residents of their origin countries in both below and upper tertiary education.
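
The bar heights compared against the national lines are simply the share of respondents on each side of the education cutoff. A minimal sketch on invented `SCHL` codes, using the same cutoff of 19 as the code above:

```python
import numpy as np
import pandas as pd

# Invented SCHL codes for a handful of respondents.
toy = pd.DataFrame({"SCHL": [12, 16, 18, 19, 21, 22, 24, 21]})

# Codes below 19 are labeled "Below", the rest "Upper".
toy["edu"] = np.where(toy["SCHL"] < 19, "Below", "Upper")

# Percentage of respondents in each education band.
share = toy["edu"].value_counts(normalize=True) * 100
print(share)
```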

Employment Rate

Code

# Use a string default so that the choices and the default share one dtype
kor["employment"] = np.select([kor["ESR"].isin([1,2,4,5]), kor["ESR"].isin([0,3,6])], ["Employed", "Unemployed"], default="Unknown")
mex["employment"] = np.select([mex["ESR"].isin([1,2,4,5]), mex["ESR"].isin([0,3,6])], ["Employed", "Unemployed"], default="Unknown")
col["employment"] = np.select([col["ESR"].isin([1,2,4,5]), col["ESR"].isin([0,3,6])], ["Employed", "Unemployed"], default="Unknown")
can["employment"] = np.select([can["ESR"].isin([1,2,4,5]), can["ESR"].isin([0,3,6])], ["Employed", "Unemployed"], default="Unknown")
fig, axs = plt.subplots(2,2, figsize=(16, 9))
sns.histplot(x="employment", stat='percent', hue="NATIVITY", data=kor[kor["NATIVITY"]==2],ax=axs[0,0]).set(xlabel="Employment Status", title = "Korea")
sns.histplot(x="employment", stat='percent', hue="NATIVITY", data=mex[mex["NATIVITY"]==2],ax=axs[1,0]).set(xlabel="Employment Status", title = "Mexico")
sns.histplot(x="employment", stat='percent', hue="NATIVITY", data=col[col["NATIVITY"]==2],ax=axs[0,1]).set(xlabel="Employment Status", title = "Colombia")
sns.histplot(x="employment", stat='percent', hue="NATIVITY", data=can[can["NATIVITY"]==2],ax=axs[1,1]).set(xlabel="Employment Status", title = "Canada")
axs[0,0].axhline(y=employment[employment["country"] == "Korea"].iloc[0,31], color='red', linestyle='--',label = "National Average")
axs[1,0].axhline(y=employment[employment["country"] == "Mexico"].iloc[0,31], color='red', linestyle='--',label = "National Average")
axs[0,1].axhline(y=employment[employment["country"] == "Colombia"].iloc[0,31], color='red', linestyle='--',label = "National Average")
axs[1,1].axhline(y=employment[employment["country"] == "Canada"].iloc[0,31], color='red', linestyle='--',label = "National Average")
axs[0,0].legend()
axs[1,0].legend()
axs[0,1].legend()
axs[1,1].legend()
axs[0,0].set_ylim(0,100)
axs[1,0].set_ylim(0,100)
axs[0,1].set_ylim(0,100)
axs[1,1].set_ylim(0,100)
plt.suptitle("Origin Country Comparison on Employment Rate (Only immigrants)")
plt.tight_layout()
plt.show()

In terms of employment rate, immigrants from Korea and Colombia are employed at a higher rate than nationals of their origin countries; in particular, all Korean immigrants in the data are employed. Mexican immigrants have about the same employment rate as Mexican nationals, while Canadian immigrants are less employed than Canadian nationals.
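
The code above groups ESR codes 0, 3, and 6 under "Unemployed", so people who are "Not in Labor Force" (6) count toward the denominator. The sketch below, on invented codes, shows how much the rate depends on that choice; which denominator matches the national figures is an assumption that should be checked against the source of the employment data.

```python
import pandas as pd

EMPLOYED_CODES = [1, 2, 4, 5]        # employed, as grouped in the analysis
LABOR_FORCE_CODES = [1, 2, 3, 4, 5]  # employed plus unemployed (3)

# Invented ESR codes for eight respondents.
toy = pd.DataFrame({"ESR": [1, 1, 2, 3, 6, 6, 4, 1]})
employed = toy["ESR"].isin(EMPLOYED_CODES)

# Share employed among everyone, as plotted in the figure.
rate_all = employed.mean() * 100

# Share employed among labor-force participants only.
in_labor_force = toy["ESR"].isin(LABOR_FORCE_CODES)
rate_labor_force = employed[in_labor_force].mean() * 100

print(rate_all, round(rate_labor_force, 1))
```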

N-400 Word Cloud

Code

from wordcloud import WordCloud, STOPWORDS
from collections import Counter

words_to_remove = ["information","use","give","state","provide"]

# Update the stopwords with the additional words to be removed
stopwords = set(STOPWORDS)
stopwords.update(words_to_remove)

# Read the first line of the cleaned N-400 text and close the file afterwards
with open("data/cleaned_n_400.txt", "r") as f:
    txt = f.readline()
wordcloud = WordCloud(
    width = 3000,
    height = 2000, 
    random_state=1, 
    background_color='salmon', 
    colormap='Pastel1', 
    collocations=False,
    stopwords = stopwords).generate(txt)
plt.figure(figsize=(30, 20))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()

print("Top 20 most used words")

words = txt.split()
word_count = Counter(words)
for word in list(word_count):
    if word in words_to_remove:
        del word_count[word]
word_count.most_common(20)

Top 20 most used words

[('address', 57),
 ('spouse', 53),
 ('current', 46),
 ('foreign', 44),
 ('date', 43),
 ('country', 32),
 ('child', 30),
 ('person', 29),
 ('application', 29),
 ('applicant', 26),
 ('signature', 26),
 ('year', 25),
 ('citizen', 25),
 ('applicable', 25),
 ('complete', 24),
 ('item', 22),
 ('birth', 22),
 ('group', 22),
 ('prior', 19),
 ('family', 18)]

From the word cloud, words like “address”, “spouse”, “child”, “citizen”, and “birth” are common and relate to the indicators. This might indicate that marital status would be a good indicator of immigrants' success.
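
The counting step can be sketched in isolation. The example below filters the extra stopwords while counting rather than deleting them afterwards; the sample sentence is invented, and only the stopword list is taken from the code above.

```python
from collections import Counter

# Extra words excluded in the analysis above.
extra_stopwords = {"information", "use", "give", "state", "provide"}

# Invented sample text standing in for the cleaned form text.
text = "address spouse information address date spouse address use"

# Count every whitespace-separated token that is not an extra stopword.
counts = Counter(w for w in text.split() if w not in extra_stopwords)
top = counts.most_common(2)
print(top)
```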

MPI Word Cloud

Code

from wordcloud import WordCloud, STOPWORDS
from collections import Counter

mpi = pd.read_csv("./data/MPI_cleaned.csv")
txts = " ".join(list(mpi["text"]))

words_to_remove = ["immigrant","bear","percent","mexican","canadian","korean","canadians","united","states","approximately","figure","rate","colombian","mexicans"]

# Update the stopwords with the additional words to be removed
stopwords = set(STOPWORDS)
stopwords.update(words_to_remove)

wordcloud = WordCloud(
    width = 3000,
    height = 2000, 
    random_state=1, 
    background_color='salmon', 
    colormap='Pastel1', 
    collocations=False,
    stopwords = stopwords).generate(txts)
plt.figure(figsize=(30, 20))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()

print("Top 20 most used words")

words = txts.split()
word_count = Counter(words)
for word in list(word_count):
    if word in words_to_remove:
        del word_count[word]
word_count.most_common(20)

Top 20 most used words

[('population', 44),
 ('compare', 34),
 ('overall', 28),
 ('foreign', 26),
 ('high', 24),
 ('age', 23),
 ('likely', 19),
 ('native', 19),
 ('english', 17),
 ('year', 17),
 ('old', 16),
 ('unauthorized', 12),
 ('daca', 12),
 ('education', 11),
 ('student', 11),
 ('country', 11),
 ('adult', 10),
 ('international', 10),
 ('citizen', 10),
 ('total', 10)]

We can see that “age”, “english”, and “education” are fairly common words in the reports.
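
As a quick cross-check, the two top-20 lists printed above can be intersected to see which words appear in both sources. The lists below are copied from those outputs; nothing else is assumed.

```python
# Top-20 words from the N-400 form, copied from the output above.
n400_top = [
    "address", "spouse", "current", "foreign", "date", "country", "child",
    "person", "application", "applicant", "signature", "year", "citizen",
    "applicable", "complete", "item", "birth", "group", "prior", "family",
]

# Top-20 words from the MPI reports, copied from the output above.
mpi_top = [
    "population", "compare", "overall", "foreign", "high", "age", "likely",
    "native", "english", "year", "old", "unauthorized", "daca", "education",
    "student", "country", "adult", "international", "citizen", "total",
]

# Words that appear in both lists, in alphabetical order.
shared = sorted(set(n400_top) & set(mpi_top))
print(shared)
```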