Data Exploration

Exploratory data analysis (EDA) is an important step for seeing general trends and gaining insight into the dataset. Visualization provides evidence of possible patterns or outliers that need to be dealt with. US Census, World Bank, and OECD data are used to compare immigrants with native-born residents. USCIS and MPI text data are visualized as word clouds to show how frequently each word is used.


Country Comparison

Because immigrants come from different places, comparing them with residents of their origin countries offers a different perspective on success. This section looks at four countries, Mexico, Colombia, Canada, and Korea, and compares their national averages for wage, education attainment, and employment rate.

Wage

Code
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

acs["SCHL"] = acs["SCHL"].astype(int)

# Subset by place of birth; .copy() lets later cells add columns without a SettingWithCopyWarning
kor = acs[acs["POBP"] == "Korea"].copy()
mex = acs[acs["POBP"] == "Mexico"].copy()
col = acs[acs["POBP"] == "Colombia"].copy()
can = acs[acs["POBP"] == "Canada"].copy()

# (subset, country name, subplot position)
panels = [(kor, "Korea", (0, 0)), (mex, "Mexico", (1, 0)), (col, "Colombia", (0, 1)), (can, "Canada", (1, 1))]

fig, axs = plt.subplots(2, 2, figsize=(16, 9))
for df, name, pos in panels:
    ax = axs[pos]
    sns.boxplot(x="NORM_WAGP", y="NATIVITY", hue="NATIVITY", data=df, ax=ax).set(xlabel="Normalized Wage", title=name)
    # Red line: log of the value at column position 31 of the country's first row in `wage`
    ax.axvline(x=np.log(wage[wage.country == name].iloc[0, 31]), color="red", linestyle="--", label="National Average")
    ax.legend()
plt.suptitle("Origin Country Comparison on Normalized Wage")
plt.tight_layout()
plt.show()

The red lines mark the national average of each origin country. For all four countries, both the native-born and the immigrant groups have a lower median wage than residents of the origin country. As noted in the prior research discussed in the introduction, immigrants from Korea and Canada, which are more developed than Mexico and Colombia, have a lower median wage than native-borns and even than people from Colombia and Mexico.
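The red line is drawn at the log of the national average, which suggests that NORM_WAGP is on a log scale. Under that assumption, the comparison behind the statement above can be sketched as the gap between a group's median log wage and the log of the national average. The function name and the wage figures below are hypothetical and only illustrate the arithmetic; they are not ACS or OECD values.

```python
import math
from statistics import median


def log_median_gap(wages, national_average):
    """Median of log wages minus the log of the national average.

    A negative value means the group's median lies to the left of the red line.
    """
    # Zero or negative wages have no logarithm, so they are skipped
    log_wages = [math.log(w) for w in wages if w > 0]
    return median(log_wages) - math.log(national_average)


# Hypothetical annual wages, for illustration only
sample_wages = [18_000, 25_000, 32_000, 40_000, 75_000]
gap = log_median_gap(sample_wages, national_average=45_000)
print(f"log-wage gap: {gap:.3f}")
```

Because the logarithm is monotonic, the median of the log wages equals the log of the median wage, so a negative gap simply means the median wage is below the national average.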

Education Attainment

Code
# Split SCHL at code 19: below 19 -> "Below", 19 and above -> "Upper".
# The default is a string because mixing string choices with default=0 can raise a TypeError on recent NumPy versions.
for df in (kor, mex, col, can):
    df["edu"] = np.select([df["SCHL"] < 19, df["SCHL"] >= 19], ["Below", "Upper"], default="Unknown")

# (subset, country name, subplot position), same layout as the wage figure
panels = [(kor, "Korea", (0, 0)), (mex, "Mexico", (1, 0)), (col, "Colombia", (0, 1)), (can, "Canada", (1, 1))]

fig, axs = plt.subplots(2, 2, figsize=(16, 9))
for df, name, pos in panels:
    ax = axs[pos]
    # Immigrants only (NATIVITY == 2)
    sns.histplot(x="edu", stat="percent", hue="NATIVITY", data=df[df["NATIVITY"] == 2], ax=ax).set(xlabel="Education Attainment", title=name)
    # National shares: rows 1 and 2 of the country's rows in `education`, column position 32
    national = education[education["country"] == name]
    ax.axhline(y=national.iloc[1, 32], color="red", linestyle="--", label="Below tertiary")
    ax.axhline(y=national.iloc[2, 32], color="green", linestyle="--", label="Upper tertiary")
    ax.legend()
plt.suptitle("Origin Country Comparison on Education Attainment (Only immigrants)")
plt.tight_layout()
plt.show()

Korea and Canada show similar trends in education attainment. Among immigrants from both countries, a higher percentage have upper tertiary education than below tertiary education, and the same holds for the national averages. Immigrants, however, have a higher share of upper tertiary education than their national average. Mexican and Colombian immigrants, on the other hand, exceed their national averages in both the below and the upper tertiary categories.
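The bars in the education panels come from a two-way split of the SCHL code at 19, followed by a percent share per group. A minimal sketch of that step in plain Python is shown here; the function name and the sample codes are hypothetical, and only the threshold of 19 is taken from the notebook code.

```python
from collections import Counter


def education_shares(schl_codes, threshold=19):
    """Percent of people below and at-or-above the SCHL threshold."""
    if not schl_codes:
        raise ValueError("schl_codes must not be empty")
    labels = ["Upper" if code >= threshold else "Below" for code in schl_codes]
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * counts[label] / total for label in ("Below", "Upper")}


# Hypothetical SCHL codes for eight respondents, for illustration only
sample = [16, 18, 19, 21, 21, 22, 23, 24]
shares = education_shares(sample)
print(shares)
```

The two shares always sum to 100, so each immigrant bar is compared with the national line of the same category (red for below, green for upper) rather than with the other bar.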

Employment Rate

Code
# ESR 1, 2, 4, 5 (civilian or armed forces, with a job) -> "Employed".
# ESR 0, 3, 6 -> "Unemployed"; note that this group also includes people who are not in the labor force (ESR 6).
# The default is a string because mixing string choices with default=0 can raise a TypeError on recent NumPy versions.
for df in (kor, mex, col, can):
    df["employment"] = np.select([df["ESR"].isin([1, 2, 4, 5]), df["ESR"].isin([0, 3, 6])], ["Employed", "Unemployed"], default="Unknown")

# (subset, country name, subplot position), same layout as the wage figure
panels = [(kor, "Korea", (0, 0)), (mex, "Mexico", (1, 0)), (col, "Colombia", (0, 1)), (can, "Canada", (1, 1))]

fig, axs = plt.subplots(2, 2, figsize=(16, 9))
for df, name, pos in panels:
    ax = axs[pos]
    # Immigrants only (NATIVITY == 2)
    sns.histplot(x="employment", stat="percent", hue="NATIVITY", data=df[df["NATIVITY"] == 2], ax=ax).set(xlabel="Employment Status", title=name)
    # Red line: value at column position 31 of the country's first row in `employment`
    ax.axhline(y=employment[employment["country"] == name].iloc[0, 31], color="red", linestyle="--", label="National Average")
    ax.legend()
    ax.set_ylim(0, 100)
plt.suptitle("Origin Country Comparison on Employment Rate (Only immigrants)")
plt.tight_layout()
plt.show()

In terms of employment rate, immigrants from Korea and Colombia are employed at a higher rate than nationals of their origin countries; in particular, all Korean immigrants in the data are employed. Mexican immigrants have about the same employment rate as Mexican nationals, while Canadian immigrants are less employed than Canadian nationals.
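The notebook code groups ESR codes 0, 3, and 6 under "Unemployed". In the ACS coding, 3 is unemployed and 6 is not in the labor force, so the plotted "Employed" share is closer to an employment-to-population ratio than to a labor-force employment rate. Which denominator the national figures use is not stated here, so the sketch below computes both as an assumption-laden illustration. The function name and sample codes are hypothetical.

```python
from collections import Counter

EMPLOYED = {1, 2, 4, 5}     # civilian or armed forces, with a job
UNEMPLOYED = {3}            # in the labor force, without a job
NOT_IN_LABOR_FORCE = {6}    # neither working nor looking for work


def employment_summary(esr_codes):
    """Employment share measured against two different denominators."""
    counts = Counter()
    for code in esr_codes:
        if code in EMPLOYED:
            counts["employed"] += 1
        elif code in UNEMPLOYED:
            counts["unemployed"] += 1
        elif code in NOT_IN_LABOR_FORCE:
            counts["not_in_labor_force"] += 1
        # Any other code (for example 0 for "not applicable") is skipped

    population = sum(counts.values())
    labor_force = counts["employed"] + counts["unemployed"]
    return {
        "employment_to_population": 100 * counts["employed"] / population,
        "employment_in_labor_force": 100 * counts["employed"] / labor_force,
    }


# Hypothetical ESR codes for ten respondents, for illustration only
sample = [1, 1, 1, 2, 3, 6, 6, 6, 6, 0]
summary = employment_summary(sample)
print(summary)
```

The two figures can differ widely when many respondents are outside the labor force, which is worth checking before reading a gap between immigrants and nationals as a real difference.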

N-400 Word Cloud

Code
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

words_to_remove = ["information","use","give","state","provide"]

# Update the stopwords with the additional words to be removed
stopwords = set(STOPWORDS)
stopwords.update(words_to_remove)

# Read only the first line of the cleaned N-400 text file, closing the file afterwards
with open("data/cleaned_n_400.txt", "r") as f:
    txt = f.readline()
wordcloud = WordCloud(
    width = 3000,
    height = 2000, 
    random_state=1, 
    background_color='salmon', 
    colormap='Pastel1', 
    collocations=False,
    stopwords = stopwords).generate(txt)
plt.figure(figsize=(30, 20))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()

print("Top 20 most used words")

# Count words, skipping the custom words that were also removed from the cloud
words = txt.split()
word_count = Counter(word for word in words if word not in words_to_remove)
word_count.most_common(20)

Top 20 most used words
[('address', 57),
 ('spouse', 53),
 ('current', 46),
 ('foreign', 44),
 ('date', 43),
 ('country', 32),
 ('child', 30),
 ('person', 29),
 ('application', 29),
 ('applicant', 26),
 ('signature', 26),
 ('year', 25),
 ('citizen', 25),
 ('applicable', 25),
 ('complete', 24),
 ('item', 22),
 ('birth', 22),
 ('group', 22),
 ('prior', 19),
 ('family', 18)]

In the word cloud, “address”, “spouse”, “child”, “citizen”, and “birth” are common words related to the indicators. This might indicate that marital status would be a good indicator of immigrant success.
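The ranking above removes only the custom words from the counter, while the word cloud also filters the built-in stop words. A minimal sketch of counting with one combined stop-word set is given below. The function name, the tiny stop-word set, and the sample sentence are hypothetical stand-ins; in the notebook the set would be `STOPWORDS` plus `words_to_remove`.

```python
from collections import Counter


def top_words(text, stop_words, n=20):
    """Return the n most frequent words that are not stop words."""
    tokens = text.lower().split()
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(n)


# Small stand-in for the combined stop-word set (illustration only)
stop_words = {"the", "of", "and", "information", "use", "give", "state", "provide"}
text = "provide the current address of spouse and the address of child"
result = top_words(text, stop_words, n=3)
print(result)
```

Filtering before counting keeps the ranking consistent with what the cloud displays, since both then exclude the same words.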

MPI Word Cloud

Code
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud, STOPWORDS

mpi = pd.read_csv("./data/MPI_cleaned.csv")
txts = " ".join(list(mpi["text"]))

words_to_remove = ["immigrant","bear","percent","mexican","canadian","korean","canadians","united","states","approximately","figure","rate","colombian","mexicans"]

# Update the stopwords with the additional words to be removed
stopwords = set(STOPWORDS)
stopwords.update(words_to_remove)

wordcloud = WordCloud(
    width = 3000,
    height = 2000, 
    random_state=1, 
    background_color='salmon', 
    colormap='Pastel1', 
    collocations=False,
    stopwords = stopwords).generate(txts)
plt.figure(figsize=(30, 20))
plt.imshow(wordcloud) 
plt.axis("off")
plt.show()

print("Top 20 most used words")

# Count words, skipping the custom words that were also removed from the cloud
words = txts.split()
word_count = Counter(word for word in words if word not in words_to_remove)
word_count.most_common(20)

Top 20 most used words
[('population', 44),
 ('compare', 34),
 ('overall', 28),
 ('foreign', 26),
 ('high', 24),
 ('age', 23),
 ('likely', 19),
 ('native', 19),
 ('english', 17),
 ('year', 17),
 ('old', 16),
 ('unauthorized', 12),
 ('daca', 12),
 ('education', 11),
 ('student', 11),
 ('country', 11),
 ('adult', 10),
 ('international', 10),
 ('citizen', 10),
 ('total', 10)]

“age” (23 occurrences), “english” (17), and “education” (11) are fairly common words in the reports.
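One quick way to relate the two text sources is to check which words appear in both top-20 lists. The sketch below copies the words from the two outputs above (counts dropped) and keeps the ones they share; the variable names are only illustrative.

```python
# Words copied from the N-400 top-20 output above
n400_top = [
    "address", "spouse", "current", "foreign", "date", "country", "child",
    "person", "application", "applicant", "signature", "year", "citizen",
    "applicable", "complete", "item", "birth", "group", "prior", "family",
]

# Words copied from the MPI top-20 output above
mpi_top = [
    "population", "compare", "overall", "foreign", "high", "age", "likely",
    "native", "english", "year", "old", "unauthorized", "daca", "education",
    "student", "country", "adult", "international", "citizen", "total",
]

# Keep the N-400 ordering so the result is deterministic
mpi_set = set(mpi_top)
shared = [word for word in n400_top if word in mpi_set]
print(shared)  # ['foreign', 'country', 'year', 'citizen']
```

Only four words overlap, which suggests the two sources emphasize different vocabulary: the form focuses on household details, while the reports focus on population characteristics.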
