Data Cleaning

Data cleaning is an essential step in any data science project. This section walks through the cleaning process and explains the decisions made to integrate the datasets. Because the data come from several sources, they must be aligned with one another, and doing so improves data quality and makes the later analysis smoother.


US Census Bureau

In the raw Census data, the POBP (place of birth) column is coded as numbers rather than place names. These codes are converted to place names using the variable’s metadata so that they can be used in later analysis. To integrate the data, “United Kingdom, Not Specified”, “Scotland”, and “England” are merged into “United Kingdom”, and all U.S. states and territories are labeled “United States”. Next, a SUCCESS column is created from the WAGP (wage) and SCHL (educational attainment) values: a person who earns more than $100,000 or holds a bachelor’s degree or higher is considered “successful”. Finally, non-citizens are removed from the dataset because they are outside the scope of the project.

Code
import requests
import json
import pandas as pd
import numpy as np

meta = requests.get("https://api.census.gov/data/2021/acs/acs1/pums/variables/POBP.json")
meta = meta.json()
# Get recoded values for POBP column
codes = meta["values"]["item"]
codes = {int(key): value for key, value in codes.items()}
acs_raw = pd.read_csv("./data/acs_raw.csv")

# Change country codes to names using the metadata mapping
pobp = acs_raw["POBP"]
names = pobp.map(codes)
# Combine "United Kingdom, Not Specified", "Scotland", "England" into one value
names = names.mask(pobp.isin([138, 139, 140]), "United Kingdom")
# Change all the U.S. states and territories (codes below 100) to "United States"
names = names.mask(pobp < 100, "United States")
acs_raw["POBP"] = names

# Create success variable: wages above 100000 or SCHL of 21 (bachelor's degree) or higher
condition1 = acs_raw["WAGP"] > 100000
condition2 = acs_raw["SCHL"] >= 21
acs_raw["SUCCESS"] = (condition1 | condition2).astype(int)

# Exclude non-citizens
acs_raw = acs_raw[acs_raw["CIT"] != 5]
acs_raw = acs_raw.drop(["CIT"],axis=1)
acs_raw.head()

# acs_raw.to_csv("./data/acs_cleaned.csv", index = False)

   NATIVITY           POBP  DECADE  ENG  MAR  RAC1P  SEX  ESR  WAGP  SCHL  AGEP  SUCCESS
0         1  United States       0    0    1      1    1    6     0    11    36        0
1         1  United States       0    0    5      1    1    6     0    22    57        1
2         1  United States       0    0    5      5    1    6     0    14    29        0
3         1  United States       0    0    5      1    1    6     0     1    26        0
4         1  United States       0    0    2      1    2    6     0    21    80        1
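
The recoding and success rules described above can also be restated as two small pure functions, which makes them easy to check on hand-made records. This is only a minimal sketch, not the code used in the project: `recode_pobp`, `is_successful`, and the two-entry `toy_codes` dictionary are hypothetical names introduced here, and the toy dictionary stands in for the full mapping returned by the Census API.

```python
# Illustrative sketch only: the function names and toy_codes are assumptions,
# standing in for the full POBP mapping downloaded from the Census API.
UK_CODES = {138, 139, 140}  # the three UK codes merged by the cleaning code


def recode_pobp(code, codes):
    """Return the place-of-birth label for one numeric POBP code."""
    if code in UK_CODES:
        return "United Kingdom"
    if code < 100:  # U.S. states and territories
        return "United States"
    return codes[code]  # every other code keeps its metadata label


def is_successful(wagp, schl):
    """Return 1 if wages exceed 100000 or SCHL is 21 or higher, else 0."""
    return int(wagp > 100000 or schl >= 21)


toy_codes = {139: "England", 303: "Mexico"}  # hypothetical subset of the metadata

print(recode_pobp(6, toy_codes))    # United States
print(recode_pobp(139, toy_codes))  # United Kingdom
print(recode_pobp(303, toy_codes))  # Mexico
print(is_successful(0, 22))         # 1
print(is_successful(0, 11))         # 0
```

The last two calls mirror rows 1 and 0 of the output above: both people report no wages, but only the one with SCHL 22 is flagged as successful.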

World Bank and OECD

Because the average wage, employment rate, and educational attainment datasets come from different sources, they need to cover the same set of countries. The wage dataset contains the fewest countries, so the other two datasets are subset to those countries using their country codes, which leaves 38 countries in total. The country codes are then replaced with country names. In particular, “Korea, Rep.”, “Turkiye”, “Czechia”, and “Slovak Republic” are renamed to “Korea”, “Turkey”, “Czech Republic”, and “Slovakia” so that they match the ACS data.
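
The subsetting and renaming logic can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the project's code (which is written in R): the tiny dictionaries are made-up stand-ins for the real datasets, and `harmonize` is a hypothetical helper name.

```python
# Illustrative sketch only: the small tables below are made-up stand-ins for
# the wage and employment datasets, keyed by ISO3 country code.
wage = {"KOR": 1.0, "TUR": 2.0, "AUS": 3.0}  # dataset with the fewest countries
employment = {"KOR": 60.0, "TUR": 45.0, "AUS": 62.0, "BRA": 55.0}
code_to_name = {
    "KOR": "Korea, Rep.",
    "TUR": "Turkiye",
    "AUS": "Australia",
    "BRA": "Brazil",
}

# Names that differ from the ACS labels, as listed in the text
RENAME = {
    "Korea, Rep.": "Korea",
    "Turkiye": "Turkey",
    "Czechia": "Czech Republic",
    "Slovak Republic": "Slovakia",
}


def harmonize(table, reference, names):
    """Keep only countries found in `reference` and key rows by ACS-style name."""
    result = {}
    for code, value in table.items():
        if code in reference:
            name = names[code]
            result[RENAME.get(name, name)] = value
    return result


employment_clean = harmonize(employment, wage, code_to_name)
print(sorted(employment_clean))  # ['Australia', 'Korea', 'Turkey']
```

Brazil is dropped because it has no entry in the toy wage table, and the two renamed countries now carry the labels used in the ACS data.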

Code
employment_raw = read.csv("./data/employment_raw.csv")
wage_raw = read.csv("./data/wage_raw.csv")
education_raw = read.csv("./data/education_raw.csv")

# Subsetting dataset on common countries
education_raw = education_raw[education_raw$iso3c %in% wage_raw$iso3c,]
employment_raw = employment_raw[employment_raw$iso3c %in% wage_raw$iso3c,]

# Putting country names by country codes
country_code = employment_raw[,c(1,2)]
colnames(country_code) = c("iso3c","country")
education_raw = merge(education_raw, country_code, by = "iso3c", all = TRUE)
wage_raw = merge(wage_raw, country_code, by = "iso3c", all = TRUE)

# Change the names of certain countries so that they match the acs dataset
rename_countries = function(df) {
    new_names = c("Korea, Rep." = "Korea", "Turkiye" = "Turkey",
                  "Czechia" = "Czech Republic", "Slovak Republic" = "Slovakia")
    to_change = df$country %in% names(new_names)
    df$country[to_change] = unname(new_names[df$country[to_change]])
    df
}

employment_raw = rename_countries(employment_raw)
wage_raw = rename_countries(wage_raw)
education_raw = rename_countries(education_raw)

# Delete country code column
employment_raw = employment_raw[,-1]
wage_raw = wage_raw[,-1]
education_raw = education_raw[,-1]

# Rearrange the column order
employment_raw = employment_raw[,c(tail(sort(colnames(employment_raw)), 1), head(sort(colnames(employment_raw)), -1))]
wage_raw = wage_raw[,c(tail(sort(colnames(wage_raw)), 1), head(sort(colnames(wage_raw)), -1))]
education_raw = education_raw[,c(tail(sort(colnames(education_raw)), 1), head(sort(colnames(education_raw)), -1))]

# write.csv(wage_raw, file = "./data/wage_cleaned.csv", row.names = FALSE)
# write.csv(employment_raw, file = "./data/employment_cleaned.csv", row.names = FALSE)
# write.csv(education_raw, file = "./data/education_cleaned.csv", row.names = FALSE)

Employment rate

A data.frame: 6 × 32
   country    X1991  X1992  X1993  X1994  X1995  X1996  X1997  X1998  X1999 ...  X2012  X2013  X2014  X2015  X2016  X2017  X2018  X2019  X2020  X2021
   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> ...  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
11 Australia 57.091 56.115 55.662 56.813 58.158 58.114 57.843 58.162 58.448 ... 61.749 61.264 60.766 61.068 61.166 61.527 62.152 62.547 60.829 62.478
12 Austria   53.930 54.558 54.281 56.969 56.841 55.714 55.402 55.597 56.198 ... 57.801 57.667 57.279 57.247 57.531 57.843 58.394 58.605 57.513 57.380
19 Belgium   45.831 46.305 45.712 45.473 45.810 45.631 45.998 46.111 47.420 ... 49.235 49.045 48.951 48.789 48.951 50.023 50.966 51.475 50.833 51.079
36 Canada    59.709 58.342 57.904 58.383 58.669 58.449 58.967 59.737 60.563 ... 61.652 61.761 61.430 61.290 61.110 61.591 61.597 61.991 57.964 60.216
41 Chile     50.410 51.885 52.344 51.957 53.392 52.434 52.577 52.602 50.328 ... 55.783 56.074 56.019 56.012 55.645 55.774 55.563 55.286 49.211 51.788
43 Colombia  59.965 60.616 62.045 62.113 62.111 60.033 59.962 57.909 53.953 ... 62.649 62.783 63.224 63.713 63.068 62.779 62.160 60.759 53.270 55.388

Annual average wage

A data.frame: 6 × 32
  country        X1991    X1992    X1993    X1994 X1995    X1996    X1997    X1998    X1999 ... X2012 X2013 X2014 X2015 X2016 X2017 X2018 X2019 X2020    X2021
  <chr>          <dbl>    <dbl>    <dbl>    <dbl> <int>    <dbl>    <dbl>    <dbl>    <dbl> ... <int> <int> <int> <int> <int> <int> <int> <int> <int>    <dbl>
1 Australia   42309.38 43173.77 43578.24 43873.86 43715 44976.28 46355.56 46999.64 48064.87 ... 57752 57579 58102 57744 57885 57619 57843 58620 60377 60681.50
2 Austria     52697.40 53759.06 54197.37 54743.42 55184 54819.62 54346.95 56101.14 57354.90 ... 62515 62568 62801 63231 63860 63856 64101 64623 64648 65402.32
3 Belgium     53018.64 54718.73 55868.83 56952.78 56759 57470.56 58203.38 58238.28 61484.41 ... 64461 65099 65461 65017 65157 64700 65083 65700 63677 65520.82
4 Canada      42426.30 43045.90 42952.72 42464.18 42369 42745.09 44053.77 44793.21 45157.23 ... 53717 54286 54995 55400 54350 55122 56083 56370 59160 59568.78
5 Switzerland 56464.63 56948.80 57572.46 58283.21 58316 57610.92 58645.79 58814.06 59892.08 ... 69383 70280 70412 70797 70468 70087 69892 71189 69728 72358.44
6 Chile             NA       NA       NA       NA 17383 17383.36 18600.43 19515.21 20533.36 ... 29661 30530 30657 30615 31834 30847 32348 33190 31369 33042.33

Adult education attainment rate

A data.frame: 6 × 33
  country   SUBJECT    X1991    X1992    X1993    X1994    X1995    X1996    X1997    X1998 ...    X2012    X2013    X2014    X2015    X2016    X2017    X2018    X2019    X2020    X2021
  <chr>     <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl> ...    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Australia BUPPSRY 44.12706       NA 47.15905 49.80203 44.93585       NA 46.69527 43.95596 ... 23.56339 24.28443 22.89584 20.98090 20.06735 19.01266 18.10864 17.13266 16.24498 15.46888
2 Australia TRY     31.15881       NA 22.46919 23.07237 24.31545       NA 24.30127 25.42071 ... 41.28236 39.53993 41.90185 42.88876 43.74390 45.35567 45.72748 47.12998 49.33745 49.76787
3 Australia UPPSRY  24.71413       NA 30.37176 27.12561 30.74870       NA 29.00345 30.62333 ... 35.15425 36.17564 35.20231 36.13035 36.18875 35.63167 36.16388 35.73737 34.41757 34.76325
4 Austria   BUPPSRY       NA       NA       NA       NA       NA       NA       NA       NA ... 17.08353 17.02800 16.14400 15.35130 15.47159 15.03758 14.70253 14.43586 14.34133 14.06036
5 Austria   TRY           NA       NA       NA       NA       NA       NA       NA       NA ... 28.73994 29.73884 29.90490 30.55073 31.38396 32.39439 32.71143 33.77378 34.20589 34.60450
6 Austria   UPPSRY        NA       NA       NA       NA       NA       NA       NA       NA ... 54.17653 53.23315 53.95110 54.09797 53.14444 52.56804 52.58605 51.79036 51.45277 51.33514

USCIS

The N-400 form is provided in PDF format. The text of every page is extracted and concatenated into a single string. This string is tokenized into words; non-alphabetic tokens, tokens shorter than three characters, and stop words are dropped, and the remaining words are lemmatized and lowercased. Words that carry no meaning for the analysis, such as “uscis”, “answer”, or “page”, are then removed. Finally, the cleaned words are joined back into a single string.

Code
import fitz
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")

# Make a function that cleans a pdf file
def clean_pdf(file):
    pdf_document = fitz.open(file)
    pdf_txt = ""
    # Loop through each page in the document and compile all text into one string
    for page_number in range(pdf_document.page_count):
        page_txt = pdf_document[page_number].get_text()
        pdf_txt = pdf_txt + page_txt
    # Words that are specific to the form and not useful for the analysis
    unnecessary_words = ["yes","uscis","answer","number","yyyy","form","edition","page","united","states","code"]
    cleaned_txt = ""
    doc = nlp(pdf_txt)
    # Loop through each word in the pdf string
    for token in doc:
        # Keep alphabetic tokens with at least 3 characters that are not stop words
        if token.is_alpha and len(token) > 2 and not token.is_stop:
            word = token.lemma_.lower() # Lemmatize the word and change to lower case
            if word not in unnecessary_words: # Remove some unnecessary words
                cleaned_txt = cleaned_txt + word + " "
    return cleaned_txt

n_400 = clean_pdf("./data/n-400.pdf")
n_400

# with open("./data/cleaned_n_400.txt", "w") as file:
#     file.write(n_400)
'use explain apply basis qualify military service lawful permanent resident year addition marry live citizen spouse year spouse citizen year time file lawful permanent resident year year age information eligibility select box delay enter digit date stamp remarks current legal provide nickname family give middle applicable information person apply naturalization family give exactly appear permanent resident card applicable lawful permanent resident spouse citizen citizen spouse regularly engage specify employment abroad immigration nationality act ina section residential address outside file section select field office list like naturalization interview middle applicable start type print black ink type print item applicable indicate failure question delay citizenship immigration services process note complete part action block receipt application naturalization department homeland security citizenship immigration services omb expire biological legal adoptive mother father citizen birth naturalize reach birthday citizen consider file application visit website information topic review instruction application certificate citizenship application citizenship issuance certificate section note parent citizen complete information parent application skip biographic information information person apply naturalization continue middle applicable give family type print new like use space provide like legally change read instructions decide like legally change change optional date lawful permanent resident social security applicable country citizenship nationality country birth date birth online account gender male female physical developmental disability mental impairment prevent demonstrate knowledge understanding english language civic requirement naturalization submit complete medical certification disability exceptions file exemption english language test year age old live lawful permanent resident period total year time file year age old live lawful permanent resident 
period total year time file year age old live lawful permanent resident period total year time file meet requirement give simplified version civic test accommodation individual disabilities impairments deaf hard hear request following accommodation request sign language interpreter indicate language example american sign language select applicable box request accommodation disability impairment blind low vision request following accommodation note read information instructions complete middle applicable give family names birth include nickname alias maiden applicable information contact information residence province region foreign address city town street zip state county flr ste apt country foreign address live year provide recent residence list location live year need extra space use additional sheet paper current physical address postal foreign address daytime telephone work telephone evening telephone mobile telephone email address city town state zip street care flr ste apt postal foreign address province region foreign address country foreign address current mailing address different address date residence county type disability impairment example use wheelchair describe nature disability impairment accommodation request accommodation individual disabilities impairment continued usps zip lookup present information residence continued street flr ste apt city town county state zip postal foreign address province region foreign address physical address country foreign address street flr ste apt city town county state zip postal foreign address province region foreign address physical address country foreign address street date residence flr ste apt city town county state zip postal foreign address province region foreign address physical address country foreign address information parent mother citizen complete follow information item information mother parent marry birthday date residence date residence parent citizen skip mother country birth mother date 
birth date mother citizen know mother information parents continued middle applicable give family father country birth father date birth complete information date father citizen know father information father father citizen current legal citizen father american indian alaska native note require complete category conduct background check instructions information biographic information height feet inches race select applicable box native hawaiian pacific islander black african american asian white hispanic latino hispanic latino ethnicity select box weight pounds brown blue green hazel gray black pink maroon eye color select box bald hair sandy red white gray blond brown black hair color select box current legal citizen mother middle applicable give family list work attend school time time year provide information complete time period include military police intelligence service begin provide information recent current employment study unemployment applicable provide location date work self employ unemployed study year work type print self employ unemployed type print unemployed need extra space use additional sheet paper city town state zip flr apt ste street employer school occupation date date employer school city town state zip street flr ste apt postal foreign address province region foreign address postal foreign address province region foreign address information employment schools attend country foreign address occupation date date country foreign address employer school street flr ste apt zip state city town occupation date date postal foreign address province region foreign address country foreign address list trip hour long take outside year start recent trip work backwards need extra space use additional sheet paper date leave date return trip month countries travel total day outside trip hour long take outside year trip total day hour long spend outside year day time outside current marital status time marry include annulled marriage marriage people 
marriage person marry provide follow information current spouse current spouse legal divorced single marry widow marry marriage annul middle applicable give family separate marry spouse current member armed force current spouse previous legal middle applicable give family names current spouse include nickname alias maiden applicable family give middle applicable information marital history current spouse date birth date enter marriage current spouse single marry information marital history continued street province region foreign address postal foreign address zip state county city town current spouse present home address current spouse current employer company flr ste apt current spouse citizen item item country foreign address date current spouse citizen current spouse citizen complete follow information current spouse country citizenship nationality current spouse current spouse immigration status explain lawful permanent resident current spouse citizen current spouse citizen complete follow information birth complete following information item middle applicable give family time current spouse marry include annulled marriage marriage people marriage person current spouse marry provide follow information current spouse prior spouse legal current spouse prior spouse immigration status current spouse prior spouse know citizen lawful permanent resident explain date birth current spouse prior spouse country birth current spouse prior spouse country citizenship nationality current spouse prior spouse current spouse previous marriage provide information additional sheet paper information marital history continued current spouse date marriage prior spouse date current spouse marriage end prior spouse current spouse marriage end prior spouse annul divorced spouse deceased explain marry provide follow information prior spouse previous marriage provide information additional sheet paper explain spouse deceased divorced annul prior spouse immigration status marriage end 
know citizen lawful permanent resident explain middle applicable give family prior spouse date birth date marriage prior spouse date marriage end prior spouse marriage end prior spouse prior spouse country birth prior spouse country citizenship nationality prior spouse legal indicate total child indicate child include child alive miss deceased child bear country child year age old child currently married unmarried child live current stepchild legally adopt child child bear marry information child provide follow information child son daughter list item regardless age list additional child use additional sheet paper middle applicable give family child current legal date birth country birth information child continued child middle applicable give family date birth current address street province region foreign address postal foreign address zip state county city town child relationship example biological child stepchild legally adopt child flr ste apt current address street province region foreign address postal foreign address zip state county city town flr ste apt child relationship example biological child stepchild legally adopt child country foreign address country foreign address current legal current legal middle applicable give family country birth child country birth date birth information child continued current address street province region foreign address postal foreign address zip state county city town flr ste apt child relationship example biological child stepchild legally adopt child current legal middle applicable give family country birth current address street province region foreign address postal foreign address zip state county city town flr ste apt date birth child relationship example biological child stepchild legally adopt child country foreign address country foreign address child additional information person apply naturalization register vote federal state local election claim citizen writing way item numbers question include type print 
explanation additional sheet paper vote federal state local election hereditary title order nobility foreign country willing inherit title order nobility foreign country naturalization ceremony declare legally incompetent confine mental institution additional information person apply naturalization continue owe overdue federal state local taxis file federal state local tax return lawful permanent resident consider non resident call non resident federal state local tax return lawful permanent resident member involve way associate organization association fund foundation party club society similar group location world provide information need extra space attach name group additional sheet paper provide evidence support group purpose group date membership member way associate directly indirectly communist party totalitarian party advocate directly indirectly overthrow government force violence march work associate way directly indirectly nazi government germany government area occupy ally establish help nazi government germany persecute directly indirectly person race religion national origin membership particular social group political opinion terrorist organization german nazi military unit paramilitary unit self defense unit vigilante unit citizen unit police unit government agency office extermination camp concentration camp prisoner war camp prison labor camp transit camp additional information person apply naturalization continue involve way following genocide torture killing try kill badly hurt try hurt person purpose forcing try force kind sexual contact relation let practice religion member serve help participate follow group military unit paramilitary unit group people act like military group official military police unit self defense unit vigilante unit group people act like police official police rebel group guerrilla group group people use weapon physically attack military police government people militia army people official military insurgent 
organization group use weapon fight government worker volunteer soldier serve following labor camp place people force work detention facility place people force stay prison camp prison jail place people force stay group help group use weapon person group help group unit organization weapon person threaten group help group tell person use weapon person sell provide weapon person help person sell provide weapon person know person go use weapon person know person go sell weapon go use person convict crime offense charge commit attempt commit assist commit crime offense arrest cite detain law enforcement officer include immigration official official armed force reason commit assist commit attempt commit crime offense arrest item numbers apply record seal expunged clear disclose information include judge law enforcement officer attorney tell long constitute record tell disclose information additional information person apply naturalization continue complete probation parole receive suspend sentence place probation parole jail prison long jail prison year month day question item numbers complete table need extra space use additional sheet paper provide evidence support arrest cite detain charge date arrest cite detain charge arrest cite detain charge city town state country outcome disposition arrest citation detention charge charge file charge dismiss jail probation etc place alternative sentencing rehabilitative program example diversion deferred prosecution withhold adjudication deferred adjudication question item numbers skip item item receive type military paramilitary group people act like military group official military weapon training recruit ask enlist sign conscript require use person year age serve help armed force group use person year age help support people combat item numbers question item numbers include type print explanation additional sheet paper provide evidence support habitual drunkard prostitute procure prostitution sell smuggle control substance 
illegal drug narcotic married person time helped enter try enter illegally gamble illegally receive income illegal gambling marry order obtain immigration benefit additional information person apply naturalization continue give government official information documentation false fraudulent misleading lie government official gain entry admission gain immigration benefit remove exclude deport order remove exclude deport removal exclusion rescission deportation proceeding include administratively close proceeding currently pende place removal exclusion rescission deportation proceeding serve armed force currently member armed force fail support dependent pay alimony misrepresentation obtain public benefit schedule deploy overseas include vessel month refer address change section instruction notify learn deployment plan file currently station overseas court martiale administratively separate discipline receive honorable discharge armed force discharge training service armed force alien leave avoid draft armed force apply kind exemption military service armed force desert armed force male live time birthday include live lawful nonimmigrant selective service date register register selective service provide information additional information person apply naturalization continue item numbers question include type print explanation additional sheet paper provide evidence support support constitution government understand oath allegiance law require willing bear arm behalf law require willing perform noncombatant service armed force willing oath allegiance law require willing perform work national importance civilian direction register selective service system year age register apply naturalization complete selective service information year age year age file ina section register selective service attach statement explain register provide status information letter selective service applicant statement certification signature note read penalties section instructions complete 
read understand english read understand question instruction application question applicant statement interpreter question interpreter name read question instruction application language fluent understand note select box item item applicable select box item applicant statement applicant statement preparer request preparer name prepare application base information provide authorize applicant statement certification signature continued applicant certification understand require appear appointment biometric fingerprint photograph signature time require sign oath reaffirming applicant signature applicant signature date signature note applicant completely fill application fail submit require document list instruction deny application interpreter business organization city town state postal province street apt flr ste country interpreter mailing address zip interpreter give interpreter family provide follow information interpreter interpreter interpreter contact information certification signature copies document submit exact photocopy unaltered original document understand require submit original document later date furthermore authorize release information record need determine eligibility immigration benefit seek authorize release information contain application support document record entity person necessary administration enforcement immigration law certify penalty perjury provide authorize information application understand information contain submit application information complete true correct review provide authorize information application understand information contain submit application information complete true correct time filing interpreter contact information certification signature continue interpreter certification certify penalty perjury fluent english language specify item item read applicant identify language question instruction application question applicant inform understand instruction question application include applicant certification verify 
accuracy date signature interpreter signature interpreter signature contact information declaration signature person prepare application applicant provide follow information preparer preparer preparer give preparer family preparer business organization street apt flr ste city town state postal province country preparer mailing address zip interpreter contact information interpreter daytime telephone interpreter email address interpreter mobile telephone contact information declaration signature person prepare application applicant continued preparer contact information preparer daytime telephone preparer mobile telephone preparer email address attorney accredit representative prepare application behalf applicant applicant consent attorney accredit representative representation applicant case extend preparation application preparer statement extend note attorney accredit representative representation extend preparation application oblige submit complete notice entry appearance attorney accredited representative application preparer certification signature certify penalty perjury prepare application request applicant applicant review complete application inform understand information contain submit application include applicant certification information complete true correct complete application base information applicant provide authorize obtain use preparer signature date signature preparer signature note complete parts officer instruct interview swear affirm certify penalty perjury law america know content application naturalization subscribe include correction complete true correct evidence submit complete true correct signature interview subscribe swear affirm officer printed stamp date signature applicant signature officer signature item item affirm following officer renounce title heretofore hold renounce order nobility heretofore belong applicant signature applicant printed officer signature officer printed list title list order nobility renunciation foreign 
titles date signature obligation freely mental reservation purpose evasion help god perform work national importance civilian direction require law perform noncombatant service armed force require law applicant signature applicant printed middle applicable give family date signature bear arm behalf require law bear true faith allegiance support defend constitution law america enemy foreign domestic declare oath absolutely entirely renounce abjure allegiance fidelity foreign prince potentate state sovereignty heretofore subject citizen application approve schedule public oath ceremony time require follow oath allegiance immediately prior naturalized citizen sign acknowledge willingness ability oath oath allegiance '
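
One plausible way to decide which words belong on the removal list is to look at the most frequent tokens in the cleaned string. The source does not say how the list was chosen, so the sketch below is an assumption; `top_words` and the sample string are hypothetical and only mimic the format of the output above.

```python
from collections import Counter


def top_words(cleaned_txt, n=5):
    """Return the n most frequent words in a space-separated cleaned string."""
    return Counter(cleaned_txt.split()).most_common(n)


# Hypothetical sample in the same format as the output of clean_pdf
sample = "current spouse citizen current spouse date birth current address "
print(top_words(sample, 2))  # [('current', 3), ('spouse', 2)]
```

Running `top_words(n_400, 20)` on the real string would list candidates for removal, which could then be judged by hand.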

Migration Policy Institute

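As with the N-400 form, the text data from MPI are cleaned by dropping non-alphabetic tokens, stop words, and words shorter than three characters, and by lemmatizing and lowercasing the remaining words.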

Code
mpi_raw = pd.read_csv("./data/MPI_raw.csv")
cleaned_txt = []
for sentence in mpi_raw["text"]:
    doc = nlp(sentence)
    cleaned_sentence = ""
    for token in doc:
        # Keep alphabetic tokens with at least 3 characters that are not stop words
        if token.is_alpha and len(token) > 2 and not token.is_stop:
            # Lemmatize the word and change to lower case
            cleaned_sentence = cleaned_sentence + token.lemma_.lower() + " "
    cleaned_txt.append(cleaned_sentence)

mpi_raw["text"] = np.array(cleaned_txt)
mpi_raw.head()

# mpi_raw.to_csv("./data/mpi_cleaned.csv", index = False)

   text                                               label
0  mexican immigrant likely proficient english ov...  Mexico
1  percent mexicans age report limited english pr...  Mexico
2  approximately percent mexican immigrant speak ...  Mexico
3  note limited english proficient lep status ref...  Mexico
4  median age year old compare immigrant native b...  Mexico
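
Because the filter drops stop words, short words, and non-alphabetic tokens, a very short sentence could in principle end up with no words at all. The sketch below shows one way to spot such rows before saving; it is an optional check that is not part of the original workflow, and `empty_rows` and the sample list are hypothetical.

```python
def empty_rows(texts):
    """Return the positions of cleaned texts that contain no words."""
    return [i for i, text in enumerate(texts) if not text.strip()]


# Hypothetical cleaned sentences; the second one lost every token
sample = ["mexican immigrant likely proficient english ", "", "median age year old "]
print(empty_rows(sample))  # [1]
```

On the real data the same check could be applied to `mpi_raw["text"]`, and any positions it returns could be inspected or dropped.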