Data cleaning is an essential step of any data science project. This section walks through the cleaning process and the reasoning behind it, with a focus on integration. Because the data come from several sources, they must be integrated before analysis; doing so improves data quality and makes the later analysis smoother.
US Census Bureau
In the raw Census data, the POBP (place of birth) column is coded as numbers rather than place names. These codes are converted to place names using the variable’s metadata so that they can be used in later analysis. For integration, “United Kingdom, Not Specified”, “Scotland”, and “England” are merged into “United Kingdom”, and every code below 100 (U.S. states and territories) is labeled “United States”. Next, a SUCCESS column is created from the WAGP and SCHL values: a person who earns more than 100,000 or holds a Bachelor’s degree or higher (SCHL of 21 or above) is considered “successful”. Finally, non-citizens are removed from the dataset because they are outside the scope of the project.
Code
import requests
import json
import pandas as pd
import numpy as np

meta = requests.get("https://api.census.gov/data/2021/acs/acs1/pums/variables/POBP.json")
meta = meta.json()

# Get recoded values for POBP column
codes = meta["values"]["item"]
codes = {int(key): value for key, value in codes.items()}

acs_raw = pd.read_csv("./data/acs_raw.csv")

# Change country codes to names
for i in range(len(acs_raw)):
    # Combine "United Kingdom, Not Specified", "Scotland", "England" into one value
    if acs_raw.loc[i, "POBP"] in [138, 139, 140]:
        acs_raw.loc[i, "POBP"] = "United Kingdom"
    # Change all the U.S. territories to "United States"
    elif acs_raw.loc[i, "POBP"] < 100:
        acs_raw.loc[i, "POBP"] = "United States"
    else:
        acs_raw.loc[i, "POBP"] = codes[acs_raw.loc[i, "POBP"]]

# Create success variable
condition1 = acs_raw["WAGP"] > 100000
condition2 = acs_raw["SCHL"] >= 21
acs_raw["SUCCESS"] = np.select([condition1, condition2], [1, 1], default=0)

# Exclude non-citizens
acs_raw = acs_raw[acs_raw["CIT"] != 5]
acs_raw = acs_raw.drop(["CIT"], axis=1)

acs_raw.head()
# acs_raw.to_csv("./data/acs_cleaned.csv", index = False)
   NATIVITY  POBP           DECADE  ENG  MAR  RAC1P  SEX  ESR  WAGP  SCHL  AGEP  SUCCESS
0  1         United States  0       0    1    1      1    6    0     11    36    0
1  1         United States  0       0    5    1      1    6    0     22    57    1
2  1         United States  0       0    5    5      1    6    0     14    29    0
3  1         United States  0       0    5    1      1    6    0     1     26    0
4  1         United States  0       0    2    1      2    6    0     21    80    1
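The row-by-row recoding described in this subsection can also be written with vectorized pandas operations. The following is a minimal sketch rather than the code used in the project: the `codes` dictionary is a small toy stand-in for the POBP metadata, the toy DataFrame is invented for illustration, and the helper names `recode_pobp` and `add_success` are hypothetical. One difference in behavior is that a code missing from the metadata becomes a missing value here instead of raising a `KeyError`.

```python
import pandas as pd

# Toy stand-in for the POBP metadata; the real mapping comes from the Census API.
codes = {
    36: "New York/NY",
    72: "Puerto Rico",
    138: "United Kingdom, Not Specified",
    139: "England",
    140: "Scotland",
    303: "Mexico",
}


def recode_pobp(pobp, codes):
    """Map numeric POBP codes to place names (hypothetical helper)."""
    names = pobp.map(codes)  # unknown codes become NaN
    # Merge the three United Kingdom codes into one label
    names = names.mask(pobp.isin([138, 139, 140]), "United Kingdom")
    # Every code below 100 is a U.S. state or territory
    names = names.mask(pobp < 100, "United States")
    return names


def add_success(df):
    """Flag rows with wage above 100000 or SCHL of 21 and above (hypothetical helper)."""
    successful = (df["WAGP"] > 100000) | (df["SCHL"] >= 21)
    return df.assign(SUCCESS=successful.astype(int))


# Invented rows, only to exercise the helpers
toy = pd.DataFrame({
    "POBP": [36, 72, 139, 303],
    "WAGP": [0, 120000, 50000, 0],
    "SCHL": [11, 16, 21, 22],
    "CIT": [1, 2, 4, 5],
})
toy["POBP"] = recode_pobp(toy["POBP"], codes)
toy = add_success(toy)
# Exclude non-citizens (CIT == 5) and drop the helper column
toy = toy[toy["CIT"] != 5].drop(columns="CIT")
print(toy)
```

Combining the two conditions with `|` gives the same 0/1 flag as `np.select` with two branches that both return 1, and it makes the "either condition" rule explicit.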
World Bank and OECD
Because the average wage, employment rate, and educational attainment datasets come from different sources, they need to cover the same set of countries. The wage dataset contains the fewest countries, so the other datasets are subset to those countries using country codes, which leaves 38 countries in total. The country codes are then replaced with country names. In particular, “Korea, Rep.”, “Turkiye”, “Czechia”, and “Slovak Republic” are renamed to “Korea”, “Turkey”, “Czech Republic”, and “Slovakia” so that they match the ACS data.
Code
employment_raw = read.csv("./data/employment_raw.csv")
wage_raw = read.csv("./data/wage_raw.csv")
education_raw = read.csv("./data/education_raw.csv")

# Subsetting dataset on common countries
education_raw = education_raw[education_raw$iso3c %in% wage_raw$iso3c,]
employment_raw = employment_raw[employment_raw$iso3c %in% wage_raw$iso3c,]

# Putting country names by country codes
country_code = employment_raw[,c(1,2)]
colnames(country_code) = c("iso3c","country")
education_raw = merge(education_raw, country_code, by = "iso3c", all = TRUE)
wage_raw = merge(wage_raw, country_code, by = "iso3c", all = TRUE)

# Change name of certain countries so that it can be same in acs dataset
for (i in 1:length(employment_raw$country)) {
  if (employment_raw$country[i] == "Korea, Rep.") { employment_raw$country[i] = "Korea" }
  if (employment_raw$country[i] == "Turkiye") { employment_raw$country[i] = "Turkey" }
  if (employment_raw$country[i] == "Czechia") { employment_raw$country[i] = "Czech Republic" }
  if (employment_raw$country[i] == "Slovak Republic") { employment_raw$country[i] = "Slovakia" }
}
for (i in 1:length(wage_raw$country)) {
  if (wage_raw$country[i] == "Korea, Rep.") { wage_raw$country[i] = "Korea" }
  if (wage_raw$country[i] == "Turkiye") { wage_raw$country[i] = "Turkey" }
  if (wage_raw$country[i] == "Czechia") { wage_raw$country[i] = "Czech Republic" }
  if (wage_raw$country[i] == "Slovak Republic") { wage_raw$country[i] = "Slovakia" }
}
for (i in 1:length(education_raw$country)) {
  if (education_raw$country[i] == "Korea, Rep.") { education_raw$country[i] = "Korea" }
  if (education_raw$country[i] == "Turkiye") { education_raw$country[i] = "Turkey" }
  if (education_raw$country[i] == "Czechia") { education_raw$country[i] = "Czech Republic" }
  if (education_raw$country[i] == "Slovak Republic") { education_raw$country[i] = "Slovakia" }
}

# Delete country code column
employment_raw = employment_raw[,-1]
wage_raw = wage_raw[,-1]
education_raw = education_raw[,-1]

# Rearrange the column order
employment_raw = employment_raw[,c(tail(sort(colnames(employment_raw)), 1), head(sort(colnames(employment_raw)), -1))]
wage_raw = wage_raw[,c(tail(sort(colnames(wage_raw)), 1), head(sort(colnames(wage_raw)), -1))]
education_raw = education_raw[,c(tail(sort(colnames(education_raw)), 1), head(sort(colnames(education_raw)), -1))]

# write.csv(wage_raw, file = "./data/wage_cleaned.csv", row.names = FALSE)
# write.csv(employment_raw, file = "./data/employment_cleaned.csv", row.names = FALSE)
# write.csv(education_raw, file = "./data/education_cleaned.csv", row.names = FALSE)
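For readers working in Python, the same alignment (subset to the countries present in the wage data, attach country names by code, then rename four countries) could be sketched with pandas as below. This is only an illustration under stated assumptions: the two small DataFrames and their values are invented, and `RENAMES` and `align_countries` are hypothetical names that do not appear in the project code. Only the `iso3c` and `country` column names and the four renames are taken from the R code above.

```python
import pandas as pd

# Renames taken from the R code above, so names match the ACS data
RENAMES = {
    "Korea, Rep.": "Korea",
    "Turkiye": "Turkey",
    "Czechia": "Czech Republic",
    "Slovak Republic": "Slovakia",
}


def align_countries(df, keep_codes, names):
    """Keep rows whose iso3c is in keep_codes and attach harmonized names.

    Hypothetical helper; `names` maps an iso3c code to a country name.
    """
    out = df[df["iso3c"].isin(keep_codes)].copy()
    out["country"] = out["iso3c"].map(names).replace(RENAMES)
    # Drop the code column and put the country name first
    out = out.drop(columns="iso3c")
    return out[["country"] + [c for c in out.columns if c != "country"]]


# Invented toy inputs, only to exercise the helper
employment = pd.DataFrame({
    "iso3c": ["KOR", "TUR", "BRA"],
    "country": ["Korea, Rep.", "Turkiye", "Brazil"],
    "X2021": [60.0, 45.0, 55.0],
})
wage = pd.DataFrame({"iso3c": ["KOR", "TUR"], "X2021": [1.0, 2.0]})

# Country names come from the employment data, as in the R code
names = dict(zip(employment["iso3c"], employment["country"]))

employment_clean = align_countries(employment, wage["iso3c"], names)
wage_clean = align_countries(wage, wage["iso3c"], names)
print(employment_clean)
print(wage_clean)
```

Keeping the renames in one dictionary means a new mismatch between sources needs one extra entry, not an extra branch in three loops.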
Employment rate
A data.frame: 6 × 32
    country    X1991   X1992   X1993   X1994   X1995   X1996   X1997   X1998   X1999   ...  X2012   X2013   X2014   X2015   X2016   X2017   X2018   X2019   X2020   X2021
    <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   ...  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
11  Australia  57.091  56.115  55.662  56.813  58.158  58.114  57.843  58.162  58.448  ...  61.749  61.264  60.766  61.068  61.166  61.527  62.152  62.547  60.829  62.478
12  Austria    53.930  54.558  54.281  56.969  56.841  55.714  55.402  55.597  56.198  ...  57.801  57.667  57.279  57.247  57.531  57.843  58.394  58.605  57.513  57.380
19  Belgium    45.831  46.305  45.712  45.473  45.810  45.631  45.998  46.111  47.420  ...  49.235  49.045  48.951  48.789  48.951  50.023  50.966  51.475  50.833  51.079
36  Canada     59.709  58.342  57.904  58.383  58.669  58.449  58.967  59.737  60.563  ...  61.652  61.761  61.430  61.290  61.110  61.591  61.597  61.991  57.964  60.216
41  Chile      50.410  51.885  52.344  51.957  53.392  52.434  52.577  52.602  50.328  ...  55.783  56.074  56.019  56.012  55.645  55.774  55.563  55.286  49.211  51.788
43  Colombia   59.965  60.616  62.045  62.113  62.111  60.033  59.962  57.909  53.953  ...  62.649  62.783  63.224  63.713  63.068  62.779  62.160  60.759  53.270  55.388
Annual average wage
A data.frame: 6 × 32
   country      X1991     X1992     X1993     X1994     X1995  X1996     X1997     X1998     X1999     ...  X2012  X2013  X2014  X2015  X2016  X2017  X2018  X2019  X2020  X2021
   <chr>        <dbl>     <dbl>     <dbl>     <dbl>     <int>  <dbl>     <dbl>     <dbl>     <dbl>     ...  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <dbl>
1  Australia    42309.38  43173.77  43578.24  43873.86  43715  44976.28  46355.56  46999.64  48064.87  ...  57752  57579  58102  57744  57885  57619  57843  58620  60377  60681.50
2  Austria      52697.40  53759.06  54197.37  54743.42  55184  54819.62  54346.95  56101.14  57354.90  ...  62515  62568  62801  63231  63860  63856  64101  64623  64648  65402.32
3  Belgium      53018.64  54718.73  55868.83  56952.78  56759  57470.56  58203.38  58238.28  61484.41  ...  64461  65099  65461  65017  65157  64700  65083  65700  63677  65520.82
4  Canada       42426.30  43045.90  42952.72  42464.18  42369  42745.09  44053.77  44793.21  45157.23  ...  53717  54286  54995  55400  54350  55122  56083  56370  59160  59568.78
5  Switzerland  56464.63  56948.80  57572.46  58283.21  58316  57610.92  58645.79  58814.06  59892.08  ...  69383  70280  70412  70797  70468  70087  69892  71189  69728  72358.44
6  Chile        NA        NA        NA        NA        17383  17383.36  18600.43  19515.21  20533.36  ...  29661  30530  30657  30615  31834  30847  32348  33190  31369  33042.33
Adult education attainment rate
A data.frame: 6 × 33
   country    SUBJECT  X1991     X1992  X1993     X1994     X1995     X1996  X1997     X1998     ...  X2012     X2013     X2014     X2015     X2016     X2017     X2018     X2019     X2020     X2021
   <chr>      <chr>    <dbl>     <dbl>  <dbl>     <dbl>     <dbl>     <dbl>  <dbl>     <dbl>     ...  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1  Australia  BUPPSRY  44.12706  NA     47.15905  49.80203  44.93585  NA     46.69527  43.95596  ...  23.56339  24.28443  22.89584  20.98090  20.06735  19.01266  18.10864  17.13266  16.24498  15.46888
2  Australia  TRY      31.15881  NA     22.46919  23.07237  24.31545  NA     24.30127  25.42071  ...  41.28236  39.53993  41.90185  42.88876  43.74390  45.35567  45.72748  47.12998  49.33745  49.76787
3  Australia  UPPSRY   24.71413  NA     30.37176  27.12561  30.74870  NA     29.00345  30.62333  ...  35.15425  36.17564  35.20231  36.13035  36.18875  35.63167  36.16388  35.73737  34.41757  34.76325
4  Austria    BUPPSRY  NA        NA     NA        NA        NA        NA     NA        NA        ...  17.08353  17.02800  16.14400  15.35130  15.47159  15.03758  14.70253  14.43586  14.34133  14.06036
5  Austria    TRY      NA        NA     NA        NA        NA        NA     NA        NA        ...  28.73994  29.73884  29.90490  30.55073  31.38396  32.39439  32.71143  33.77378  34.20589  34.60450
6  Austria    UPPSRY   NA        NA     NA        NA        NA        NA     NA        NA        ...  54.17653  53.23315  53.95110  54.09797  53.14444  52.56804  52.58605  51.79036  51.45277  51.33514
USCIS
The N-400 form is provided as a PDF. Each page is read and the text is concatenated into a single string. The string is then tokenized into words; non-alphabetic tokens, tokens shorter than three characters, and stop words are dropped, and the remaining words are lemmatized and lowercased. Uninformative words such as “uscis”, “answer”, and “page” are removed as well. Finally, the cleaned words are joined back into a single string.
Code
import fitz
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

# Make a function that cleans a pdf file
def clean_pdf(file):
    pdf_document = fitz.open(file)
    pdf_txt = ""
    # Loop through each page in the document and compile all text into one string
    for page_number in range(pdf_document.page_count):
        page_txt = pdf_document[page_number].get_text()
        pdf_txt = pdf_txt + page_txt
    cleaned_txt = ""
    doc = nlp(pdf_txt)
    # Loop through each word in the pdf string
    for token in doc:
        if token.is_alpha == True:  # Check whether token is alphabetic
            if len(token) > 2:  # Remove tokens that have fewer than 3 characters
                if token.is_stop != True:  # Remove stop words
                    token = token.lemma_  # Lemmatize the word
                    token = token.lower()  # Change to lower case
                    if token not in ["yes", "uscis", "answer", "number", "yyyy", "form", "edition", "page", "united", "states", "code"]:  # Remove some unnecessary words
                        cleaned_txt = cleaned_txt + token + " "
    return cleaned_txt

n_400 = clean_pdf("./data/n-400.pdf")
n_400
# with open("./data/cleaned_n_400.txt", "w") as file:
#     file.write(n_400)
'use explain apply basis qualify military service lawful permanent resident year addition marry live citizen spouse year spouse citizen year time file lawful permanent resident year year age information eligibility select box delay enter digit date stamp remarks current legal provide nickname family give middle applicable information person apply naturalization family give exactly appear permanent resident card applicable lawful permanent resident spouse citizen citizen spouse regularly engage specify employment abroad immigration nationality act ina section residential address outside file section select field office list like naturalization interview middle applicable start type print black ink type print item applicable indicate failure question delay citizenship immigration services process note complete part action block receipt application naturalization department homeland security citizenship immigration services omb expire biological legal adoptive mother father citizen birth naturalize reach birthday citizen consider file application visit website information topic review instruction application certificate citizenship application citizenship issuance certificate section note parent citizen complete information parent application skip biographic information information person apply naturalization continue middle applicable give family type print new like use space provide like legally change read instructions decide like legally change change optional date lawful permanent resident social security applicable country citizenship nationality country birth date birth online account gender male female physical developmental disability mental impairment prevent demonstrate knowledge understanding english language civic requirement naturalization submit complete medical certification disability exceptions file exemption english language test year age old live lawful permanent resident period total year time file year age old live lawful permanent resident 
period total year time file year age old live lawful permanent resident period total year time file meet requirement give simplified version civic test accommodation individual disabilities impairments deaf hard hear request following accommodation request sign language interpreter indicate language example american sign language select applicable box request accommodation disability impairment blind low vision request following accommodation note read information instructions complete middle applicable give family names birth include nickname alias maiden applicable information contact information residence province region foreign address city town street zip state county flr ste apt country foreign address live year provide recent residence list location live year need extra space use additional sheet paper current physical address postal foreign address daytime telephone work telephone evening telephone mobile telephone email address city town state zip street care flr ste apt postal foreign address province region foreign address country foreign address current mailing address different address date residence county type disability impairment example use wheelchair describe nature disability impairment accommodation request accommodation individual disabilities impairment continued usps zip lookup present information residence continued street flr ste apt city town county state zip postal foreign address province region foreign address physical address country foreign address street flr ste apt city town county state zip postal foreign address province region foreign address physical address country foreign address street date residence flr ste apt city town county state zip postal foreign address province region foreign address physical address country foreign address information parent mother citizen complete follow information item information mother parent marry birthday date residence date residence parent citizen skip mother country birth mother date 
birth date mother citizen know mother information parents continued middle applicable give family father country birth father date birth complete information date father citizen know father information father father citizen current legal citizen father american indian alaska native note require complete category conduct background check instructions information biographic information height feet inches race select applicable box native hawaiian pacific islander black african american asian white hispanic latino hispanic latino ethnicity select box weight pounds brown blue green hazel gray black pink maroon eye color select box bald hair sandy red white gray blond brown black hair color select box current legal citizen mother middle applicable give family list work attend school time time year provide information complete time period include military police intelligence service begin provide information recent current employment study unemployment applicable provide location date work self employ unemployed study year work type print self employ unemployed type print unemployed need extra space use additional sheet paper city town state zip flr apt ste street employer school occupation date date employer school city town state zip street flr ste apt postal foreign address province region foreign address postal foreign address province region foreign address information employment schools attend country foreign address occupation date date country foreign address employer school street flr ste apt zip state city town occupation date date postal foreign address province region foreign address country foreign address list trip hour long take outside year start recent trip work backwards need extra space use additional sheet paper date leave date return trip month countries travel total day outside trip hour long take outside year trip total day hour long spend outside year day time outside current marital status time marry include annulled marriage marriage people 
marriage person marry provide follow information current spouse current spouse legal divorced single marry widow marry marriage annul middle applicable give family separate marry spouse current member armed force current spouse previous legal middle applicable give family names current spouse include nickname alias maiden applicable family give middle applicable information marital history current spouse date birth date enter marriage current spouse single marry information marital history continued street province region foreign address postal foreign address zip state county city town current spouse present home address current spouse current employer company flr ste apt current spouse citizen item item country foreign address date current spouse citizen current spouse citizen complete follow information current spouse country citizenship nationality current spouse current spouse immigration status explain lawful permanent resident current spouse citizen current spouse citizen complete follow information birth complete following information item middle applicable give family time current spouse marry include annulled marriage marriage people marriage person current spouse marry provide follow information current spouse prior spouse legal current spouse prior spouse immigration status current spouse prior spouse know citizen lawful permanent resident explain date birth current spouse prior spouse country birth current spouse prior spouse country citizenship nationality current spouse prior spouse current spouse previous marriage provide information additional sheet paper information marital history continued current spouse date marriage prior spouse date current spouse marriage end prior spouse current spouse marriage end prior spouse annul divorced spouse deceased explain marry provide follow information prior spouse previous marriage provide information additional sheet paper explain spouse deceased divorced annul prior spouse immigration status marriage end 
know citizen lawful permanent resident explain middle applicable give family prior spouse date birth date marriage prior spouse date marriage end prior spouse marriage end prior spouse prior spouse country birth prior spouse country citizenship nationality prior spouse legal indicate total child indicate child include child alive miss deceased child bear country child year age old child currently married unmarried child live current stepchild legally adopt child child bear marry information child provide follow information child son daughter list item regardless age list additional child use additional sheet paper middle applicable give family child current legal date birth country birth information child continued child middle applicable give family date birth current address street province region foreign address postal foreign address zip state county city town child relationship example biological child stepchild legally adopt child flr ste apt current address street province region foreign address postal foreign address zip state county city town flr ste apt child relationship example biological child stepchild legally adopt child country foreign address country foreign address current legal current legal middle applicable give family country birth child country birth date birth information child continued current address street province region foreign address postal foreign address zip state county city town flr ste apt child relationship example biological child stepchild legally adopt child current legal middle applicable give family country birth current address street province region foreign address postal foreign address zip state county city town flr ste apt date birth child relationship example biological child stepchild legally adopt child country foreign address country foreign address child additional information person apply naturalization register vote federal state local election claim citizen writing way item numbers question include type print 
explanation additional sheet paper vote federal state local election hereditary title order nobility foreign country willing inherit title order nobility foreign country naturalization ceremony declare legally incompetent confine mental institution additional information person apply naturalization continue owe overdue federal state local taxis file federal state local tax return lawful permanent resident consider non resident call non resident federal state local tax return lawful permanent resident member involve way associate organization association fund foundation party club society similar group location world provide information need extra space attach name group additional sheet paper provide evidence support group purpose group date membership member way associate directly indirectly communist party totalitarian party advocate directly indirectly overthrow government force violence march work associate way directly indirectly nazi government germany government area occupy ally establish help nazi government germany persecute directly indirectly person race religion national origin membership particular social group political opinion terrorist organization german nazi military unit paramilitary unit self defense unit vigilante unit citizen unit police unit government agency office extermination camp concentration camp prisoner war camp prison labor camp transit camp additional information person apply naturalization continue involve way following genocide torture killing try kill badly hurt try hurt person purpose forcing try force kind sexual contact relation let practice religion member serve help participate follow group military unit paramilitary unit group people act like military group official military police unit self defense unit vigilante unit group people act like police official police rebel group guerrilla group group people use weapon physically attack military police government people militia army people official military insurgent 
organization group use weapon fight government worker volunteer soldier serve following labor camp place people force work detention facility place people force stay prison camp prison jail place people force stay group help group use weapon person group help group unit organization weapon person threaten group help group tell person use weapon person sell provide weapon person help person sell provide weapon person know person go use weapon person know person go sell weapon go use person convict crime offense charge commit attempt commit assist commit crime offense arrest cite detain law enforcement officer include immigration official official armed force reason commit assist commit attempt commit crime offense arrest item numbers apply record seal expunged clear disclose information include judge law enforcement officer attorney tell long constitute record tell disclose information additional information person apply naturalization continue complete probation parole receive suspend sentence place probation parole jail prison long jail prison year month day question item numbers complete table need extra space use additional sheet paper provide evidence support arrest cite detain charge date arrest cite detain charge arrest cite detain charge city town state country outcome disposition arrest citation detention charge charge file charge dismiss jail probation etc place alternative sentencing rehabilitative program example diversion deferred prosecution withhold adjudication deferred adjudication question item numbers skip item item receive type military paramilitary group people act like military group official military weapon training recruit ask enlist sign conscript require use person year age serve help armed force group use person year age help support people combat item numbers question item numbers include type print explanation additional sheet paper provide evidence support habitual drunkard prostitute procure prostitution sell smuggle control substance 
illegal drug narcotic married person time helped enter try enter illegally gamble illegally receive income illegal gambling marry order obtain immigration benefit additional information person apply naturalization continue give government official information documentation false fraudulent misleading lie government official gain entry admission gain immigration benefit remove exclude deport order remove exclude deport removal exclusion rescission deportation proceeding include administratively close proceeding currently pende place removal exclusion rescission deportation proceeding serve armed force currently member armed force fail support dependent pay alimony misrepresentation obtain public benefit schedule deploy overseas include vessel month refer address change section instruction notify learn deployment plan file currently station overseas court martiale administratively separate discipline receive honorable discharge armed force discharge training service armed force alien leave avoid draft armed force apply kind exemption military service armed force desert armed force male live time birthday include live lawful nonimmigrant selective service date register register selective service provide information additional information person apply naturalization continue item numbers question include type print explanation additional sheet paper provide evidence support support constitution government understand oath allegiance law require willing bear arm behalf law require willing perform noncombatant service armed force willing oath allegiance law require willing perform work national importance civilian direction register selective service system year age register apply naturalization complete selective service information year age year age file ina section register selective service attach statement explain register provide status information letter selective service applicant statement certification signature note read penalties section instructions complete 
read understand english read understand question instruction application question applicant statement interpreter question interpreter name read question instruction application language fluent understand note select box item item applicable select box item applicant statement applicant statement preparer request preparer name prepare application base information provide authorize applicant statement certification signature continued applicant certification understand require appear appointment biometric fingerprint photograph signature time require sign oath reaffirming applicant signature applicant signature date signature note applicant completely fill application fail submit require document list instruction deny application interpreter business organization city town state postal province street apt flr ste country interpreter mailing address zip interpreter give interpreter family provide follow information interpreter interpreter interpreter contact information certification signature copies document submit exact photocopy unaltered original document understand require submit original document later date furthermore authorize release information record need determine eligibility immigration benefit seek authorize release information contain application support document record entity person necessary administration enforcement immigration law certify penalty perjury provide authorize information application understand information contain submit application information complete true correct review provide authorize information application understand information contain submit application information complete true correct time filing interpreter contact information certification signature continue interpreter certification certify penalty perjury fluent english language specify item item read applicant identify language question instruction application question applicant inform understand instruction question application include applicant certification verify 
accuracy date signature interpreter signature interpreter signature contact information declaration signature person prepare application applicant provide follow information preparer preparer preparer give preparer family preparer business organization street apt flr ste city town state postal province country preparer mailing address zip interpreter contact information interpreter daytime telephone interpreter email address interpreter mobile telephone contact information declaration signature person prepare application applicant continued preparer contact information preparer daytime telephone preparer mobile telephone preparer email address attorney accredit representative prepare application behalf applicant applicant consent attorney accredit representative representation applicant case extend preparation application preparer statement extend note attorney accredit representative representation extend preparation application oblige submit complete notice entry appearance attorney accredited representative application preparer certification signature certify penalty perjury prepare application request applicant applicant review complete application inform understand information contain submit application include applicant certification information complete true correct complete application base information applicant provide authorize obtain use preparer signature date signature preparer signature note complete parts officer instruct interview swear affirm certify penalty perjury law america know content application naturalization subscribe include correction complete true correct evidence submit complete true correct signature interview subscribe swear affirm officer printed stamp date signature applicant signature officer signature item item affirm following officer renounce title heretofore hold renounce order nobility heretofore belong applicant signature applicant printed officer signature officer printed list title list order nobility renunciation foreign 
titles date signature obligation freely mental reservation purpose evasion help god perform work national importance civilian direction require law perform noncombatant service armed force require law applicant signature applicant printed middle applicable give family date signature bear arm behalf require law bear true faith allegiance support defend constitution law america enemy foreign domestic declare oath absolutely entirely renounce abjure allegiance fidelity foreign prince potentate state sovereignty heretofore subject citizen application approve schedule public oath ceremony time require follow oath allegiance immediately prior naturalized citizen sign acknowledge willingness ability oath oath allegiance '
Migration Policy Institute
As with the N-400 form, the text data from MPI are cleaned by removing non-alphabetic tokens, short words, and stop words, and by lemmatizing and lowercasing what remains.
Code
# pd, np, and nlp are defined in the code cells above
mpi_raw = pd.read_csv("./data/MPI_raw.csv")

cleaned_txt = []
for sentence in mpi_raw["text"]:
    doc = nlp(sentence)
    cleaned_sentence = ""
    for token in doc:
        if token.is_alpha == True:  # Check whether token is alphabetic
            if len(token) > 2:  # Remove tokens that have fewer than 3 characters
                if token.is_stop != True:  # Remove stop words
                    token = token.lemma_  # Lemmatize the word
                    token = token.lower()  # Change to lower case
                    cleaned_sentence = cleaned_sentence + token + " "
    cleaned_txt.append(cleaned_sentence)

mpi_raw["text"] = np.array(cleaned_txt)
mpi_raw.head()
# mpi_raw.to_csv("./data/mpi_cleaned.csv", index = False)
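The N-400 and MPI cleaning steps apply the same token filter, so it could be factored into one helper. The sketch below isolates that filter using only the standard library, so it runs without spaCy. It is an approximation and not the pipeline used above: tokenization is a simple regular expression, `STOP_WORDS` is a tiny illustrative list (spaCy's is far larger), lemmatization is a pluggable function that defaults to doing nothing, and the names `clean_text`, `STOP_WORDS`, and `N400_NOISE` are hypothetical. Only the contents of `N400_NOISE` are taken from the code above.

```python
import re

# Tiny illustrative stop-word list; spaCy's built-in list is much larger.
STOP_WORDS = {"the", "and", "are", "you", "for", "this", "that", "with"}

# Words removed from the N-400 text in the code above
N400_NOISE = {"yes", "uscis", "answer", "number", "yyyy", "form",
              "edition", "page", "united", "states", "code"}


def clean_text(text, lemmatize=str, extra_stop=()):
    """Filter and normalize the words of `text` (hypothetical helper).

    `lemmatize` maps a word to its lemma; the default leaves it unchanged.
    `extra_stop` holds lowercase lemmas to drop after lemmatization.
    """
    drop = set(extra_stop)
    kept = []
    # Runs of ASCII letters stand in for spaCy's alphabetic tokens
    for word in re.findall(r"[A-Za-z]+", text):
        if len(word) <= 2:  # drop words with fewer than 3 characters
            continue
        if word.lower() in STOP_WORDS:  # drop stop words
            continue
        lemma = lemmatize(word).lower()  # lemmatize, then lowercase
        if lemma in drop:  # drop domain-specific noise words
            continue
        kept.append(lemma)
    return " ".join(kept)


form_line = clean_text("Are you applying for the USCIS Form N-400 on page 2?",
                       extra_stop=N400_NOISE)
plain_line = clean_text("The applicant lives in Ohio")
print(form_line)
print(plain_line)
```

With spaCy available, `lemmatize` could be replaced by a function that returns a token's lemma, and the N-400 and MPI steps would differ only in whether `extra_stop` is passed. Unlike the original loops, the helper joins words with single spaces and leaves no trailing space.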