Clustering

Clustering is an unsupervised machine learning technique. Rather than predicting labels or target values, it uncovers structure in the data by grouping similar observations into clusters.

1. KMeans Clustering
KMeans clustering partitions the data into k clusters, each represented by a centroid. A centroid is the “center” of its cluster, calculated as the mean of all points assigned to that cluster. Each point is assigned to its nearest centroid, typically by Euclidean distance (variants such as k-medians use Manhattan distance). The assignment and update steps repeat until the algorithm converges, that is, until the assignments stop changing and the within-cluster error can no longer be reduced.
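
The assign-then-update loop can be sketched in a few lines of NumPy. This is a minimal illustration on made-up 2-D points, not scikit-learn's implementation; the helper `lloyd_kmeans`, the toy data, and the hand-picked starting centroids are all assumptions for this example.

```python
import numpy as np

def lloyd_kmeans(points, centroids, n_iter=10):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's code).

    Assumes no cluster ever becomes empty, which holds for the toy data below.
    """
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([points[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups (made up for illustration)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
init = points[[0, 3]].copy()  # start from one point in each group
labels, centroids = lloyd_kmeans(points, init)
print(labels)
print(centroids)
```

With this starting point, the first three points end up in one cluster and the last three in the other, and each centroid sits at the average of its three points.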

2. DBSCAN
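DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds regions where points are densely packed and builds clusters around them. Unlike KMeans, it does not need the number of clusters in advance and does not assume that clusters are compact around a center. Instead, two parameters, a neighborhood radius (eps) and a minimum number of neighbors (min_samples), define what counts as dense. Points that do not belong to any dense region are labeled as noise rather than being forced into a cluster, as they would be in KMeans.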
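
A small sketch of how scikit-learn's DBSCAN treats an isolated point may help here. The data and parameter values are made up for illustration; in scikit-learn, points outside every dense region receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (made up): two dense groups plus one isolated point
pts = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
                [20.0, 20.0]])

# eps is the neighborhood radius; min_samples is the number of points
# (including the point itself) required for a point to be a core point
labels = DBSCAN(eps=0.5, min_samples=3).fit(pts).labels_
print(labels)
```

The two dense groups each form a cluster, while the isolated point at (20, 20) gets the label -1 and belongs to neither.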

3. Agglomerative Clustering
Agglomerative Clustering is one of the two approaches to Hierarchical Clustering. The agglomerative (bottom-up) method starts by assigning every data point to its own cluster and then repeatedly merges the closest clusters until a single cluster contains all the data points. The divisive (top-down) method, on the other hand, starts from one big cluster containing all points and splits it until each point forms its own cluster. In both cases, a linkage function defines the distance between clusters and therefore determines which clusters are merged or split.
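
The bottom-up merging can be seen on a tiny example. The sketch below uses made-up 1-D points and single linkage (the distance between two clusters is the distance between their closest members); the linkage choice is an assumption for illustration and is not the setting used later in this section.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 1-D data (made up): gaps of 1, 4, 2 and 13 between neighboring points
pts = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])

labels_by_k = {}
for k in (3, 2):
    # Merging stops once k clusters remain
    model = AgglomerativeClustering(n_clusters=k, linkage="single").fit(pts)
    labels_by_k[k] = model.labels_
    print(k, "clusters:", model.labels_)
```

With three clusters, the points group as {0, 1}, {5, 7} and {20}. Going down to two clusters merges the first two groups, because the gap of 4 between them is smaller than the gap of 13 to the last point.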

In this section, these three clustering methods are used to explore the relationship between success rate and place of birth. This allows us to see whether countries differ in success rate based on their features. Hyperparameters are tuned for all three methods, and the resulting clusters are then compared with one another and with the true labels.

Data Preparation

# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score


sns.set_theme(palette="Set2")
acs = pd.read_csv("./data/acs_cleaned.csv")

# Replace zero values in 'WAGP' column with 2 so that the logarithm below is defined (log(0) is -inf)
acs.loc[acs['WAGP'] == 0, 'WAGP'] = 2
# Create a new column 'NORM_WAGP' containing the logarithm of 'WAGP'
acs["NORM_WAGP"] = np.log(acs["WAGP"])

# Convert specific columns to categorical type for better memory usage and analysis
acs['NATIVITY'] = acs['NATIVITY'].astype('category')
acs['DECADE'] = acs['DECADE'].astype('category')
acs['ENG'] = acs['ENG'].astype('category')
acs['MAR'] = acs['MAR'].astype('category')
acs['RAC1P'] = acs['RAC1P'].astype('category')
acs['SEX'] = acs['SEX'].astype('category')
acs['ESR'] = acs['ESR'].astype('category')
acs['SCHL'] = acs['SCHL'].astype('category')

# Filter the dataset to consider only immigrants
immigrants = acs[acs["NATIVITY"]==2]

# Create dummy variables for certain columns and drop specific columns from the dataframe
df = pd.get_dummies(immigrants.drop(["POBP", "WAGP", "NATIVITY", "SCHL"], axis=1))
df2 = pd.get_dummies(immigrants[["SCHL"]])

# Group by 'POBP' (Place of Birth) and calculate mean for 'SCHL' (Educational Attainment) values
df['POBP'] = immigrants["POBP"]
df2['POBP'] = immigrants["POBP"]
SCHL = df2.groupby("POBP").agg("mean")
SCHL = SCHL.reset_index()

# Group by 'POBP' (Place of Birth) and calculate mean for other columns
X = df.groupby("POBP").agg("mean")
X = X.reset_index()

# Create a binary target variable 'y' based on 'SUCCESS' column values
y = X["SUCCESS"] > 0.5

# Create a variable 'wagp' containing 'NORM_WAGP' column values
wagp = X["NORM_WAGP"]

# Normalize 'AGEP' column values and drop unnecessary columns
X["NORM_AGEP"] = (X["AGEP"] - np.min(X["AGEP"])) / (np.max(X["AGEP"]) - np.min(X["AGEP"]))
X = X.drop(["AGEP", "POBP", "SUCCESS", "NORM_WAGP"], axis=1)
X.head()

	DECADE_1	DECADE_2	DECADE_3	DECADE_4	DECADE_5	DECADE_6	DECADE_7	DECADE_8	ENG_0	...	RAC1P_9	SEX_1	SEX_2	ESR_1	ESR_2	ESR_3	ESR_4	ESR_6	NORM_AGEP
0	0.004357	0.004357	0.019608	0.061002	0.276688	0.224401	0.209150	0.200436	0.087146	...	0.315904	0.485839	0.514161	0.557734	0.017429	0.065359	0.004357	0.355120	0.350466
1	0.006897	0.003448	0.020690	0.100000	0.144828	0.244828	0.313793	0.165517	0.331034	...	0.034483	0.517241	0.482759	0.641379	0.024138	0.037931	0.010345	0.286207	0.368953
2	0.003300	0.004950	0.016502	0.024752	0.024752	0.341584	0.415842	0.168317	0.099010	...	0.006601	0.480198	0.519802	0.694719	0.023102	0.056106	0.000000	0.226073	0.310732
3	0.012346	0.037037	0.037037	0.037037	0.092593	0.271605	0.296296	0.216049	0.148148	...	0.043210	0.611111	0.388889	0.654321	0.012346	0.061728	0.000000	0.271605	0.398003
4	0.000000	0.025974	0.064935	0.181818	0.324675	0.207792	0.116883	0.077922	0.324675	...	0.324675	0.415584	0.584416	0.610390	0.012987	0.038961	0.000000	0.337662	0.506213

5 rows × 37 columns

The US Census data has 12 columns, 9 of which are categorical. Since these clustering methods are not directly applicable to categorical variables, the categorical columns are one-hot encoded and the dataset is aggregated by the POBP column, so each row holds the category proportions for one place of birth. During this process, the SCHL and WAGP columns are dropped from the clustering input and processed separately for plotting the clusters, because the SUCCESS label is based on those columns. The AGEP column is min-max normalized so that it has the same range as the other variables.
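
The aggregation step can be illustrated on a toy table. The values below are made up; only the column names mirror the ACS columns used above. The point is that the mean of a 0/1 dummy column within a group is the share of people in that category.

```python
import pandas as pd

# Toy person-level records (made-up values; column names mirror the ACS ones)
people = pd.DataFrame({
    "POBP": [1, 1, 1, 1, 2, 2],
    "SEX": pd.Categorical([1, 2, 2, 2, 1, 1]),
    "AGEP": [20, 30, 40, 50, 60, 40],
})

# One-hot encode the categorical column, then average within each place of birth
dummies = pd.get_dummies(people.drop(columns="POBP"), dtype=float)
dummies["POBP"] = people["POBP"]
by_country = dummies.groupby("POBP").mean()

# Min-max scale the mean age to [0, 1] so it is comparable with the proportions
age = by_country["AGEP"]
by_country["NORM_AGEP"] = (age - age.min()) / (age.max() - age.min())
print(by_country)
```

For place of birth 1, three of the four people have SEX equal to 2, so SEX_2 is 0.75 in the aggregated row.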

Hyper Parameter Tuning

KMeans Clustering

# Initialize empty lists to store silhouette scores and the number of clusters
sil_score = []
n_cluster = []

# Loop over KMeans cluster counts from 2 to 50 (the silhouette score needs at least 2 clusters)
for k in range(2, 51):
    model = KMeans(n_clusters=k, n_init=1).fit(X)
    try:
        # Calculate silhouette score and store it along with the corresponding cluster count
        score = silhouette_score(X, model.labels_)
    except ValueError:
        # Raised when the labels do not form a valid number of clusters; skip this k
        continue
    n_cluster.append(k)
    sil_score.append(score)

# Plot a line plot showing how silhouette score changes with different cluster counts
sns.lineplot(x=n_cluster, y=sil_score)
plt.xlabel("Clusters")
plt.ylabel("Silhouette Score")
plt.show()

# Find the cluster count (K) with the highest silhouette score
print("The silhouette score is highest when K =", n_cluster[np.argmax(sil_score)])

# Fit a KMeans model using the K value with the highest silhouette score
kmeans = KMeans(n_clusters=n_cluster[np.argmax(sil_score)], n_init=1).fit(X)

The silhouette score is highest when K = 5
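
For context on the metric used to pick K: for each point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand on made-up data and compares it with `silhouette_score`. The helper `silhouette_by_hand` is hypothetical and assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy 1-D data (made up) with two obvious groups
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(pts, labels):
    """Mean of (b - a) / max(a, b) over all points (illustrative helper)."""
    scores = []
    for i in range(len(pts)):
        dists = np.linalg.norm(pts - pts[i], axis=1)
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dists[same].mean()  # mean distance within its own cluster
        # mean distance to the nearest other cluster
        b = min(dists[labels == k].mean() for k in set(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

by_hand = silhouette_by_hand(pts, labels)
from_sklearn = silhouette_score(pts, labels)
print(by_hand, from_sklearn)
```

Values close to 1 mean that points sit much closer to their own cluster than to any other, which is why the tuning loops keep the parameters with the highest score.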

DBSCAN

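# Candidate values for eps (neighborhood radius) and min_samples
l_eps = np.arange(0.1, 2, 0.1)
n_sample = range(1, 10)
sil_score = []
n_cluster = []
# Silhouette scores range from -1 to 1, so start at the minimum
best_score = -1
opt_eps, opt_sample = None, None

# Loop through different combinations of epsilon and minimum samples
for eps in l_eps:
    for n in n_sample:
        # Fit DBSCAN model with varying eps and min_samples
        model = DBSCAN(eps=eps, min_samples=n).fit(X)
        try:
            # Calculate silhouette score; noise points (label -1) are scored as if they formed one cluster
            score = silhouette_score(X, model.labels_)
        except ValueError:
            # Raised when the labels do not form a valid number of clusters; skip this combination
            continue
        # Store the score with the number of distinct labels (this count includes the noise label -1, if present)
        n_cluster.append(len(np.unique(model.labels_)))
        sil_score.append(score)

        # Update best_score and optimal parameters if a higher score is found
        if score > best_score:
            best_score = score
            opt_eps = eps
            opt_sample = n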

# Plot a line plot showing how silhouette score changes with different cluster counts
sns.lineplot(x=n_cluster, y=sil_score)
plt.xlabel("Clusters")
plt.ylabel("Silhouette Score")
plt.show()        

# Print the best parameters for epsilon and minimum samples and their corresponding silhouette score
print("The best parameters for eps and min_sample are", opt_eps, "and", opt_sample, "having the highest silhouette score of", best_score)

# Fit a DBSCAN model using the optimal epsilon and minimum samples
dbscan = DBSCAN(eps=opt_eps, min_samples=opt_sample).fit(X)

The best parameters for eps and min_sample are 0.5 and 9 having the highest silhouette score of 0.4197274085360764

Agglomerative Clustering

# Initialize empty lists to store silhouette scores and the number of clusters
sil_score = []
n_cluster = []

# Loop through different numbers of clusters from 2 to 50
for i in range(2, 51):
    # Fit Agglomerative Clustering model with 'i' clusters
    model = AgglomerativeClustering(n_clusters=i).fit(X)
    
    # Obtain cluster labels from the fitted model
    labels = model.labels_
    
    # Calculate silhouette score and store it along with the corresponding cluster count
    sil_score.append(silhouette_score(X, labels))
    n_cluster.append(i)

# Plot a line plot showing how silhouette score changes with different cluster counts
plt.plot(n_cluster, sil_score)
plt.xlabel("Clusters")
plt.ylabel("Silhouette Score")
plt.show()

# Find the cluster count (K) with the highest silhouette score
print("The silhouette score is highest when K =", n_cluster[np.argmax(sil_score)])

# Fit an Agglomerative Clustering model using the K value with the highest silhouette score
ag = AgglomerativeClustering(n_clusters=n_cluster[np.argmax(sil_score)]).fit(X)

The silhouette score is highest when K = 5

Clustering

# Plot every country by mean log wage (x) and share of SCHL code 21 (y),
# colored by the true label and by each method's cluster assignment
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(16, 9))
sns.scatterplot(x=wagp, y=SCHL["SCHL_21"], hue=y, data=X, ax=axes[0, 0])
sns.scatterplot(x=wagp, y=SCHL["SCHL_21"], hue=kmeans.labels_, data=X, ax=axes[0, 1])
sns.scatterplot(x=wagp, y=SCHL["SCHL_21"], hue=dbscan.labels_, data=X, ax=axes[1, 0])
sns.scatterplot(x=wagp, y=SCHL["SCHL_21"], hue=ag.labels_, data=X, ax=axes[1, 1])
axes[0,0].set_title("True Label")
axes[0,1].set_title("KMeans")
axes[1,0].set_title("DBSCAN")
axes[1,1].set_title("Agglomerative Clustering")
plt.suptitle("Clustering on optimal parameters")
plt.tight_layout()
plt.show()

With the optimal parameters, none of the methods clusters the countries well by success rate. KMeans and Agglomerative Clustering show only a very weak trend. In Agglomerative Clustering, countries coded with lighter colors are on the left and countries coded with darker colors are on the right. In KMeans, there is a cluster of lighter-colored countries in the middle.

# Refit KMeans and Agglomerative Clustering with exactly 2 clusters;
# DBSCAN has no cluster-count parameter, so eps and min_samples are set by hand
kmeans2 = KMeans(n_clusters=2, n_init=1).fit(X)
ag2 = AgglomerativeClustering(n_clusters=2).fit(X)
dbscan2 = DBSCAN(eps=0.9, min_samples=8).fit(X)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(16, 9))
sns.scatterplot(x=wagp, y=SCHL["SCHL_21"], hue=y, data=X, ax=axes[0, 0])
sns.scatterplot(x=wagp, y=SCHL["SCHL_21"], hue=kmeans2.labels_, data=X, ax=axes[0, 1])
sns.scatterplot(x=wagp, y=SCHL["SCHL_21"], hue=dbscan2.labels_, data=X, ax=axes[1, 0])
sns.scatterplot(x=wagp, y=SCHL["SCHL_21"], hue=ag2.labels_, data=X, ax=axes[1, 1])
axes[0,0].set_title("True Label")
axes[0,1].set_title("KMeans")
axes[1,0].set_title("DBSCAN")
axes[1,1].set_title("Agglomerative Clustering")
plt.suptitle("Clustering with 2 clusters")
plt.tight_layout()
plt.show()

For comparison, KMeans and Agglomerative Clustering are forced to create only 2 clusters, and DBSCAN, which has no parameter for the number of clusters, is run with eps=0.9 and min_samples=8. KMeans and Agglomerative Clustering produce clusters that are similar to each other but still different from the true labels. DBSCAN does worse, placing almost every country in a single cluster.
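
The visual comparison could be backed by a number. One option, which is not part of the original analysis, is the adjusted Rand index from scikit-learn: it is 1 when two labelings describe the same partition (even if the label names differ) and close to 0 for unrelated labelings. The label vectors below are made up and merely stand in for `y`, `kmeans2.labels_` and `ag2.labels_`.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up label vectors standing in for y, kmeans2.labels_ and ag2.labels_
true_labels = [0, 0, 0, 1, 1, 1]
kmeans_labels = [1, 1, 1, 0, 0, 0]  # same partition, label names swapped
ag_labels = [0, 0, 1, 1, 1, 1]      # one point assigned differently

ari_true_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
ari_kmeans_ag = adjusted_rand_score(kmeans_labels, ag_labels)
print(ari_true_kmeans, ari_kmeans_ag)
```

Because the index ignores how clusters are numbered, it can compare two clustering methods directly, or compare a clustering with the true labels.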

Results

The results are limited because the dataset is aggregated, so the input data does not fully represent the original person-level dataset. All three methods failed to create clear clusters among countries with their optimal hyperparameters. This might imply that there are no clear differences among immigrants based on their place of birth. However, when the methods are forced to create only two clusters, KMeans and Agglomerative Clustering produce similar clusters in the space of SCHL and normalized wage.