Dimensionality Reduction

Dimensionality reduction is a useful data science technique when a dataset has many features. It reduces the number of dimensions by extracting new components from the input features while preserving as much of the variance as possible. The resulting components make downstream algorithms faster to run and the data easier to visualize.

1. Principal Component Analysis
Principal Component Analysis (PCA) reduces the dimension using linear relationships between features. It computes the eigenvectors of the covariance matrix, which give the directions of greatest variance in the data, and keeps the top components needed to explain the desired share of the variance.

2. T-distributed Stochastic Neighbor Embedding
Unlike PCA, T-distributed Stochastic Neighbor Embedding (TSNE) doesn’t rely on linear relationships to build new components. TSNE converts distances between data points into pairwise similarities, models the similarities in the low-dimensional space with a t-distribution, and positions the points so that the KL divergence between the two sets of similarities is minimized. TSNE is particularly good at preserving the local structure of the data, which helps in understanding complex datasets.
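The quantity that TSNE minimizes can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation: it assumes a single fixed Gaussian bandwidth `sigma` for all points, whereas real TSNE tunes one bandwidth per point to match the chosen perplexity. All function names and the random data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    # Squared Euclidean distance between every pair of rows
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.maximum(D, 0.0)

def gaussian_affinities(X, sigma=1.0):
    # High-dimensional similarities P (assumption: one shared bandwidth)
    W = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def student_t_affinities(Y):
    # Low-dimensional similarities Q from a t-distribution with 1 degree of freedom
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    # KL(P || Q), summed over pairs with non-zero P
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
X_high = rng.normal(size=(20, 5))   # hypothetical 5-dimensional data
Y_low = rng.normal(size=(20, 2))    # a random (unoptimized) 2-D embedding

P = gaussian_affinities(X_high)
Q = student_t_affinities(Y_low)
print("KL divergence of a random embedding:", kl_divergence(P, Q))
```

TSNE moves the rows of `Y_low` by gradient descent to push this value down; the heavy tails of the t-distribution let dissimilar points sit far apart in the embedding.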

Both PCA and TSNE are used in this section. In a real-world setting there would be no labels, but to gauge how effective the two methods are, we plot their components colored by success-rate labels. This shows how much structure or variance is preserved in the components. A comparison of the performance of PCA and TSNE follows.

Data Preparation

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy import stats
sns.set_theme(palette="Set2")

acs = pd.read_csv("./data/acs_cleaned.csv")

# Treat the coded survey variables as categorical
categorical_cols = ["NATIVITY", "DECADE", "ENG", "MAR", "RAC1P", "SEX", "ESR", "SCHL"]
for col in categorical_cols:
    acs[col] = acs[col].astype("category")

# Keep only the rows with NATIVITY == 2 (the immigrant subset)
immigrants = acs[acs["NATIVITY"] == 2]

# One-hot encode the categorical features, then average per POBP group
# so that each dummy column becomes the share of the group in that category
df = pd.get_dummies(immigrants.drop(["POBP", "WAGP", "SCHL", "NATIVITY"], axis=1))
df["POBP"] = immigrants["POBP"]
X = df.groupby("POBP").agg("mean")
X = X.reset_index()
# Label a group as successful when its mean SUCCESS exceeds 0.5
y = X["SUCCESS"] > 0.5
# Min-max scale the mean age to [0, 1] to match the proportion features
X["NORM_AGEP"] = (X["AGEP"] - np.min(X["AGEP"])) / (np.max(X["AGEP"]) - np.min(X["AGEP"]))
# Drop the raw age, the group key, and the column the label was derived from
X = X.drop(["AGEP", "POBP", "SUCCESS"], axis=1)
X.head()

	DECADE_1	DECADE_2	DECADE_3	DECADE_4	DECADE_5	DECADE_6	DECADE_7	DECADE_8	ENG_0	...	RAC1P_9	SEX_1	SEX_2	ESR_1	ESR_2	ESR_3	ESR_4	ESR_6	NORM_AGEP
0	0.004357	0.004357	0.019608	0.061002	0.276688	0.224401	0.209150	0.200436	0.087146	...	0.315904	0.485839	0.514161	0.557734	0.017429	0.065359	0.004357	0.355120	0.350466
1	0.006897	0.003448	0.020690	0.100000	0.144828	0.244828	0.313793	0.165517	0.331034	...	0.034483	0.517241	0.482759	0.641379	0.024138	0.037931	0.010345	0.286207	0.368953
2	0.003300	0.004950	0.016502	0.024752	0.024752	0.341584	0.415842	0.168317	0.099010	...	0.006601	0.480198	0.519802	0.694719	0.023102	0.056106	0.000000	0.226073	0.310732
3	0.012346	0.037037	0.037037	0.037037	0.092593	0.271605	0.296296	0.216049	0.148148	...	0.043210	0.611111	0.388889	0.654321	0.012346	0.061728	0.000000	0.271605	0.398003
4	0.000000	0.025974	0.064935	0.181818	0.324675	0.207792	0.116883	0.077922	0.324675	...	0.324675	0.415584	0.584416	0.610390	0.012987	0.038961	0.000000	0.337662	0.506213

5 rows × 37 columns

Since most of the features in the US Census dataset are categorical, they are one-hot encoded and the dataset is aggregated by the POBP column, so each feature becomes the proportion of a POBP group that falls in a given category. The POBP, WAGP, SCHL, and NATIVITY columns are dropped, and the AGEP column is normalized so that it has the same range as the other variables. PCA and TSNE are then applied to this new dataset and compared.
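The effect of the aggregation can be seen on a toy table. This is a sketch on invented data (the `people` frame and its values are hypothetical, not taken from the ACS file); it only illustrates that averaging one-hot columns per POBP yields category proportions and that min-max scaling puts the mean age on the same 0 to 1 range.

```python
import pandas as pd

# Hypothetical toy data: one row per person
people = pd.DataFrame({
    "POBP": [100, 100, 100, 200, 200],
    "SEX": pd.Categorical([1, 2, 2, 1, 1]),
    "AGEP": [30, 40, 50, 20, 30],
})

# One-hot encode the categorical column, keeping POBP aside as the group key
dummies = pd.get_dummies(people.drop(columns="POBP"))
dummies["POBP"] = people["POBP"]

# The mean of a 0/1 column within a group is the share of that category
by_country = dummies.groupby("POBP").mean()

# Min-max scale the mean age so it lies in [0, 1] like the proportions
age = by_country["AGEP"]
by_country["NORM_AGEP"] = (age - age.min()) / (age.max() - age.min())
by_country = by_country.drop(columns="AGEP")

print(by_country)
```

In this toy table, two of the three people in group 100 have `SEX` equal to 2, so `SEX_2` for that group is 2/3.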

PCA

# Initialize PCA with 5 components
pca = PCA(n_components=5)

# Fit PCA on the data and transform it into principal components
Xc = pca.fit_transform(X)

# Share of variance per component, relative to the 5 retained components only
# (pca.explained_variance_ratio_ gives the share of the total variance instead)
proportion_of_variance = pca.explained_variance_ / np.sum(pca.explained_variance_)

# Plot the cumulative proportion of variance explained by the principal components
plt.plot(np.arange(1, 6), np.cumsum(proportion_of_variance), marker="o")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative proportion variance explained")
plt.title("Variance explained by number of components")
plt.show()

Looking at the plot of the cumulative proportion of variance explained, the optimal number of principal components is 3, as they account for more than 80% of the variance captured by the five retained components.
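The curve above is normalized over the five retained components, so it reaches 100% at five by construction. The sketch below, on hypothetical synthetic data (`X_demo` and the 0.80 threshold are assumptions for illustration), performs PCA directly from the covariance matrix and contrasts that normalization with the share of the total variance, which is the more conservative basis for picking the number of components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 6 features with decreasing spread
X_demo = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])

# Center the data and eigendecompose the covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Cumulative share relative to the total variance (all 6 components)
cum_total = np.cumsum(eigvals / eigvals.sum())
# Cumulative share relative to the top 5 components only
top5 = eigvals[:5]
cum_top5 = np.cumsum(top5 / top5.sum())

# Smallest number of components reaching 80% of the total variance
k = int(np.searchsorted(cum_total, 0.80) + 1)
scores = X_centered @ eigvecs[:, :k]        # data projected onto k components

print("relative to total:", np.round(cum_total[:5], 3))
print("relative to top 5:", np.round(cum_top5, 3))
print("components needed for 80% of total variance:", k)
```

The second curve is never below the first, so a threshold read from it can overstate how much of the original variance is kept. With scikit-learn, `pca.explained_variance_ratio_` gives the share of the total variance directly.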

sns.scatterplot(x=Xc[:,0], y=Xc[:,1], hue=y)
plt.title("PCA")
plt.xlabel("First Component")
plt.ylabel("Second Component")
plt.show()

The plot above shows the first two principal components, which explain about 70% of the variance captured by the five retained components. No clear clusters are visible in this space.

TSNE

# Perplexity values to try: 1, then 10 to 140 in steps of 10
per = np.append(1,np.arange(10,150,10))
# Initialize an empty list to store KL divergence values
kld = []
# Iterate through each perplexity value
for i in per:
    # Create a t-SNE model with 2 components and the current perplexity value
    model = TSNE(n_components=2, perplexity=i, init='random')
    # Fit the t-SNE model and transform the data
    Xt = model.fit_transform(X)
    # Append the KL divergence value to the list
    kld.append(model.kl_divergence_)
# Print the optimal hyperparameter (perplexity) based on the minimum KL divergence
print("Optimal hyper parameter:",per[np.argmin(kld)])

Optimal hyper parameter: 140

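Based on the KL divergence, the optimal perplexity is 140. Because 140 is very close to the actual number of rows, this suggests that TSNE is not a good dimensionality reduction method for this dataset. Note also that the KL divergence generally decreases as perplexity grows, so a minimum at the largest value tried is expected and is a weak basis for choosing the perplexity.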
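Because `init='random'` is used without a `random_state`, repeated runs of the sweep give different embeddings and slightly different KL values. A minimal sketch of a repeatable sweep follows; the synthetic `X_demo`, the seed, and the short perplexity grid are all assumptions for illustration, not the values used in this section.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))   # hypothetical data: 60 rows, 5 features

perplexities = [5, 15, 30]          # each must be smaller than the number of rows
kld_demo = []
for p in perplexities:
    # Fixing random_state makes the random initialization repeatable
    model = TSNE(n_components=2, perplexity=p, init="random", random_state=0)
    model.fit_transform(X_demo)
    kld_demo.append(float(model.kl_divergence_))

for p, k in zip(perplexities, kld_demo):
    print(f"perplexity={p:>3}  KL divergence={k:.3f}")
```

Fixing the seed removes run-to-run noise, but it does not make KL values comparable across perplexities, since each perplexity defines a different target distribution.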

# Initialize a t-SNE model with optimal perplexity (obtained from previous calculations)
tsne1 = TSNE(n_components=2, perplexity=per[np.argmin(kld)], init='random')
# Fit the t-SNE model and transform the original data
Xt1 = tsne1.fit_transform(X)
sns.scatterplot(x=Xt1[:,0], y=Xt1[:,1], hue=y)
plt.title("TSNE (Perplexity = 140)")
plt.show()

As the plot shows, TSNE with perplexity 140 does not separate countries with high and low success rates well.

tsne2 = TSNE(n_components=2, perplexity=30, init='random')
Xt2 = tsne2.fit_transform(X)
sns.scatterplot(x=Xt2[:,0], y=Xt2[:,1], hue=y)
plt.title("TSNE (Perplexity = 30)")
plt.show()

When the perplexity is reduced to 30, the plot has fewer groups of points, yet it still does not differentiate countries with higher and lower success rates well.

Results

Neither PCA nor TSNE produces components that meaningfully explain the difference in success rate between countries.

def merit(x,y,correlation="pearson"):
    # x=matrix of features
    # y=matrix (or vector) of targets
    # correlation="pearson" or "spearman"
    # Note: each mean below runs over the full stacked correlation matrix,
    # including its diagonal and the feature-feature entries
    k = x.shape[1]
    if correlation == "pearson":
        rho_xx = np.mean(np.corrcoef(x,x,rowvar=False))
        rho_xy = np.mean(np.corrcoef(x,y,rowvar=False))
    elif correlation == "spearman":
        rho_xx = np.mean(stats.spearmanr(x,x, axis=0)[0])
        rho_xy = np.mean(stats.spearmanr(x,y, axis=0)[0])
    # Higher when components track the target and are not redundant
    score = k*np.absolute(rho_xy)/(np.sqrt(k+k*(k-1)*np.absolute(rho_xx)))
    return score
print("The merit of PCA:",merit(Xc,y))
print("The merit of TSNE when perplexity = 140:",merit(Xt1,y))
print("The merit of TSNE when perplexity = 30:",merit(Xt2,y))

The merit of PCA: 0.2660165732902509
The merit of TSNE when perplexity = 140: 0.3456705197613567
The merit of TSNE when perplexity = 30: 0.6098311363567865

In terms of merit, TSNE is the better method for extracting features that are highly correlated with the output but uncorrelated with each other: TSNE with perplexity 30 scores about 0.61, compared with about 0.27 for PCA.
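For comparison, the merit score from correlation-based feature selection is usually written with the mean feature-target correlation in the numerator and the mean correlation over distinct feature pairs in the denominator. The sketch below is a variant under that definition, not the function used above; the name `cfs_merit` and the synthetic data are hypothetical, and it assumes Pearson correlation and a one-dimensional target.

```python
import numpy as np

def cfs_merit(x, y):
    # x: (n_samples, k) feature matrix, y: (n_samples,) target
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    k = x.shape[1]
    # Mean absolute correlation between each feature and the target
    r_cf = np.mean([abs(np.corrcoef(x[:, j], y)[0, 1]) for j in range(k)])
    # Mean absolute correlation over distinct feature pairs (off-diagonal only)
    if k > 1:
        corr = np.abs(np.corrcoef(x, rowvar=False))
        r_ff = corr[np.triu_indices(k, k=1)].mean()
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical check: two independent features that both drive a binary label
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(500, 2))
y_demo = (x_demo[:, 0] + x_demo[:, 1]) > 0
print("CFS merit on synthetic data:", cfs_merit(x_demo, y_demo))
```

Because TSNE embeddings change from run to run unless `random_state` is fixed, any merit computed on them is best compared across several seeds.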