import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import os
import pandas as pd
import scipy.stats
import sklearn.datasets
import sklearn.preprocessing
%matplotlib notebook
#plt.xkcd()
None
A data visualization goes a long way in helping you understand the underlying dataset. If the data is high-dimensional, you can use Singular Value Decomposition (SVD) to find a reduced-rank approximation of the data that can be visualized easily.
We start off with the Iris flower dataset. The data is multivariate, with 150 measurements of 4 features (sepal length, sepal width, petal length, and petal width, all in cm) across 3 distinct Iris species. Of the 150 measurements, there are 50 measurements each for Iris setosa, Iris versicolor, and Iris virginica.
Scikit-learn's datasets module includes the Iris dataset, so let's load that up and start exploring.
iris = sklearn.datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
print('Iris dataset has {} rows and {} columns\n'.format(*df_iris.shape))
print('Here are the first 5 rows of the data:\n\n{}\n'.format(df_iris.head(5)))
print('Some simple statistics on the Iris dataset:\n\n{}\n'.format(df_iris.describe()))
As we are exploring the dataset, it would be nice to view the data in order to get an idea of how the 3 species might be distributed with respect to one another in terms of their features. Perhaps we are interested in finding clusters, or maybe we would like to find a way to make class predictions?
However, since the data has 4 dimensions, we would be hard-pressed to come up with a good way to graph the data in 4D that we could easily understand.
But what if we could reduce or compress the data so that we could work in 3 dimensions or less?
Singular value decomposition lets us do just that.
Singular value decomposition factorizes an $\mathbb{R}^{m \times n}$ matrix $X$ into a matrix $U$ whose columns (the left singular vectors) are orthonormal, a diagonal matrix $\Sigma$ of non-negative singular values, and a matrix $V$ whose columns (the right singular vectors) are orthonormal, such that
\begin{align} X &= U \, \Sigma \, V^{\intercal} \end{align}
We can use numpy.linalg.svd
to factorize the Iris data matrix into three components $U$, $\Sigma$, and $V^{\intercal}$.
U_iris, S_iris, Vt_iris = np.linalg.svd(df_iris, full_matrices=False)
The rows of $U$ correspond to the rows of the original data matrix $X$, while its columns are the set of ordered, orthonormal eigenvectors of $X \, X^{\intercal}$.
print('matrix U has {} rows, {} columns\n'.format(*U_iris.shape))
print('here are the first 5 rows.')
print('{}'.format(pd.DataFrame(U_iris).head(5)))
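Since the columns of $U$ are orthonormal, $U^{\intercal} \, U$ should come out to the identity matrix. Here's a quick sanity check on the `U_iris` we just computed:
# with full_matrices=False, U_iris is 150x4 with orthonormal columns, so U^T U = I_4
print('columns of U orthonormal? {}'.format(np.allclose(U_iris.T @ U_iris, np.eye(U_iris.shape[1]))))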
numpy.linalg.svd
actually returns $V^{\intercal}$ instead of $V$, so it is the columns of $V^{\intercal}$ that correspond to the columns of the original data matrix $X$. Hence, the rows of $V^{\intercal}$ are the set of ordered, orthonormal eigenvectors of $X^{\intercal} \, X$.
print('matrix Vt has {} rows, {} columns\n'.format(*Vt_iris.shape))
print('{}'.format(pd.DataFrame(Vt_iris).head()))
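The rows of $V^{\intercal}$ should likewise be orthonormal, and multiplying the three factors back together should reproduce the original measurements. A quick sanity check (the singular values have to be placed back on a diagonal first; see the note below):
# rows of Vt are orthonormal: Vt Vt^T = I_4
print('rows of Vt orthonormal? {}'.format(np.allclose(Vt_iris @ Vt_iris.T, np.eye(Vt_iris.shape[0]))))
# reassembling U * diag(S) * Vt should give back the original Iris measurements
X_rebuilt = U_iris @ np.diag(S_iris) @ Vt_iris
print('U S Vt reconstructs X? {}'.format(np.allclose(X_rebuilt, df_iris.to_numpy())))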
The elements $\sigma_{i}$ of the diagonal matrix $\Sigma$ are the non-zero singular values of matrix $X$, which are just the square roots of the non-zero eigenvalues of $X^{\intercal} \, X$ (and also of $X \, X^{\intercal}$). These singular values can be used to determine how much of the variance in the original data matrix $X$ is explained by a lower-rank approximation $X^{\prime}$ when we reduce the dimensions.
\begin{align} X^{\prime}_{k} &= U_{k} \, \Sigma_{k} \, V^{\intercal}_{k} \\ &\approx X_{r} & \text{ where } \operatorname{rank}(X^{\prime}) = k < \operatorname{rank}(X) = r \end{align}
The amount of variance that the reduced-rank approximation $X^{\prime}_{k}$ explains of $X_{r}$ is
\begin{align} \text{cum. variance explained} &= \frac{\sum_{j=1}^{k} \sigma_{j}^{2}}{\sum_{i=1}^{r} \sigma_{i}^{2}} \end{align}
NOTE: numpy.linalg.svd actually returns $\Sigma$ not as a diagonal matrix, but as a 1-D array of the diagonal entries.
num_sv_iris = np.arange(1, S_iris.size+1)
cum_var_explained_iris = [np.sum(np.square(S_iris[0:n])) / np.sum(np.square(S_iris)) for n in num_sv_iris]
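Let's print these values and, while we're at it, check the claim above that the squared singular values equal the eigenvalues of $X^{\intercal} \, X$ (a minimal check using the variables already defined):
print('cumulative variance explained: {}'.format(np.round(cum_var_explained_iris, 4)))
# squared singular values should equal the eigenvalues of X^T X (sorted descending)
eigvals_XtX = np.sort(np.linalg.eigvalsh(df_iris.to_numpy().T @ df_iris.to_numpy()))[::-1]
print('sigma_i^2 == eigenvalues of X^T X? {}'.format(np.allclose(np.square(S_iris), eigvals_XtX)))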
Let's have a look at the cumulative variance explained visually as a function of the number of singular values used when reducing rank to find a lower-ranked matrix $X^{\prime}$ to approximate $X$. This will inform us as to how many dimensions we should use.
fig = plt.figure(figsize=(7.0,5.5))
ax = fig.add_subplot(111)
plt.plot(num_sv_iris,
         cum_var_explained_iris,
         color='#2171b5',
         label='variance explained',
         alpha=0.65,
         zorder=1000)
plt.scatter(num_sv_iris,
            sklearn.preprocessing.normalize(S_iris.reshape((1,-1))),
            color='#fc4e2a',
            label='singular values (normalized)',
            alpha=0.65,
            zorder=1000)
plt.legend(loc='center right', scatterpoints=1, fontsize=8)
ax.set_xticks(num_sv_iris)
ax.set_xlim(0.8, 4.1)
ax.set_ylim(0.0, 1.1)
ax.set_xlabel(r'Number of singular values used')
ax.set_ylabel('Variance in data explained')
ax.set_title('Iris dataset: cumulative variance explained & singular values',
             fontsize=14,
             y=1.03)
ax.set_facecolor('0.98')
plt.grid(alpha=0.8, zorder=1)
plt.tight_layout()
Judging from the curve representing cumulative variance explained in the figure above, we can see that the first singular value alone already accounts for the vast majority of the variance, and that keeping the first two singular values captures nearly all of it (about 99.8%).
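As a concrete check of the reduced-rank approximation formula, we can build the rank-2 approximation $X^{\prime}_{2}$ explicitly from the factors we already have and see how well it reproduces $X$ (a rough sketch, not needed for the plots below):
# rank-2 approximation: keep only the first 2 singular values / singular vectors
k = 2
X_approx = U_iris[:, :k] @ np.diag(S_iris[:k]) @ Vt_iris[:k, :]
print('rank-2 approximation explains {:.4f} of the variance'.format(
    np.sum(np.square(S_iris[:k])) / np.sum(np.square(S_iris))))
print('max absolute reconstruction error: {:.4f}'.format(np.max(np.abs(X_approx - df_iris.to_numpy()))))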
Since graphing the Iris dataset in 1D wouldn't be all that interesting (just dots on a line segment), let's try using the first 2 left singular vectors (the first 2 columns of $U$) to represent the data on the $x$- and $y$-axes, respectively.
# iris.target gives the actual class labels for the data points;
# we select the row indices corresponding to labels 0, 1, 2 respectively,
# so that we can plot each species in a different color.
idx_setosa = np.where(iris.target==0)[0]
idx_versicolor = np.where(iris.target==1)[0]
idx_virginica = np.where(iris.target==2)[0]
setosa_x = U_iris[idx_setosa, 0]
setosa_y = U_iris[idx_setosa, 1]
print(setosa_x.shape)
versicolor_x = U_iris[idx_versicolor, 0]
versicolor_y = U_iris[idx_versicolor, 1]
virginica_x = U_iris[idx_virginica, 0]
virginica_y = U_iris[idx_virginica, 1]
Let's do a 1-D plot first. Note that I am using the same coordinate (the first column of $U$) for both axes below.
fig = plt.figure(figsize=(7.0,5.5))
ax = fig.add_subplot(111)
plt.scatter(setosa_x,
            setosa_x,
            marker='o',
            color='#66c2a5',
            label='Iris-setosa',
            zorder=1000)
plt.scatter(versicolor_x,
            versicolor_x,
            marker='D',
            color='#fc8d62',
            label='Iris-versicolor',
            zorder=1000)
plt.scatter(virginica_x,
            virginica_x,
            marker='^',
            color='#8da0cb',
            label='Iris-virginica',
            zorder=1000)
plt.legend(loc='upper left', scatterpoints=1, fontsize=8)
ax.set_xlabel(r'first left singular vector $u_{1}$')
ax.set_ylabel(r'first left singular vector $u_{1}$')
ax.set_title('1-D plot of Iris dataset',
             fontsize=14,
             y=1.03)
ax.set_facecolor('0.98')
plt.grid(alpha=0.6, zorder=1)
plt.tight_layout()
Even the first principal direction (the first left singular vector) on its own gives a pretty good representation of the data: the three clusters are quite visible, and there is a clear separation between the first and last clusters.
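To put a rough number on that separation, we can look at the range of the $u_{1}$ coordinate for each species (using the arrays defined above):
for name, xs in [('Iris-setosa', setosa_x), ('Iris-versicolor', versicolor_x), ('Iris-virginica', virginica_x)]:
    print('{:<16s} u1 range: [{:.3f}, {:.3f}]'.format(name, xs.min(), xs.max()))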
Now we represent the three Iris species on a 2D graph.
fig = plt.figure(figsize=(7.0,5.5))
ax = fig.add_subplot(111)
plt.scatter(setosa_x,
            setosa_y,
            marker='o',
            color='#66c2a5',
            label='Iris-setosa',
            zorder=1000)
plt.scatter(versicolor_x,
            versicolor_y,
            marker='D',
            color='#fc8d62',
            label='Iris-versicolor',
            zorder=1000)
plt.scatter(virginica_x,
            virginica_y,
            marker='^',
            color='#8da0cb',
            label='Iris-virginica',
            zorder=1000)
plt.legend(loc='upper left', scatterpoints=1, fontsize=8)
ax.set_xlabel(r'first left singular vector $u_{1}$')
ax.set_ylabel(r'second left singular vector $u_{2}$')
ax.set_title('2D plot of Iris dataset',
             fontsize=14,
             y=1.03)
ax.set_facecolor('0.98')
plt.grid(alpha=0.6, zorder=1)
plt.tight_layout()
There!
Now that we are viewing the originally 4-dimensional data in 2 dimensions, using the first 2 columns of the left singular vector matrix $U$, we can see a very clear separation between the Iris setosa class and the other two. On the other hand, the demarcation between Iris versicolor and Iris virginica is not as clear-cut.
Nevertheless, since this 2D reduced-rank matrix representation $X^{\prime}$ explains nearly 99.8% of the variance in the original dataset, we can be pretty certain that clustering and classification should be possible.
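We can read that number directly off the cumulative variance explained that we computed earlier:
print('variance explained by the first 2 singular values: {:.4f}'.format(cum_var_explained_iris[1]))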
One more thing to mention: we used the first few columns of $U$ (i.e., the left singular vectors) instead of projecting the data points onto the right singular vectors. But remember that the left singular vectors are themselves formed by projecting the data onto $V$ and then rescaling the coordinates: $U = X \, V \, \Sigma^{-1}$. So doing the actual projections would only rescale the coordinates. Let's test this out (and note that the scale of the axes is now different):
# Projecting the data onto the right singular vectors
P_iris = df_iris.to_numpy().dot(Vt_iris.T)
#and this is the same code as above
idx_setosa = np.where(iris.target==0)[0]
idx_versicolor = np.where(iris.target==1)[0]
idx_virginica = np.where(iris.target==2)[0]
setosa_x = P_iris[idx_setosa, 0]
setosa_y = P_iris[idx_setosa, 1]
versicolor_x = P_iris[idx_versicolor, 0]
versicolor_y = P_iris[idx_versicolor, 1]
virginica_x = P_iris[idx_virginica, 0]
virginica_y = P_iris[idx_virginica, 1]
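Before plotting, a quick check that these projections really are just a rescaling of the columns of $U$: since $P = X \, V = U \, \Sigma$, dividing each column of `P_iris` by the corresponding singular value should recover `U_iris`.
# P = X V = U Sigma, so dividing column j by sigma_j should give back U
print('P / sigma == U? {}'.format(np.allclose(P_iris / S_iris, U_iris)))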
fig = plt.figure(figsize=(7.0,5.5))
ax = fig.add_subplot(111)
plt.scatter(setosa_x,
            setosa_y,
            marker='o',
            color='#66c2a5',
            label='Iris-setosa',
            zorder=1000)
plt.scatter(versicolor_x,
            versicolor_y,
            marker='D',
            color='#fc8d62',
            label='Iris-versicolor',
            zorder=1000)
plt.scatter(virginica_x,
            virginica_y,
            marker='^',
            color='#8da0cb',
            label='Iris-virginica',
            zorder=1000)
plt.legend(loc='upper left', scatterpoints=1, fontsize=8)
ax.set_xlabel(r'direction $v_{1}$')
ax.set_ylabel(r'direction $v_{2}$')
ax.set_title('2D plot of Iris dataset',
             fontsize=14,
             y=1.03)
ax.set_facecolor('0.98')
plt.grid(alpha=0.6, zorder=1)
plt.tight_layout()
Principal components analysis (PCA) and singular value decomposition are closely related, and you may often hear both these terms used in the same breath.
Here is a quick mathematical treatment explaining how PCA and SVD are related.
Consider data matrix $X \in \mathbb{R}^{m \times n}$ where $m > n$, and all $x_{ij}$ are centered about the column means. With principal components analysis, we have
\begin{align} \text{covariance matrix } C &= \frac{1}{m} \, X^{\intercal} \, X & \text{from PCA} \\ &= \frac{1}{m} \, V \, \Gamma \, V^{\intercal} & \text{by eigendecomposition of } X^{\intercal} \, X \\ \\ \text{but } X &= U \, \Sigma \, V^{\intercal} & \text{from SVD} \\ \\ \Rightarrow C &= \frac{1}{m} \, V \, \Sigma \, U^{\intercal} \, U \, \Sigma \, V^{\intercal} \\ &= \frac{1}{m} \, V \, \Sigma^{2} \, V^{\intercal} \\ \end{align}
So we see that the principal directions from PCA (the eigenvectors of the covariance matrix $C$) are exactly the right singular vectors $V$ from SVD, and that the eigenvalues of $X^{\intercal} \, X$ (the diagonal of $\Gamma$) are the squared singular values $\sigma_{i}^{2}$. Let's verify this numerically.
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(iris.data)
# don't forget to mean-center the data before SVD
X = iris.data - np.mean(iris.data, axis=0)
U, S, Vt = np.linalg.svd(X)
Cov_pca = pca.get_covariance()
# scikit-learn normalizes its covariance estimate by (m - 1), so rescale by
# (m - 1) to recover the squared singular values of the centered data
print('eigenvalues from PCA:\n{}\n'.format(np.linalg.eigvals(Cov_pca * (X.shape[0] - 1))))
print('squared singular values from SVD:\n{}'.format(np.square(S)))
print('covariance matrix C derived from PCA:\n{}\n'.format(Cov_pca))
# covariance as in the derivation above; note this uses 1/m, whereas scikit-learn
# uses the unbiased 1/(m - 1), so the two matrices differ by a factor of m/(m - 1)
Cov_svd = (1. / X.shape[0]) * Vt.T.dot(np.diag(np.square(S))).dot(Vt)
print('covariance matrix using S and Vt from SVD:\n{}\n'.format(Cov_svd))
# with m = 150 that factor amounts to well under a 1% difference, so a loose
# element-wise tolerance is enough for the comparison
allclose = np.allclose(Cov_pca, Cov_svd, atol=1e-1)
print('Are these matrices equivalent (element-wise closeness comparison)?\n{}'.format(allclose))
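Finally, the principal directions themselves should match the right singular vectors, up to an arbitrary sign flip per component (both decompositions are only determined up to the sign of each vector), so we compare absolute values here:
# rows of pca.components_ and rows of Vt are the same directions, up to sign
print('principal directions match right singular vectors (up to sign)? {}'.format(
    np.allclose(np.abs(pca.components_), np.abs(Vt), atol=1e-6)))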