Unsupervised Learning and PCA
Up until this point, our studies in regression and classification have only looked at supervised learning: where the labels are known. In unsupervised learning, well, we don’t have these labels anymore.
Definition 5 (Unsupervised Learning)
Unsupervised learning is machine learning on unlabelled data: no classes, no y-values. Instead of a human “supervising” the model, the model figures out patterns from data by itself.
Unsupervised learning tries to find some sort of structure in data. The three most common kinds of unsupervised learning:
Clustering: Partitioning data into groups, where each group has similar data points.
Dimensionality Reduction: High-dimensional data actually lies near a low-dimensional subspace or manifold. In other words, a lot of the features in the high-dimensional data are unnecessary.
Density Estimation: fitting a continuous distribution to discrete data. For example, using MLE to fit Gaussians to sample points.
Note the difference between clustering and dimensionality reduction is that clustering groups similar data points together, while dimensionality reduction is more focused on identifying a continuous variation from sample point to sample point.
Let’s talk about dimensionality reduction: specifically, principal component analysis.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a very famous example of dimensionality reduction. You are given $n$ sample points $X_1, \dots, X_n \in \mathbb{R}^d$, and you want to find a $k$-dimensional subspace (with $k < d$) that captures as much of the variation in the data as possible.
Let’s take a look at PCA, visually.

First, we see a bunch of points in 3D on the left. Imagine they are raindrops, frozen in time, hovering over a sheet of paper. Now, after time starts again: they’ll fall to the paper and make splotches corresponding to where they were in the air. In other words, we are reducing our raindrops/data points from 3 dimensions (air) to 2 dimensions (paper). How can we do this? Note the left diagram is showing us how: we pick a lower-dimensional subspace (the sheet of paper) and orthogonally project every point onto it.
So how do we choose such a subspace? We want the subspace such that when we project to the lower dimension, we can still tell the separation between the points. Formally, we want to choose the $k$-dimensional subspace onto which the projected points have the largest possible variance (equivalently, the smallest total squared projection distance).
Definition 6 (Principal Components)
Principal components are the orthogonal directions along which the data varies the most: the unit eigenvectors of the sample covariance matrix (equivalently, of $X^\top X$ for centered data), ordered by decreasing eigenvalue. The first principal component is the direction of greatest variance, the second is the direction of greatest variance orthogonal to the first, and so on.
Why do we do PCA, or dimensionality reduction in general?
Dimensionality reduction is very often used in preprocessing! Reducing the number of dimensions makes some computations much cheaper (faster), and such computations might include regression and classification. So if our original data had 784 dimensions, like MNIST, perhaps we could find a way to map it down to something less.
Additionally, dimensionality reduction can be used as a form of regularization: when we remove irrelevant dimensions, we reduce overfitting. Dimensionality reduction is similar to feature subset selection in this sense. However, there is a key difference: in the latter we simply remove features, while in DR we remove directions. In other words, the new PCA “features” are no longer aligned with the feature axes, since they are now linear combinations of input features.
Finally, we also just might want to find a small basis that represents the variance in our data. We’ll see examples in this note: utilizing DR to detect variation in human faces as well as in human genetics.
Let $X$ be our $n \times d$ design matrix, whose rows are our sample points $X_1^\top, \dots, X_n^\top$.
Let’s start simple: let’s say we have a single unit direction vector $w$, and we want to orthogonally project our sample points onto the one-dimensional subspace (the line) spanned by $w$.
Definition 7 (Orthogonal Projection)
An orthogonal projection of a vector $x$ onto a unit vector $w$ is the closest point to $x$ on the line through $w$: $\tilde{x} = (x \cdot w)\,w$.
Of course, if we project from, say, 100 dimensions to 1 dimension, we’re going to lose a LOT of information. Thankfully, we can project to several dimensions: this just means we must pick several different orthogonal direction vectors. We’re still going to orthogonally project points onto this subspace: just now, our subspace is defined by multiple orthogonal basis vectors instead of just one.
Definition 8 (Orthogonal Projection Formula)
For a $k$-dimensional subspace with orthonormal basis vectors $v_1, \dots, v_k$, the orthogonal projection of a vector $x$ onto that subspace is $\tilde{x} = \sum_{i=1}^{k} (x \cdot v_i)\,v_i$.
So a 3D point being projected onto a 2D space would look like this:

Practically, though, we more often just want the $k$ principal coordinates $x \cdot v_i$ themselves, i.e., the coordinates of $x$ in principal component space, rather than the projected point $\tilde{x}$ expressed in the original $d$ dimensions.
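To make this concrete, here is a tiny NumPy sketch (the point and basis vectors are made up for illustration) that computes both the principal coordinates $x \cdot v_i$ and the projected point $\tilde{x}$:

```python
import numpy as np

# Orthogonally project a 3D point x onto the 2D subspace spanned by two
# orthonormal vectors v1, v2 (all values here are illustrative).
x = np.array([2.0, 1.0, 3.0])
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
V = np.column_stack([v1, v2])       # 3x2 matrix whose columns are the basis vectors

principal_coords = V.T @ x          # the k = 2 principal coordinates x . v_i
projection = V @ principal_coords   # the projected point, back in the original 3D space

print(principal_coords)             # [2. 1.]
print(projection)                   # [2. 1. 0.]
```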
Take our (centered) design matrix $X$ and compute the unit eigenvectors $v_1, \dots, v_d$ and eigenvalues $\lambda_1, \dots, \lambda_d$ of the symmetric matrix $X^\top X$. Now let’s sort the eigenvalues so that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0$, and let $v_1, \dots, v_d$ be the correspondingly ordered unit eigenvectors: these are the principal component directions.
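In code, this step might look like the following NumPy sketch (the data is random, standing in for a real design matrix):

```python
import numpy as np

# Eigendecomposition of X^T X for a centered design matrix X.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                      # center each feature

eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigh: X^T X is symmetric PSD
order = np.argsort(eigvals)[::-1]           # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# eigvecs[:, :k] now spans the best k-dimensional subspace
```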
Why are the eigenvectors of $X^\top X$ the right directions to project onto?
One way to see this is by using MLE to fit a Gaussian to our data.
For example, say we have fit this Gaussian (in blue) to the sample points (red X’s):

Note this Gaussian’s isocontours will be concentric ovals with the same shape as the border shown. Remember that we take the eigenvectors (the ellipse axes shown) and eigenvalues (the magnitudes of the vectors) of the sample covariance matrix $\hat{\Sigma} = \frac{1}{n} X^\top X$ (for centered $X$): the eigenvectors with the largest eigenvalues point along the directions in which the Gaussian, and hence the data, is most spread out, and those are exactly the directions PCA keeps.
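Here is a small NumPy sketch of that connection (the data is made up): the MLE covariance estimate is $\hat{\Sigma} = \frac{1}{n} X_c^\top X_c$, so its eigenvectors are the same as those of $X^\top X$.

```python
import numpy as np

# Fit a Gaussian by MLE: sample mean + sample covariance Sigma_hat = (1/n) X_c^T X_c.
# Its eigenvectors (the ellipse axes) match the eigenvectors of X^T X used by PCA.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 1.0]], size=500)
n = X.shape[0]
X_c = X - X.mean(axis=0)
Sigma_hat = (X_c.T @ X_c) / n               # MLE covariance estimate
eigvals, axes = np.linalg.eigh(Sigma_hat)   # columns of 'axes' point along the ellipse axes
# larger eigenvalues = more spread (variance) along that axis
```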
Now, let’s sketch out the actual PCA algorithm.
Algorithm 5 (PCA)

Inputs: $n$ sample points $X_1, \dots, X_n \in \mathbb{R}^d$ (the rows of the design matrix $X$) and a target dimension $k$.

Output: the principal coordinates of each training point in $k$-dimensional principal component space.

1. Center $X$: subtract the mean $\mu = \frac{1}{n} \sum_{i=1}^{n} X_i$ from every sample point. Sometimes, we also want to normalize $X$ (rescale each feature to unit variance), but only when different features have different measurement units.
2. Compute unit eigenvectors $v_1, \dots, v_d$ and eigenvalues $\lambda_1, \dots, \lambda_d$ of $X^\top X$. From these, we can usually choose $k$ based on eigenvalue sizes.
3. Pick the best $k$ eigenvectors as the ones with the largest eigenvalues: these span our $k$-dimensional subspace $V$.
4. Compute the principal coordinates $X_i \cdot v_j$ for each point $X_i$ in the training data and each subspace eigenvector $v_j$. These give us the coordinates of each $X_i$ in principal component space.
One thing to remember: we centered the training data beforehand. We need to apply the same centering (subtracting the training mean) to the test data. However, there is an alternative: we can un-center the training data before we project the points onto PC space.
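Here is a minimal NumPy sketch of the whole procedure, including centering test data with the training mean; the function and variable names are my own, not part of the algorithm above.

```python
import numpy as np

# A sketch of Algorithm 5: center, eigendecompose X^T X, keep the top-k
# eigenvectors, and return the principal coordinates.
def pca(X, k):
    """Return principal coordinates of the rows of X, plus basis, mean, eigenvalues."""
    mu = X.mean(axis=0)
    X_c = X - mu                                      # step 1: center
    eigvals, eigvecs = np.linalg.eigh(X_c.T @ X_c)    # step 2: eigenvectors of X^T X
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V = eigvecs[:, :k]                                # step 3: top-k eigenvectors
    return X_c @ V, V, mu, eigvals                    # step 4: principal coordinates

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))                  # stand-in for real training data
coords, V, mu, eigvals = pca(X_train, k=3)

X_test = rng.normal(size=(50, 10))                    # stand-in for test data
test_coords = (X_test - mu) @ V                       # center with the *training* mean
```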
Let’s see an example that shows the difference normalizing data could make in PCA:
So in this example, our features are measured in very different units and have very different variances, so whether or not we scale them changes which directions PCA picks as the principal components.
When do we choose whether to scale or not? It's totally application-dependent. Should low-frequency events like murder and rape have a disproportionate (bigger) influence on the PCA axes? If yes, then you probably want to scale.
Of course, with more eigenvectors and eigenvalues (more dimensions), we capture more of the variance of the original dataset. We can calculate the fraction of variance we capture by dividing the sum of the eigenvalues we keep by the sum of all the eigenvalues of $X^\top X$:

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}.$$
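For instance (with made-up, already-sorted eigenvalues):

```python
import numpy as np

# Fraction of variance captured by the k largest eigenvalues.
eigvals = np.array([4.2, 2.1, 0.9, 0.5, 0.3])   # illustrative, sorted in decreasing order
k = 2
captured = eigvals[:k].sum() / eigvals.sum()
print(f"{captured:.2f}")   # about 0.79 of the variance is captured
```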
Note
If using PCA for preprocessing in supervised learning, then like always, using (cross-)validation is a better way to choose $k$.
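One possible way to set this up, sketched with scikit-learn (the dataset and the candidate values of $k$ are just placeholders):

```python
# Choose k by cross-validation when PCA is a preprocessing step for a classifier.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"pca__n_components": [5, 10, 20, 40]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)   # the k with the best cross-validated accuracy
```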
We can also think of PCA as finding the optimal directions that keep most of our sample variance after we project our data down. In other words, after projection, we want to keep our points as spread out as possible, just as they were in the original high-dimensional space.

So here, we are projecting the white points in 2D onto 1D (the green line) as blue points. We want to maximize the sample variance of those projected blue points, so we need to choose an orientation of the green line (choose our direction vector $w$) that does this. Mathematically, our goal can be represented as:

$$\max_{w \neq 0} \; \frac{1}{n} \sum_{i=1}^{n} \left( X_i \cdot \frac{w}{\|w\|} \right)^2$$

We can simplify this down further as:

$$\max_{w \neq 0} \; \frac{1}{n} \, \frac{w^\top X^\top X\, w}{w^\top w}$$

The fraction $\frac{w^\top X^\top X\, w}{w^\top w}$ is known as the Rayleigh quotient of $X^\top X$ and $w$.
First, note that if $w$ is a unit eigenvector $v_i$ of $X^\top X$, then the Rayleigh quotient equals the corresponding eigenvalue $\lambda_i$; so among the eigenvectors, the one with the largest eigenvalue achieves the most variance, and in fact it beats every other direction too.
Typically we want $k > 1$ directions, and it turns out the best choice is the $k$ eigenvectors with the $k$ largest eigenvalues, which is exactly what the PCA algorithm picks.
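A quick numerical sanity check of the Rayleigh quotient claims (random data, and the helper function is my own):

```python
import numpy as np

# Check that the Rayleigh quotient w^T (X^T X) w / (w^T w) equals lambda_i at the
# i-th eigenvector and is maximized by the eigenvector with the largest eigenvalue.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)
A = X.T @ X

def rayleigh(w):
    return (w @ A @ w) / (w @ w)

eigvals, eigvecs = np.linalg.eigh(A)                       # ascending eigenvalues
print(np.isclose(rayleigh(eigvecs[:, -1]), eigvals[-1]))   # True: quotient = eigenvalue
w = rng.normal(size=4)                                     # a random direction
print(rayleigh(w) <= eigvals[-1] + 1e-9)                   # True: top eigenvalue is the max
```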
Yet another way to think about PCA is finding the subspace that minimizes the total squared projection distance: the sum, over all sample points, of the squared distance from the point to its projection.

In both methods, though, we are still minimizing the sum of the squares of the projection distances for each point, and this is equivalent to maximizing the variance of the projected points: each point's squared norm is fixed, and by the Pythagorean theorem it splits into the squared length of the projection plus the squared projection distance, so shrinking one term grows the other.
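A quick numerical check of that equivalence (random data, arbitrary unit direction):

```python
import numpy as np

# For any unit direction w: squared projection length + squared projection distance
# = squared norm of the point, so the total is fixed and minimizing projection
# distances is the same as maximizing the variance of the projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)
w = rng.normal(size=3)
w /= np.linalg.norm(w)

proj_len_sq = (X @ w) ** 2                                     # squared projection lengths
dist_sq = np.linalg.norm(X - np.outer(X @ w, w), axis=1) ** 2  # squared projection distances
print(np.allclose(proj_len_sq + dist_sq, np.linalg.norm(X, axis=1) ** 2))  # True
```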