Analysing Handwritten Digits Using Principal Component Analysis

I do quite a few projects that produce a few cool graphs but no interesting conclusions or discoveries. Instead of leaving them to rot in my ‘odd_projects’ folder, I thought I’d start publishing some short posts outlining what I did (really just an outline; I probably won’t go very deep into the theory) and sharing the graphs. So here goes:

What I have: the MNIST database, a collection of 70,000 handwritten digit images, each labelled with the digit it is meant to represent

What I’m going to do with it: use principal component analysis to compare the relative difficulty of classifying different handwritten digits

Intro

We’ve got a bunch of digits like this in the MNIST database:

[Image: an example handwritten 4 from MNIST]

These are drawn by many different people, each with a different concept of what a 4 is: it could have a connected upper part like the 4 in this font does, or it could look more like the one above. In any case, there are plenty of different ways to write a 4, and we want a computer to be able to read a digit and classify which digit it is.

One problem with this is that when we read in a digit from MNIST, it is a 28×28 grid of pixels, which corresponds to 784 variables, all of which can take any value from 0 to 255 (denoting how dark the pixel is). Trying to classify directly in this form would be difficult, as the training examples (even 60,000 of them) are spread so sparsely over that space that any classifying power you had on new digits would be based almost entirely on luck.
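For concreteness, here is a minimal sketch of one way to load the data and get it into that 784-variable form, using scikit-learn’s OpenML interface (the post doesn’t say how the data was actually loaded, so the library choice and variable names here are mine):

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Each row of X is one digit, flattened to 784 pixel intensities in [0, 255]
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data.astype(np.float64)   # shape (70000, 784)
y = mnist.target.astype(int)        # labels 0-9
```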

I used principal component analysis to reduce the dimensionality of the problem. PCA starts from the covariance matrix, which describes how each pair of pixels tends to vary together, and finds the eigenvalues and eigenvectors of this matrix. It then ranks the eigenvectors by their corresponding eigenvalues to find the directions (each one a 784-pixel array) along which the dataset varies the most. Since the covariance matrix is symmetric, its eigenvectors are orthogonal and can be normalised, which means we can build an orthonormal basis out of them.
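As a sketch of what that step looks like in practice (assuming NumPy; the function name top_eigendigits and its arguments are my own, not from the original analysis):

```python
import numpy as np

def top_eigendigits(X, n_components=2):
    """Return the n_components eigenvectors of the pixel covariance matrix
    with the largest eigenvalues, as columns of a (784, n_components) array."""
    X_centred = X - X.mean(axis=0)            # centre so the covariance is meaningful
    cov = np.cov(X_centred, rowvar=False)     # 784 x 784: how pairs of pixels vary together
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh, since the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    return eigvecs[:, order]
```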

By picking the top n vectors, we can try to approximately recreate our pixel images as

\vec{x_i} \approx \sum_{j=1}^{n} \alpha_{ij} \vec{y_j}

where the \vec{y_j} are the basis vectors chosen by PCA and \vec{x_i} is the input digit we’re trying to classify. Since the basis vectors are orthonormal, we can find the coefficients as \alpha_{ij}=\vec{y_j}\cdot\vec{x_i}. These alphas are our new variables, which we can use to classify the digits.
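Because the basis vectors are orthonormal, computing the alphas is just a set of dot products. A quick sketch, reusing X and top_eigendigits from the snippets above (note that PCA conventionally centres the data first, which the prose glosses over):

```python
# Project each (centred) image onto the orthonormal basis: alpha_ij = y_j . x_i
Y = top_eigendigits(X, n_components=2)   # columns are the basis vectors y_j
alphas = (X - X.mean(axis=0)) @ Y        # shape (70000, 2): two new variables per digit
```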

7s and 4s

For the purpose of illustration, I’m taking only the digits 7 and 4 and trying to distinguish between them. The first two eigenvectors look like so:

[Images: the first two eigenvectors for the 4s and 7s]

If you squint hard enough, you can see that in these vectors the blue regions (negative weights) form shapes consistent with the number 7, while the red regions (positive weights) are more consistent with a 4. What that means is that if we take the two alphas for each digit and plot them as a point on a scatter graph, we should see the 7s in the bottom left-hand corner (where both components are negative) and the 4s in the top right. Plotting them, we see exactly that:

[Image: scatter plot of the first two components for the 4s and 7s]

You can see that even with only two variables it is nearly possible to distinguish between a 7 and a 4.
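Something like the following would reproduce that kind of plot (matplotlib and the colour choices are my own assumptions; it reuses X, y and top_eigendigits from the snippets above):

```python
import matplotlib.pyplot as plt

# Restrict to 4s and 7s, redo the PCA on just those digits,
# and plot each digit's two alphas as a point coloured by its label.
mask = (y == 4) | (y == 7)
X_47, y_47 = X[mask], y[mask]

basis = top_eigendigits(X_47, n_components=2)
alphas = (X_47 - X_47.mean(axis=0)) @ basis

for digit, colour in [(7, "tab:blue"), (4, "tab:red")]:
    sel = y_47 == digit
    plt.scatter(alphas[sel, 0], alphas[sel, 1], s=2, alpha=0.3, color=colour, label=str(digit))
plt.xlabel("first component")
plt.ylabel("second component")
plt.legend()
plt.show()
```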

5s and 6s

Suppose we tried the same analysis on two digits that look similar, 5s and 6s. We would expect PCA not to distinguish between them as well. Let’s see.

First up are our vectors:

[Images: the first two eigenvectors for the 5s and 6s]

We see a similar thing here: dominant features of the 5 and of the 6 are visible. However, the left image shows a point of potential confusion: the 5 it draws out could easily be turned into a 6 just by joining the leftmost lines up. This means it would be easy to write a 6 that follows the shape of the 5.

This confusion shows up in the scatterplot:

[Image: scatter plot of the first two components for the 5s and 6s]

While some of the 5s and 6s are separated, there are many cases where there is no clear distinction between the two, and that will lead to many misclassifications.

As I said at the start, no conclusions or discoveries allowed, so here’s where we end!