1. MNIST database Introduction
The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems (Wikipedia). Each image in the MNIST database contains 28 × 28 = 784 pixels, and each pixel takes an integer value between 0 and 255.
The MNIST database contains 60,000 training images and 10,000 testing images. In this coursework, we carry out PCA on the test set.
Load and preview the data set
load("mnistTest.rda")  # path to the .rda file

## Image preview (ref: coursework instructions)
par(mfrow = c(2, 5))
for (k in 1:10) {  # first 10 images
  m = matrix(mnistTest$x[k, ], nrow = 28, byrow = TRUE)
  image(t(apply(m, 2, rev)), col = grey(seq(1, 0, length = 256)), axes = FALSE)
}

mnist6 = mnistTest$x[mnistTest$y == 6, ]  # select just the 6s
par(mfrow = c(2, 5))
for (k in 1:10) {  # first 10 images of 6
  m = matrix(mnist6[k, ], nrow = 28, byrow = TRUE)
  image(t(apply(m, 2, rev)), col = grey(seq(1, 0, length = 256)), axes = FALSE)
}
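The `t(apply(m, 2, rev))` step is there because `image()` draws the rows of a matrix along the x-axis and the columns along the y-axis, with y increasing upwards, so a matrix passed as-is appears rotated. A minimal sketch on a made-up 2 x 3 matrix shows what the transform does; the values and the helper name `to_image_orientation` are illustrative and not part of the coursework data.

```r
# Toy "image": 2 rows, 3 columns, filled row by row like the MNIST rows.
m <- matrix(1:6, nrow = 2, byrow = TRUE)
# Printed, m reads:
#   1 2 3
#   4 5 6

# Hypothetical helper: reverse each column (vertical flip), then transpose.
to_image_orientation <- function(m) t(apply(m, 2, rev))

r <- to_image_orientation(m)
print(r)

# image() puts r[i, j] at x = i, y = j. The second column of r holds the
# original top row, so it is drawn at the top of the plot.
print(r[, 2])
```

Passing `r` to `image()` therefore shows the matrix in the same orientation as it is printed.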
2. PCA Introduction
Principal component analysis (PCA) is a matrix-based technique for analysing datasets. It is a popular technique for analysing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of the data while preserving the maximum amount of information (Wikipedia).
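In mathematical language, PCA tries to find the “principal components” $y = (y_1, \dots, y_p)^T$, which are related to the original variables $x = (x_1, \dots, x_p)^T$ as follows (standard notation):

$$ y = A^T x $$

where the $p \times p$ matrix $A$ satisfies:

$$ A^T A = I, \qquad A^T \Sigma A = \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p), \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0, $$

with $\Sigma$ the covariance matrix of $x$.

The total variation is $\sum_{j=1}^{p} \lambda_j = \mathrm{tr}(\Sigma)$. With principal components, the fraction $\lambda_i / \sum_{j=1}^{p} \lambda_j$ can be interpreted as the proportion of the total variation contributed by component $y_i$. We can therefore discard the low-ranking components of $y$, which have low variation (and so carry less information), leading to a reduction in dimension.

For a column-centred data matrix $X$ with $n$ rows (observations) and $p$ columns (variables), the equation can be written as:

$$ Y = X A $$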
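As a numerical check of this formulation, here is a minimal sketch in R that builds the components by eigen-decomposing a sample covariance matrix and compares the result with `prcomp()`. The data are synthetic, and the names `X_toy`, `A`, `lambda` and `Y` are chosen for illustration only.

```r
set.seed(1)

# Synthetic stand-in for a data matrix: 200 observations, 3 correlated variables.
n <- 200
z <- matrix(rnorm(n * 3), ncol = 3)
X_toy <- z %*% matrix(c(2, 1, 0,
                        0, 1, 0.5,
                        0, 0, 0.2), nrow = 3, byrow = TRUE)

# Centre the columns, then eigen-decompose the sample covariance matrix.
Xc  <- scale(X_toy, center = TRUE, scale = FALSE)
eig <- eigen(cov(Xc), symmetric = TRUE)
A      <- eig$vectors   # columns are orthonormal directions
lambda <- eig$values    # component variances, in decreasing order

# Scores: project the centred data onto the directions.
Y <- Xc %*% A

# Compare with prcomp(): variances should agree, directions agree up to sign.
p <- prcomp(X_toy)
print(all.equal(lambda, p$sdev^2))
print(all.equal(abs(A), abs(unname(p$rotation))))

# The total variation is preserved by the rotation.
print(all.equal(sum(lambda), sum(diag(cov(X_toy)))))
```

The sign of each direction is arbitrary, which is why the comparison uses absolute values.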
3. PCA Analysis
The following analysis uses the “ggplot2” and “factoextra” libraries, so we load them first.
library("ggplot2")
library("factoextra")
Compute the 784 principal components from the 784 original pixel variables
X = mnistTest$x  # assumption: X is the full test-set pixel matrix (10,000 x 784)
res.pca = prcomp(X)
Compute and plot the proportion of variation attributed to each principal component.
eig.val = get_eigenvalue(res.pca)
fviz_eig(res.pca, addlabels = TRUE)
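The same proportions can also be computed directly from the `prcomp()` output, without factoextra. The sketch below uses a small synthetic matrix (`X_toy`, not the MNIST data) and also finds how many components are needed to reach 90% of the total variation; the 90% threshold is an arbitrary choice for illustration.

```r
set.seed(2)

# Synthetic stand-in: 100 observations of 5 variables with decreasing spread.
X_toy <- sapply(c(5, 3, 1, 0.5, 0.1), function(s) rnorm(100, sd = s))

p <- prcomp(X_toy)

var_pc   <- p$sdev^2               # variance of each principal component
prop_var <- var_pc / sum(var_pc)   # proportion of the total variation
cum_var  <- cumsum(prop_var)       # cumulative proportion

# Smallest number of components whose cumulative proportion reaches 90%.
k90 <- which(cum_var >= 0.90)[1]

print(round(100 * prop_var, 1))
print(k90)
```

On the MNIST fit, the same lines would apply with `res.pca` in place of `p`.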
Scatter plot of the first two principal components. (Use the known labels to colourise the scatter plot.)
fviz_pca_ind(res.pca,
             mean.point = FALSE,
             label = "none",
             geom.ind = "point",
             habillage = mnistTest$y
             # addEllipses = TRUE  # add elliptical boundary lines
             )
Interpretation and discussion
From Fig.3 we can observe that even the first two principal components account for only a low proportion of the total variation. This is because PCA is essentially a linear transformation (as the equations in Section 2 show) that uses only the first- and second-order moments of the data and ignores the higher-order moments. The contributions of each variable to PC1 and PC2 are shown below.
fviz_pca_var(res.pca,
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")
             )
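Under one common definition, the contribution of a variable to a component is its squared loading expressed as a percentage of that component. The sketch below computes this by hand on synthetic data. That it matches the exact values `fviz_pca_var` colours by is an assumption here, and the variable names `pixel1` to `pixel4` are hypothetical.

```r
set.seed(3)

# Synthetic stand-in: 150 observations of 4 variables.
X_toy <- sapply(c(4, 2, 1, 0.5), function(s) rnorm(150, sd = s))
colnames(X_toy) <- paste0("pixel", 1:4)   # hypothetical variable names

p <- prcomp(X_toy)

# Each column of p$rotation has unit length, so its squared entries sum to 1.
# Scaling by 100 gives the percentage contribution of each variable to each PC.
contrib <- 100 * p$rotation^2

print(round(contrib[, 1:2], 1))
print(colSums(contrib))
```

Each column of `contrib` sums to 100, so the entries can be read as shares of a component.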
PCA looks for “correlations” among the original variables and projects them onto principal components. However, these “correlations” are only linear correlations, which cannot fully exploit the information contained in the data.
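A small illustration of this limitation, on made-up data: below, `y` is completely determined by `x`, yet their linear correlation is essentially zero, so PCA has nothing to combine and returns the original axes.

```r
# x is symmetric around 0 and y is an exact (nonlinear) function of x.
x <- seq(-1, 1, length.out = 201)
y <- x^2

r <- cor(x, y)   # Pearson (linear) correlation
print(round(r, 10))

# PCA on the pair: with no linear correlation to exploit, the directions
# stay aligned with the original axes (up to sign and rounding error).
p <- prcomp(cbind(x, y))
print(round(abs(p$rotation), 3))
```

The dependence between `x` and `y` lives entirely in higher-order structure, which a covariance-based method does not see.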
From Fig.5, we can observe that the different categories are not clearly separated by PC1 and PC2. This is partly because PCA captures only linear correlations, as noted above. In addition, PCA sorts the principal components by their proportion of the total variation. The assumption is that the larger the proportion, the more information the principal component contains. However, a principal component with a low proportion may also carry rich information.
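The last point can be illustrated with a minimal synthetic sketch (two made-up classes, not the MNIST digits): one variable has a large spread but is unrelated to the class, while the other has a small spread but separates the classes. The `separation` helper is a hypothetical measure chosen for this example.

```r
set.seed(4)
n <- 200
label <- rep(c(0, 1), each = n)

# v1: large spread, unrelated to the class.
# v2: small spread, but its mean differs between the two classes.
v1 <- rnorm(2 * n, sd = 10)
v2 <- rnorm(2 * n, sd = 0.5) + 2 * label

p <- prcomp(cbind(v1, v2))
prop_var <- p$sdev^2 / sum(p$sdev^2)

# Hypothetical separation measure: gap between the class means divided by
# the pooled within-class standard deviation.
separation <- function(score, label) {
  m <- tapply(score, label, mean)
  s <- sqrt(mean(tapply(score, label, var)))
  unname(abs(m[2] - m[1]) / s)
}
sep <- apply(p$x, 2, separation, label = label)

print(round(prop_var, 3))   # proportion of variation per component
print(round(sep, 2))        # class separation per component
```

Here the first component takes almost all of the variation, while the class information sits in the second, so ranking by variation alone would discard the useful direction.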
4. References
- Wikipedia: