
MLII-3

PCA

Basic Idea
We are given a data matrix $X \in \mathbb{R}^{n \times p}$, meaning that we have $n$ observations (rows) and $p$ features (columns). We assume that the columns of $X$ have been centered.
The first principal component direction of $X$ is the unit vector $v_1 \in \mathbb{R}^p$ that maximizes the sample variance of $Xv_1$ when compared to all other unit vectors. Since the columns of $X$ are already centered, for any $v \in \mathbb{R}^p$ the vector $Xv$ has sample mean zero and sample variance $\frac{1}{n}\|Xv\|_2^2 = \frac{1}{n}\, v^\top X^\top X v$. Therefore, the first principal component direction is
$$v_1 = \underset{\|v\|_2 = 1}{\arg\max}\; \frac{1}{n}\, v^\top X^\top X v.$$
Given the first $k-1$ principal component directions $v_1, \dots, v_{k-1}$ (which are orthonormal), we define the $k$-th principal component direction to be
$$v_k = \underset{\|v\|_2 = 1,\; v \perp v_1, \dots, v_{k-1}}{\arg\max}\; \frac{1}{n}\, v^\top X^\top X v.$$
The vector $z_k = X v_k$ is called the $k$-th principal component score of $X$ (the projection of $X$ along the $k$-th principal vector), and $u_k = z_k / d_k$ is the normalized $k$-th principal component score. Here $d_k = \|X v_k\|_2$, and $d_k^2 / n$ is the amount of variance explained by $v_k$.
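To make the variance-maximization definition concrete, here is a minimal numpy sketch on made-up data: the top eigenvector of the sample covariance gives at least as much sample variance as arbitrary unit vectors (the toy data and variable names are illustrative, not from the notes).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # toy data with correlated features
X = X - X.mean(axis=0)                                  # center the columns

S = (X.T @ X) / n                      # sample covariance of the centered data
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
v1 = eigvecs[:, -1]                    # first principal component direction

var_v1 = np.mean((X @ v1) ** 2)        # sample variance along v1
for _ in range(3):
    v = rng.normal(size=p)
    v /= np.linalg.norm(v)             # random unit vector for comparison
    assert var_v1 >= np.mean((X @ v) ** 2)
print("variance along v1:", var_v1)
```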
Matrix Approximation (dimension reduction)
Consider approximating $X$ by $\hat{X}_k$, the projection of $X$ onto the first $k$ principal component directions. Given centered $X$, if $V_k \in \mathbb{R}^{p \times k}$ is the matrix whose columns contain the first $k$ principal component directions of $X$, then
$$\hat{X}_k = X V_k V_k^\top.$$
That is, $\hat{X}_k$ is the best rank-$k$ approximation to $X$, and the reconstruction error is
$$\|X - \hat{X}_k\|_F^2 = \sum_{j = k+1}^{p} d_j^2,$$
where $d_j = \|X v_j\|_2$ as above.
Note that, due to the definition of the principal vectors, PCA finds the best rank-$k$ approximation of $X$ in a least-squares sense, by projecting $X$ onto its top $k$ right singular vectors.
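A small numpy sketch (toy data of my own choosing) illustrating the rank-$k$ projection and checking that the squared reconstruction error equals the sum of the trailing squared singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
X = X - X.mean(axis=0)                 # centered toy data

U, d, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Vk = Vt[:k].T                          # p x k matrix of the first k PC directions
X_hat = X @ Vk @ Vk.T                  # projection onto the first k directions

# squared reconstruction error vs. sum of the trailing squared singular values
print(np.sum((X - X_hat) ** 2), np.sum(d[k:] ** 2))
```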
Computing Principal Components
First, center the columns of $X$.
Eigenvalue Decomposition Method:
Write $\frac{1}{n} X^\top X = V \Lambda V^\top$, where the columns of $V$ are the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues (sorted in decreasing order). Then:
  • the columns of $V$, i.e. $v_1, \dots, v_p$, are the principal component directions
  • the diagonal elements $\lambda_1 \ge \dots \ge \lambda_p$ are the amounts of variance explained
  • the $k$-th principal component score is computed as $z_k = X v_k$
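A minimal numpy sketch of this route (the function name pca_eig and its return convention are my own):

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the sample covariance; X is assumed column-centered."""
    n = X.shape[0]
    S = (X.T @ X) / n
    eigvals, V = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]    # reorder so the largest comes first
    eigvals, V = eigvals[order], V[:, order]
    Z = X @ V                            # principal component scores
    return V, eigvals, Z                 # directions, variances explained, scores
```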
SVD Method:
The singular value decomposition (SVD) of $X$ is
$$X = U D V^\top,$$
where $D \in \mathbb{R}^{p \times p}$ is diagonal with $d_1 \ge d_2 \ge \dots \ge d_p \ge 0$, and $U \in \mathbb{R}^{n \times p}$ and $V \in \mathbb{R}^{p \times p}$ both have orthonormal columns. After the decomposition:
  • the columns of $V$, i.e. $v_1, \dots, v_p$, are the principal component directions
    • note that $X^\top X = V D^2 V^\top$, thus the $v_k$ are eigenvectors of $X^\top X$
  • the columns of $U$, i.e. $u_1, \dots, u_p$, are the normalized principal component scores
    • note that $X V = U D$, thus $z_k = X v_k = d_k u_k$
  • squaring the $k$-th diagonal element of $D$ and dividing by $n$, i.e. $d_k^2 / n$, gives the variance explained by $v_k$.
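A minimal numpy sketch of the SVD route, with a consistency check against the eigendecomposition method (the function pca_svd and the toy data are my own):

```python
import numpy as np

def pca_svd(X):
    """PCA via the SVD; X is assumed column-centered."""
    n = X.shape[0]
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T                    # principal component directions
    var_explained = d ** 2 / n  # variance explained by each direction
    Z = U * d                   # X @ V = U D, the principal component scores
    return V, var_explained, U, Z

# consistency check against the eigendecomposition of (1/n) X^T X
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)
V, var_explained, U, Z = pca_svd(X)
eigvals = np.linalg.eigvalsh((X.T @ X) / X.shape[0])[::-1]
print(np.allclose(var_explained, eigvals))   # True: the d_k^2 / n equal the eigenvalues
```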

NMF

One interpretation of the SVD is that it decomposes $X$ into a product of two matrices
$$X = W H,$$
with $W = U D \in \mathbb{R}^{n \times p}$, $H = V^\top \in \mathbb{R}^{p \times p}$, and
  • if $n \ge p$ and $X$ is full column rank, then $D$ is invertible and both $W$ and $H$ have full rank $p$
  • we can truncate the SVD, keeping just the first $k$ terms ($W_k = U_k D_k$, $H_k = V_k^\top$), to get a simplified rank-$k$ version of $X$: $X \approx W_k H_k$
Non-negative Matrix Factorization (NMF) decomposes a matrix $X \ge 0$ (element-wise) in the same two-matrix form, but forcing $W \ge 0$ and $H \ge 0$, i.e.
$$X \approx W H, \qquad W \in \mathbb{R}^{n \times k}_{\ge 0}, \quad H \in \mathbb{R}^{k \times p}_{\ge 0}.$$
The non-negativity constraint automatically leads to sparsity in $W$ and $H$. Note that while PCA finds orthonormal basis vectors, i.e. the components are constrained to be orthogonal, NMF enforces non-negativity, i.e. all elements of $W$ and $H$ are non-negative but the components are not necessarily orthogonal.
NMF can be used as a kind of soft clustering: it assigns each observation to only a few clusters (softly), with $W_{ik}$ specifying how much of piece (component) $k$ makes up observation $i$.
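A minimal sketch of one standard fitting procedure, the Lee-Seung multiplicative updates for the Frobenius loss (the helper nmf, the toy data, and the iteration count are my own choices; in practice a library routine such as sklearn.decomposition.NMF would normally be used):

```python
import numpy as np

def nmf(X, k, n_iter=500, eps=1e-9, seed=0):
    """Minimal NMF via Lee-Seung multiplicative updates for the Frobenius loss."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.uniform(size=(n, k))
    H = rng.uniform(size=(k, p))
    for _ in range(n_iter):
        # each update multiplies by a non-negative ratio, so W and H stay non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy usage: X itself must be non-negative
X = np.abs(np.random.default_rng(3).normal(size=(30, 8)))
W, H = nmf(X, k=3)
print(np.linalg.norm(X - W @ H))
```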

Mixture Model

Suppose that we want to describe a distribution $p(x)$ as a (weighted) sum of $K$ other known distributions,
$$p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x \mid \theta_k),$$
where $\pi_k \ge 0$, $\sum_{k=1}^{K} \pi_k = 1$, and each component $p_k$ may have different parameters $\theta_k$ that must be estimated.
The log-likelihood of a mixture model is the log of the sum of the likelihoods of the mixture components. If we have data $x_1, \dots, x_n$, then
$$\ell(\theta) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \, p_k(x_i \mid \theta_k) \right).$$
In general, this maximization has no analytical solution, so we need numerical methods, e.g., the Expectation-Maximization (EM) algorithm.
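For concreteness, a minimal numpy/scipy sketch of this log-likelihood for a one-dimensional Gaussian mixture (choosing Gaussian components, and the helper name gmm_loglik, are my own illustration, not part of the notes):

```python
import numpy as np
from scipy.stats import norm

def gmm_loglik(x, pis, mus, sigmas):
    """Mixture log-likelihood: sum_i log( sum_k pi_k * N(x_i | mu_k, sigma_k) )."""
    # dens[i, k] = pi_k * N(x_i | mu_k, sigma_k)
    dens = np.stack([pi * norm.pdf(x, mu, sig)
                     for pi, mu, sig in zip(pis, mus, sigmas)], axis=1)
    return np.sum(np.log(dens.sum(axis=1)))
```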
Suppose we know the latent indicators $z_{ik} \in \{0, 1\}$, where $z_{ik} = 1$ if data point $x_i$ was drawn from mixture component $k$. Then we have the complete-data log-likelihood
$$\ell_c(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left( \log \pi_k + \log p_k(x_i \mid \theta_k) \right),$$
which can be maximized analytically for many choices of component distributions.
But we actually do not know the $z_{ik}$.
(E-step) Instead, we can calculate the expected values $\mathbb{E}[z_{ik}]$ under the current parameter estimates.
The probabilities $\gamma_{ik} = \mathbb{E}[z_{ik}] = P(z_{ik} = 1 \mid x_i)$ are called the responsibilities, which are calculated by
$$\gamma_{ik} = \frac{\pi_k \, p_k(x_i \mid \theta_k)}{\sum_{j=1}^{K} \pi_j \, p_j(x_i \mid \theta_j)}.$$
(M-step) Once we have calculated the responsibilities $\gamma_{ik}$, plug them in and maximize the expected complete-data log-likelihood analytically:
$$Q(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \left( \log \pi_k + \log p_k(x_i \mid \theta_k) \right).$$
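As a concrete instance, a minimal sketch of both steps for a one-dimensional Gaussian mixture, where the maximizer of $Q$ is available in closed form (the Gaussian choice and the helper names e_step / m_step are mine):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pis, mus, sigmas):
    """Responsibilities gamma[i, k] = pi_k N(x_i|mu_k,sigma_k) / sum_j pi_j N(x_i|mu_j,sigma_j)."""
    dens = np.stack([pi * norm.pdf(x, mu, sig)
                     for pi, mu, sig in zip(pis, mus, sigmas)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(x, gamma):
    """Closed-form maximizers of Q for a 1-D Gaussian mixture."""
    nk = gamma.sum(axis=0)                            # effective number of points per component
    pis = nk / len(x)                                 # new mixing weights
    mus = (gamma * x[:, None]).sum(axis=0) / nk       # responsibility-weighted means
    sigmas = np.sqrt((gamma * (x[:, None] - mus) ** 2).sum(axis=0) / nk)  # weighted std devs
    return pis, mus, sigmas
```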
The overall EM procedure is as follows:
(1) choose the model and the number of mixture components $K$
(2) choose initial guesses for all parameters, including the mixing weights $\pi_k$
  • As for the model parameters $\theta_k$, randomly initialize them or use some strategy
  • As for the $\pi_k$, we can initialize them randomly or uniformly. We can also use K-means to initialize them, i.e. first apply a $K$-class K-means clustering to the dataset and then calculate the $\pi_k$ (and initial $\theta_k$) from the clustering results. Note that mixture models are subject to label switching: permuting the component labels leaves the likelihood unchanged, so the labels themselves carry no meaning.
(3) Perform an E-step and calculate the responsibilities $\gamma_{ik}$
(4) Perform an M-step and get new parameter estimates that maximize $Q(\theta)$
(5) Repeat (3) and (4) until the parameters stop changing (or the log-likelihood changes by less than a threshold, e.g., $10^{-6}$)
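Putting the pieces together, a minimal sketch of the full loop for the 1-D Gaussian mixture; it reuses the e_step, m_step, and gmm_loglik helpers sketched above and follows this recipe's initialization and stopping rule (the toy data, $K = 2$, and the iteration cap are my own choices):

```python
import numpy as np

# assumes e_step, m_step, and gmm_loglik from the sketches above are in scope
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # toy data

K = 2
pis = np.full(K, 1.0 / K)                   # (2) uniform initial mixing weights
mus = rng.choice(x, size=K, replace=False)  # (2) random initial means
sigmas = np.full(K, x.std())                # (2) initial standard deviations

old_ll = -np.inf
for _ in range(1000):
    gamma = e_step(x, pis, mus, sigmas)     # (3) E-step: responsibilities
    pis, mus, sigmas = m_step(x, gamma)     # (4) M-step: new parameter estimates
    ll = gmm_loglik(x, pis, mus, sigmas)
    if ll - old_ll < 1e-6:                  # (5) stop when the log-likelihood stalls
        break
    old_ll = ll
print(pis, mus, sigmas)
```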
 
