🍃

Multivariate Statistics and ML - Cheat Sheet

Expectation and Variance Definition: $E(x) = \mu$; $\text{Var}(x) = \Sigma = E\left[(x-\mu)(x-\mu)^T\right]$.
Multivariate normal model: $x \sim N_d(\mu, \Sigma)$ with density $f(x) = (2\pi)^{-d/2}\det(\Sigma)^{-1/2}\exp\left(-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$.
Empirical estimates: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$; $S = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^T$ (unbiased).
Maximum likelihood estimates: $\hat{\mu}_{ML} = \bar{x}$; $\hat{\Sigma}_{ML} = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^T$.
$\hat{\Sigma}_{ML}$ is always positive semi-definite. All eigenvalues need to be positive so that the covariance can be inverted (i.e. it needs to be positive definite); otherwise a regularised estimate is needed to ensure invertibility.
Total variation: $\text{tr}(\Sigma) = \sum_i \sigma_{ii} = \sum_i \lambda_i$.
Generalised variance: $\det(\Sigma) = \prod_i \lambda_i$, also a measure of the volume of the covariance ellipsoid.
Correlation matrix: $P = V^{-1/2}\,\Sigma\,V^{-1/2}$, with $V = \text{diag}(\sigma_{11},\ldots,\sigma_{dd})$; equivalently $\Sigma = V^{1/2} P V^{1/2}$.
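A minimal numpy sketch (not from the notes) of the estimates and summaries above; the simulated data, seed and variable names are illustrative only.

```python
# Empirical/ML estimates, total variation, generalised variance, correlation matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.8], [0.8, 1.0]], size=500)
n, d = X.shape

xbar = X.mean(axis=0)                           # empirical mean
S_unbiased = np.cov(X, rowvar=False)            # unbiased estimate (divides by n-1)
Sigma_ml = (X - xbar).T @ (X - xbar) / n        # ML estimate (divides by n)

total_variation = np.trace(Sigma_ml)            # tr(Sigma) = sum of eigenvalues
generalised_variance = np.linalg.det(Sigma_ml)  # det(Sigma) = product of eigenvalues

v = np.sqrt(np.diag(Sigma_ml))                  # standard deviations
P = Sigma_ml / np.outer(v, v)                   # correlation matrix V^{-1/2} Sigma V^{-1/2}

print(xbar, total_variation, generalised_variance)
print(np.round(P, 3))
```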
Contour lines of bivariate normal distributions
$\Sigma = \sigma^2 I$: spherical contour lines. $\Sigma$ diagonal: elliptical contour lines with the axes of the ellipse oriented parallel to the coordinate axes (long/short axis ratio $= \sqrt{\lambda_1/\lambda_2}$). Other $\Sigma$: elliptical contour lines, with the axes of the ellipse oriented according to the column vectors (eigenvectors) in $U$ (long/short $= \sqrt{\lambda_1/\lambda_2}$).
Location-scale (affine) transformation: $y = a + Bx$.
$E(y) = a + B\mu$; $\text{Var}(y) = B\Sigma B^T$.
Decompositions used below: $\Sigma = V^{1/2} P V^{1/2}$, with eigendecompositions $\Sigma = U\Lambda U^T$ and $P = G\Theta G^T$;
$\Lambda$ and $\Theta$ are diagonal matrices (containing the eigenvalues of $\Sigma$ and $P$).
The affine transformation is invertible when $B$ is invertible, i.e. $\det(B) \neq 0$.
Mahalanobis transform: $z = \Sigma^{-1/2}(x-\mu)$, so that $E(z) = 0$ and $\text{Var}(z) = I$.
Inverse Mahalanobis transformation: $x = \mu + \Sigma^{1/2} z$.
After an invertible linear transformation $y = a + Bx$, the pdf is $f_y(y) = f_x\!\left(B^{-1}(y-a)\right)\lvert\det(B)\rvert^{-1}$.
Whitening transformation: $z = W(x-\mu)$, with $\text{Var}(z) = I$. Constraint on the whitening matrix: $W\Sigma W^T = I$.
Parameterisation: $W = Q_1\Sigma^{-1/2} = Q_2 P^{-1/2}V^{-1/2}$, with $Q_1$, $Q_2$ orthogonal matrices.
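A small numerical check of the whitening constraint and the rotational freedom above, assuming numpy; the covariance matrix and random orthogonal $Q$ are arbitrary examples.

```python
# Any W = Q Sigma^{-1/2} with orthogonal Q satisfies the constraint W Sigma W^T = I.
import numpy as np

Sigma = np.array([[4.0, 1.2], [1.2, 1.0]])
lam, U = np.linalg.eigh(Sigma)                     # Sigma = U diag(lam) U^T
Sigma_inv_sqrt = U @ np.diag(lam ** -0.5) @ U.T    # symmetric inverse square root

Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(2, 2)))  # random orthogonal Q
for W in (Sigma_inv_sqrt, Q @ Sigma_inv_sqrt):
    print(np.allclose(W @ Sigma @ W.T, np.eye(2)))  # True, True
```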
Cross-cov: $\Phi = \text{Cov}(z,x) = W\Sigma = Q_1\Sigma^{1/2}$; $\text{tr}(\Phi)$ is maximised for $Q_1 = I$. Cross-corr: $\Psi = \text{Cor}(z,x) = W\Sigma V^{-1/2} = Q_2 P^{1/2}$; $\text{tr}(\Psi)$ is maximised for $Q_2 = I$.
ZCA (Mahalanobis) whitening: each latent component $z_i$ should be as close as possible to the corresponding original variable $x_i$, i.e. just remove correlations. Objective function: minimise $E\left[(z-x)^T(z-x)\right] = d - 2\,\text{tr}(\Phi) + \text{tr}(\Sigma)$ (with centred $x$).
Only $\text{tr}(\Phi)$ is a function of $Q_1$, so the equivalent objective is to maximise $\text{tr}(\Phi) = \text{tr}(Q_1\Sigma^{1/2})$ to find the optimal $Q_1 = I$ ⇒ $W^{\text{ZCA}} = \Sigma^{-1/2}$. Total variation contributed (ratio) by $z_i$: $(\Phi\Phi^T)_{ii}/\text{tr}(\Sigma) = \sigma_{ii}/\text{tr}(\Sigma)$.
ZCA-cor-whitening: same as above but remove the scale in $x$ first, i.e. compare $z$ to the standardised variables $\tilde{x} = V^{-1/2}(x-\mu)$. Objective function: minimise $E\left[(z-\tilde{x})^T(z-\tilde{x})\right] = 2d - 2\,\text{tr}(\Psi)$.
Only $\text{tr}(\Psi)$ is a function of $Q_2$, so the equivalent objective is to maximise $\text{tr}(\Psi) = \text{tr}(Q_2 P^{1/2})$ to find the optimal $Q_2 = I$ ⇒ $W^{\text{ZCA-cor}} = P^{-1/2}V^{-1/2}$. Total variation contributed (ratio) by $z_i$: $(\Phi\Phi^T)_{ii}/\text{tr}(\Sigma)$.
PCA-whitening: remove correlations and at the same time compress information into a few latent variables (in descending order of importance). $\Phi = Q_1\Sigma^{1/2}$; relative contribution of $z_i$: $(\Phi\Phi^T)_{ii}$.
Objective function: find an optimal $Q_1$ so that the resulting set of relative contributions $\{(\Phi\Phi^T)_{11},\ldots,(\Phi\Phi^T)_{dd}\}$ majorizes any other set of relative contributions. Apply Schur's theorem: through the eigendecomposition of $\Sigma = U\Lambda U^T$, the optimal value for $Q_1$ is $Q_1 = U^T$; $W^{\text{PCA}} = \Lambda^{-1/2}U^T$; $\Phi = \Lambda^{1/2}U^T$; $\Psi = \Lambda^{1/2}U^T V^{-1/2}$. $U$ is not uniquely defined, we are free to change the column signs, thus ($Q_1$, $W$, $\Phi$, $\Psi$) is not unique. Total variation contributed ratio of $z_i$: $\lambda_i/\sum_j\lambda_j$.
PCA-cor-whitening: same as PCA whitening but remove the scale in $x$ first. $\Psi = Q_2 P^{1/2}$; relative contribution of $z_i$: $(\Psi\Psi^T)_{ii}$. Objective function: find an optimal $Q_2$ so that the resulting set of relative contributions majorizes any other set of relative contributions.
Apply Schur's theorem: through the eigendecomposition of $P = G\Theta G^T$, the optimal value for $Q_2$ is $Q_2 = G^T$; $W^{\text{PCA-cor}} = \Theta^{-1/2}G^T V^{-1/2}$; $\Psi = \Theta^{1/2}G^T$; $\Phi = \Theta^{1/2}G^T V^{1/2}$. As with PCA whitening, there are sign ambiguities in the above because the column signs of $G$ can be freely chosen. Total variation contributed ratio of $z_i$: $\theta_i/\sum_j\theta_j = \theta_i/d$.
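A hedged sketch comparing the four whitening matrices above (ZCA, ZCA-cor, PCA, PCA-cor) on an arbitrary example covariance; all satisfy $W\Sigma W^T = I$ but differ in the rotation and hence in the cross-covariance $\Phi = W\Sigma$.

```python
# Build the ZCA, ZCA-cor, PCA and PCA-cor whitening matrices and check the constraint.
import numpy as np

def inv_sqrt(M):
    """Symmetric inverse square root via eigendecomposition."""
    lam, U = np.linalg.eigh(M)
    return U @ np.diag(lam ** -0.5) @ U.T

Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 1.0, 0.3],
                  [0.5, 0.3, 2.0]])
v = np.sqrt(np.diag(Sigma))
V_inv_sqrt = np.diag(1.0 / v)
P = Sigma / np.outer(v, v)                 # correlation matrix

# eigh returns eigenvalues in ascending order; for PCA one would normally re-sort descending
lam, U = np.linalg.eigh(Sigma)             # Sigma = U diag(lam) U^T
theta, G = np.linalg.eigh(P)               # P = G diag(theta) G^T

W = {
    "ZCA":     inv_sqrt(Sigma),
    "ZCA-cor": inv_sqrt(P) @ V_inv_sqrt,
    "PCA":     np.diag(lam ** -0.5) @ U.T,
    "PCA-cor": np.diag(theta ** -0.5) @ G.T @ V_inv_sqrt,
}

for name, Wm in W.items():
    assert np.allclose(Wm @ Sigma @ Wm.T, np.eye(3))   # whitening constraint
    Phi = Wm @ Sigma                                   # cross-covariance Cov(z, x)
    # relative contribution of each z_i to the total variation tr(Sigma)
    print(name, np.round(np.diag(Phi @ Phi.T) / np.trace(Sigma), 3))
```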
Cholesky whitening: find a whitening transformation such that the cross-covariance $\Phi$ and cross-correlation $\Psi$ have lower triangular structure. Apply the Cholesky decomposition to $\Sigma^{-1} = LL^T$, where $L$ is a lower triangular matrix with positive diagonal elements; its inverse is also lower triangular with positive diagonal elements. $W^{\text{Chol}} = L^T$; $\Phi = L^{-1}$; $\Psi = L^{-1}V^{-1/2}$; $Q_1 = L^T\Sigma^{1/2}$; $Q_2 = L^T V^{1/2}P^{1/2}$. Cholesky whitening depends on the ordering of the input variables: each ordering of the original variables yields a different triangular constraint and thus a different Cholesky whitening transform.
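A sketch of Cholesky whitening under the convention stated above ($\Sigma^{-1} = LL^T$, $W = L^T$); the example matrix is illustrative.

```python
# Cholesky whitening: W = L^T with Sigma^{-1} = L L^T; Phi = W Sigma = L^{-1} is lower triangular.
import numpy as np

Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 1.0, 0.3],
                  [0.5, 0.3, 2.0]])
L = np.linalg.cholesky(np.linalg.inv(Sigma))   # lower triangular, positive diagonal
W_chol = L.T

print(np.allclose(W_chol @ Sigma @ W_chol.T, np.eye(3)))  # True: whitening constraint
Phi = W_chol @ Sigma
print(np.round(Phi, 3))                                   # lower triangular cross-covariance
```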
PCA transformation: random vector $x$ with $\text{Var}(x) = \Sigma = U\Lambda U^T$. PCA: $t = U^T(x-\mu)$, where $\text{Var}(t) = \Lambda$ instead of $I$. PCA is not a whitening procedure but is closely linked to PCA whitening: $z^{\text{PCA}} = \Lambda^{-1/2}t$. Principal components' contribution to the total variation: $\lambda_i/\sum_j\lambda_j$; discard low-ranking components to reduce dimension. Application to a data matrix $X$: 1. estimate $\hat{\Sigma}$, then apply an eigenvalue decomposition to it to get $U$ and $\Lambda$; 2. centre the data and compute the scores $T = X_c U$. Alternatively, with the SVD of the centred data matrix $X_c = VDU^T$ we get $U$ directly (and $\Lambda = D^2/(n-1)$).
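A sketch of the two computational routes to PCA mentioned above (eigendecomposition of the estimated covariance vs. SVD of the centred data matrix), on simulated data.

```python
# Both routes give the same principal variances (and the same axes up to column signs).
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0, 0], [[4, 1.2, .5], [1.2, 1, .3], [.5, .3, 2]], size=200)
n, d = X.shape
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance estimate
S = Xc.T @ Xc / (n - 1)
lam, U = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]          # sort eigenvalues in decreasing order
lam, U = lam[order], U[:, order]
T1 = Xc @ U                            # principal component scores

# Route 2: SVD of the centred data matrix, Xc = V D U^T
V_svd, Dvals, Ut = np.linalg.svd(Xc, full_matrices=False)
lam_svd = Dvals ** 2 / (n - 1)         # same variances as the eigenvalues above

print(np.allclose(lam, lam_svd))
print(np.round(T1.var(axis=0, ddof=1), 2))   # score variances equal the eigenvalues
print(np.round(lam / lam.sum(), 3))          # proportion of total variation per component
```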
Divisive or recursive partitioning algorithms: grow the tree from the root downwards; first determine the two main clusters, then recursively refine the clusters further.
Agglomerative hierarchical clustering: grow the tree from the leaves upwards, successively forming partitions by first joining individual objects together, then recursively joining groups of items, until everything is merged. Euclidean distance: $d(x_1,x_2) = \lVert x_1-x_2\rVert_2 = \sqrt{\sum_j (x_{1j}-x_{2j})^2}$; squared Euclidean distance: $\lVert x_1-x_2\rVert_2^2$; Manhattan distance: $\sum_j\lvert x_{1j}-x_{2j}\rvert$; maximum norm: $\max_j\lvert x_{1j}-x_{2j}\rvert$. Distance between two sets $A$, $B$: complete linkage (max distance): $\max_{x_1\in A,\,x_2\in B} d(x_1,x_2)$; single linkage (min distance): $\min_{x_1\in A,\,x_2\in B} d(x_1,x_2)$; average linkage (avg. distance): $\frac{1}{\lvert A\rvert\lvert B\rvert}\sum_{x_1\in A}\sum_{x_2\in B} d(x_1,x_2)$. Ward's clustering: merge the pair of clusters giving the smallest increase in the within-group sum of squares, $d(A,B) = \frac{\lvert A\rvert\lvert B\rvert}{\lvert A\rvert+\lvert B\rvert}\lVert\bar{x}_A-\bar{x}_B\rVert^2$.
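A sketch of agglomerative clustering with the distances and linkages listed above, assuming SciPy is available; the simulated data are illustrative.

```python
# Pairwise distances + hierarchical linkage with SciPy.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(20, 2)), rng.normal(5, 1, size=(20, 2))])

d_euclid = pdist(X, metric="euclidean")    # ||x1 - x2||
d_manhat = pdist(X, metric="cityblock")    # sum_j |x1j - x2j|
d_maxnorm = pdist(X, metric="chebyshev")   # max_j |x1j - x2j|

Z = linkage(d_euclid, method="complete")   # also: "single", "average", "ward"
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```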
K-means: SSW (total unexplained sum of squares): $\text{SSW} = \sum_{k=1}^K\sum_{i\in C_k}\lVert x_i-\bar{x}_k\rVert^2$; SSB (explained sum of squares): $\text{SSB} = \sum_{k=1}^K n_k\lVert\bar{x}_k-\bar{x}\rVert^2$; SST (total sum of squares): $\text{SST} = \sum_{i=1}^n\lVert x_i-\bar{x}\rVert^2$. SST = SSB + SSW. Choose $K$ at the point where a larger $K$ no longer gives a substantially larger SSB (elbow criterion). K-medoids (PAM): the cluster centre is an actual data point (medoid) selected from the group rather than the mean, and distances other than the Euclidean distance can be used.
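A sketch verifying SST = SSB + SSW on a toy K-means fit; the plain-numpy Lloyd iteration below is for illustration only, not the notes' implementation.

```python
# K-means and the variance decomposition SST = SSB + SSW.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
    return labels, centers

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels, centers = kmeans(X, K=2)

xbar = X.mean(axis=0)
SSW = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(2))
SSB = sum((labels == k).sum() * ((centers[k] - xbar) ** 2).sum() for k in range(2))
SST = ((X - xbar) ** 2).sum()
print(np.isclose(SST, SSB + SSW))   # True
```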
Finite mixture model: $f(x) = \sum_{k=1}^K \pi_k f_k(x)$ with mixing weights $\pi_k \geq 0$, $\sum_k\pi_k = 1$.
GMM: $f(x) = \sum_{k=1}^K \pi_k N(x\mid\mu_k,\Sigma_k)$; $E(x) = \sum_k\pi_k\mu_k$; $\text{Var}(x) = \sum_k\pi_k\Sigma_k + \sum_k\pi_k\left(\mu_k - E(x)\right)\left(\mu_k - E(x)\right)^T$.
Penalised likelihood model selection: compare models (e.g. different numbers of components $K$) by the maximised log-likelihood minus a penalty on the number of parameters (AIC, BIC). ML estimates of the mixture parameters are obtained with the EM algorithm below.
EM algorithm
E-step: given the current parameters $\theta^{(t)}$, compute the posterior probabilities (responsibilities) of the latent allocations and the expected complete-data log-likelihood $Q(\theta\mid\theta^{(t)}) = E_{z\mid x,\theta^{(t)}}\left[\log L(\theta\mid x,z)\right]$; M-step: $\theta^{(t+1)} = \arg\max_\theta Q(\theta\mid\theta^{(t)})$; iterate until convergence.
EM for GMM: E-step: responsibilities $\gamma_{ik} = \dfrac{\pi_k N(x_i\mid\mu_k,\Sigma_k)}{\sum_{l=1}^K\pi_l N(x_i\mid\mu_l,\Sigma_l)}$; M-step: $n_k = \sum_{i=1}^n\gamma_{ik}$; $\hat\pi_k = n_k/n$; $\hat\mu_k = \frac{1}{n_k}\sum_i\gamma_{ik}x_i$; $\hat\Sigma_k = \frac{1}{n_k}\sum_i\gamma_{ik}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T$.
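A sketch of the EM updates for a GMM as written above (responsibilities in the E-step, weighted moments in the M-step), using numpy and scipy.stats; the initialisation and simulated data are illustrative.

```python
# EM for a Gaussian mixture model.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, K, replace=False)]
    Sigma = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted estimates
        nk = gamma.sum(axis=0)
        pi = nk / n
        mu = (gamma.T @ X) / nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / nk[k]
    return pi, mu, Sigma, gamma

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
pi, mu, Sigma, gamma = em_gmm(X, K=2)
print(np.round(pi, 2), np.round(mu, 2))
```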
K-means can be viewed as an EM-type algorithm that provides a hard classification based on a simple restricted Gaussian mixture model (equal mixing weights $\pi_k = 1/K$ and spherical covariances $\Sigma_k = \sigma^2 I$). Then the E-step reduces to assigning each $x_i$ to the cluster with the nearest mean (hard allocation, $\gamma_{ik}\in\{0,1\}$), and the M-step reduces to recomputing the cluster means.
Invariant states of GMM EM: if the responsibilities are constant across observations, $\gamma_{ik} = \pi_k$ for all $i$, then the M-step estimates are $\hat\mu_k = \bar{x}$ and $\hat\Sigma_k = \hat\Sigma$ for all $k$. The next E-step then again returns $\gamma_{ik} = \pi_k$ (soft allocation), so this configuration is a fixed point of the EM iterations.
ML of GMM: complete-data likelihood: $L(\theta\mid x,z) = \prod_{i=1}^n\prod_{k=1}^K\left[\pi_k N(x_i\mid\mu_k,\Sigma_k)\right]^{z_{ik}}$;
observed-data likelihood: $L(\theta\mid x) = \prod_{i=1}^n\sum_{k=1}^K\pi_k N(x_i\mid\mu_k,\Sigma_k)$.
Problems of GMM ML: label switching and non-identifiability of the cluster labels; the GMM likelihood becomes singular (unbounded) if one of the fitted covariance matrices becomes singular.
Bayesian discriminant rule: $\Pr(k\mid x) = \dfrac{\pi_k f_k(x)}{\sum_l\pi_l f_l(x)}$ ⇒ $\hat{y} = \arg\max_k\Pr(k\mid x) = \arg\max_k\pi_k f_k(x)$ ⇒ discriminant function $d_k(x) = \log\pi_k + \log f_k(x)$, $\hat{y} = \arg\max_k d_k(x)$. Softmax: $\Pr(k\mid x) = \dfrac{\exp(d_k(x))}{\sum_l\exp(d_l(x))}$.
QDA: assume $f_k(x) = N(x\mid\mu_k,\Sigma_k)$. Discriminant function: $d_k^{\text{QDA}}(x) = \log\pi_k - \frac{1}{2}\log\det(\Sigma_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)$. The QDA discriminant function is quadratic in $x$. This implies that the decision boundaries for QDA classification are quadratic.
LDA: based on QDA, assume a common covariance $\Sigma_k = \Sigma$ for all classes. Discriminant function: $d_k(x) = \log\pi_k - \frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)$; dropping the term $-\frac{1}{2}x^T\Sigma^{-1}x$, which doesn't depend on $k$, gives $d_k^{\text{LDA}}(x) = \log\pi_k + \mu_k^T\Sigma^{-1}x - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k$, which is linear in $x$.
DDA (diagonal discriminant analysis): based on QDA, assume a common diagonal covariance $\Sigma_k = V = \text{diag}(\sigma_1^2,\ldots,\sigma_d^2)$. Discriminant function: $d_k^{\text{DDA}}(x) = \log\pi_k - \frac{1}{2}\sum_{j=1}^d\frac{(x_j-\mu_{kj})^2}{\sigma_j^2}$.
Computation cost: for QDA, LDA and DDA, we need to learn the class probabilities $\pi_k$ (with $\sum_k\pi_k = 1$, so $K-1$ parameters) and the mean vectors $\mu_k$ ($Kd$ parameters). For QDA, we additionally require the $K$ class covariances $\Sigma_k$ ⇒ $K\,d(d+1)/2$ parameters; for LDA, we need the common $\Sigma$ ⇒ $d(d+1)/2$ parameters; for DDA, we estimate only the diagonal $V$ ⇒ $d$ parameters.
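A sketch of the QDA/LDA discriminant functions and the softmax posterior from the Bayesian discriminant rule above; the two-class parameters are hypothetical toy values, not from the notes.

```python
# Discriminant functions and softmax class posteriors.
import numpy as np

def qda_discriminant(x, pi_k, mu_k, Sigma_k):
    """d_k(x) = log pi_k - 1/2 log det(Sigma_k) - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k)."""
    diff = x - mu_k
    return (np.log(pi_k)
            - 0.5 * np.linalg.slogdet(Sigma_k)[1]
            - 0.5 * diff @ np.linalg.solve(Sigma_k, diff))

def lda_discriminant(x, pi_k, mu_k, Sigma):
    """d_k(x) = log pi_k + mu_k^T Sigma^{-1} x - 1/2 mu_k^T Sigma^{-1} mu_k."""
    b = np.linalg.solve(Sigma, mu_k)
    return np.log(pi_k) + b @ x - 0.5 * mu_k @ b

def softmax(d):
    e = np.exp(d - d.max())
    return e / e.sum()

# hypothetical two-class parameters with a shared covariance
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

x = np.array([2.0, 1.5])
d_lda = np.array([lda_discriminant(x, pi[k], mu[k], Sigma) for k in range(2)])
d_qda = np.array([qda_discriminant(x, pi[k], mu[k], Sigma) for k in range(2)])
print(np.argmax(d_lda), np.round(softmax(d_lda), 3))   # predicted class and posterior probs
print(np.argmax(d_qda) == np.argmax(d_lda))            # same decision when Sigma_k = Sigma
```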
Regularisation: modern statistics has developed many different (but related) methods for use in high-dimensional, small-sample settings: regularised estimators, shrinkage estimators, penalised maximum likelihood estimators, Bayesian estimators, empirical Bayes estimators, KL/entropy-based estimators.
Squared multiple correlation: $\rho^2_{y\cdot x} = \dfrac{\Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}}{\sigma_y^2} = P_{yx}P_{xx}^{-1}P_{xy}$, the squared correlation between a scalar $y$ and its best linear predictor from the vector $x$.
CCA: CCA aims to characterise the linear dependence between two random vectors $x$ (dimension $p$) and $y$ (dimension $q$) by a set of canonical correlations $\lambda_1\geq\ldots\geq\lambda_m$, $m = \min(p,q)$, obtained by simultaneously whitening the two random vectors $x$ and $y$. Whitening of $x$: $\tilde{x} = W_x(x-\mu_x)$ with $W_x = Q_x P_{xx}^{-1/2}V_x^{-1/2}$; whitening of $y$: $\tilde{y} = W_y(y-\mu_y)$ with $W_y = Q_y P_{yy}^{-1/2}V_y^{-1/2}$. Cross-correlation between $\tilde{x}$ and $\tilde{y}$: $\text{Cor}(\tilde{x},\tilde{y}) = Q_x K Q_y^T$ with $K = P_{xx}^{-1/2}P_{xy}P_{yy}^{-1/2}$. To make $\text{Cor}(\tilde{x},\tilde{y})$ diagonal, we can use the singular value decomposition (SVD) of the matrix $K$: $K = U_K\Lambda_K V_K^T$, where $\Lambda_K$ is the diagonal matrix containing the singular values of $K$. This yields the orthogonal matrices $Q_x = U_K^T$ and $Q_y = V_K^T$ and thus the desired whitening matrices $W_x$ and $W_y$.
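A sketch of CCA via simultaneous whitening as described above: the canonical correlations are the singular values of $K = P_{xx}^{-1/2}P_{xy}P_{yy}^{-1/2}$. The simulated data and variable names are illustrative.

```python
# CCA by whitening + SVD of K.
import numpy as np

def inv_sqrt(M):
    lam, U = np.linalg.eigh(M)
    return U @ np.diag(lam ** -0.5) @ U.T

rng = np.random.default_rng(5)
n, p, q = 500, 3, 2
X = rng.normal(size=(n, p))
Y = X[:, :q] + 0.8 * rng.normal(size=(n, q))   # correlated with X by construction

R = np.corrcoef(np.hstack([X, Y]), rowvar=False)
Pxx, Pyy, Pxy = R[:p, :p], R[p:, p:], R[:p, p:]

K = inv_sqrt(Pxx) @ Pxy @ inv_sqrt(Pyy)
Uk, lam, Vkt = np.linalg.svd(K)                # lam = canonical correlations
Wx = Uk.T @ inv_sqrt(Pxx) @ np.diag(1 / X.std(axis=0, ddof=1))   # CCA whitening of x
Wy = Vkt  @ inv_sqrt(Pyy) @ np.diag(1 / Y.std(axis=0, ddof=1))   # CCA whitening of y

Xw = (X - X.mean(axis=0)) @ Wx.T
Yw = (Y - Y.mean(axis=0)) @ Wy.T
print(np.round(lam, 3))
print(np.round(np.corrcoef(np.hstack([Xw, Yw]), rowvar=False)[:p, p:], 3))  # ~ diag(lam)
```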
Vector alienation coefficient: $a(x,y) = \det(I - KK^T) = \dfrac{\det(P)}{\det(P_{xx})\det(P_{yy})}$, with $P$ the joint correlation matrix. With $K = P_{xx}^{-1/2}P_{xy}P_{yy}^{-1/2}$, the vector alienation coefficient can be written as $a(x,y) = \prod_i(1-\lambda_i^2)$, where the $\lambda_i$ are the singular values of $K$, i.e. the canonical correlations for the pair $x$ and $y$. The vector alienation coefficient is thus computed as a summary statistic of the canonical correlations. If $P_{xy} = 0$ then all $\lambda_i = 0$, thus the vector alienation coefficient $a(x,y) = 1$.
Rozeboom vector correlation: the squared vector correlation is the complement of the vector alienation coefficient, defined as $\rho^2_{xy} = 1 - a(x,y) = 1 - \prod_i(1-\lambda_i^2)$. If $P_{xy} = 0$ then $a(x,y) = 1$ and hence $\rho^2_{xy} = 0$. If either $p = 1$ or $q = 1$, the squared vector correlation reduces to the corresponding squared multiple correlation, which in turn, for both $p = 1$ and $q = 1$, becomes the squared Pearson correlation.
RV coefficient: defined as $\text{RV}(x,y) = \dfrac{\text{tr}(\Sigma_{xy}\Sigma_{yx})}{\sqrt{\text{tr}(\Sigma_{xx}^2)\,\text{tr}(\Sigma_{yy}^2)}}$. The main advantage of the RV coefficient is that it is easier to compute than the Rozeboom vector correlation, as it uses the matrix trace rather than the matrix determinant. For $p = q = 1$ the RV coefficient reduces to the squared Pearson correlation. However, the RV coefficient does not reduce to the squared multiple correlation for $q = 1$ and $p > 1$, and therefore the RV coefficient cannot be considered a coherent generalisation of the Pearson and multiple correlation to the case when $x$ and $y$ are random vectors.
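A sketch computing the vector alienation coefficient, the squared Rozeboom vector correlation and the RV coefficient from the same covariance blocks; the helper name `vector_dependence` and the simulated data are made up for illustration.

```python
# Vector alienation, Rozeboom vector correlation and RV coefficient.
import numpy as np

def inv_sqrt(M):
    lam, U = np.linalg.eigh(M)
    return U @ np.diag(lam ** -0.5) @ U.T

def vector_dependence(Sxx, Syy, Sxy):
    vx, vy = np.sqrt(np.diag(Sxx)), np.sqrt(np.diag(Syy))
    Pxx, Pyy = Sxx / np.outer(vx, vx), Syy / np.outer(vy, vy)
    Pxy = Sxy / np.outer(vx, vy)
    lam = np.linalg.svd(inv_sqrt(Pxx) @ Pxy @ inv_sqrt(Pyy), compute_uv=False)
    alienation = np.prod(1 - lam ** 2)               # prod_i (1 - lam_i^2)
    rozeboom_sq = 1 - alienation                     # 1 - prod_i (1 - lam_i^2)
    rv = np.trace(Sxy @ Sxy.T) / np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))
    return alienation, rozeboom_sq, rv

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 2))
Y = 0.7 * X + rng.normal(size=(500, 2))
S = np.cov(np.hstack([X, Y]), rowvar=False)
print(vector_dependence(S[:2, :2], S[2:, 2:], S[:2, 2:]))
```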
Mutual information (MI): $\text{MI}(x,y) = D_{\text{KL}}(F_{xy}\,\|\,F_x F_y) = E_{F_{xy}}\left[\log\dfrac{f(x,y)}{f(x)f(y)}\right]$. Thus, MI measures how well the joint distribution can be approximated by the product distribution (which would be the appropriate joint distribution if $x$ and $y$ were independent). $\text{MI}(x,y) = 0$ implies that the joint distribution and the product distribution are the same. Hence the two random variables $x$ and $y$ are independent if the mutual information vanishes.
MI between two normal scalar variables: for jointly normal $x$ and $y$ with correlation $\rho$, $\text{MI}(x,y) = -\frac{1}{2}\log(1-\rho^2)$; the $2\times 2$ correlation matrix $P = \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}$ has trace $2$ and determinant $1-\rho^2$.
MI between two normal random vectors: for jointly multivariate normal $x$ ($p$-dimensional) and $y$ ($q$-dimensional) with joint correlation matrix $P = \begin{pmatrix}P_{xx} & P_{xy}\\ P_{yx} & P_{yy}\end{pmatrix}$, $\text{MI}(x,y) = -\frac{1}{2}\log\dfrac{\det(P)}{\det(P_{xx})\det(P_{yy})} = -\frac{1}{2}\log\det(I-KK^T)$, with $K = P_{xx}^{-1/2}P_{xy}P_{yy}^{-1/2}$. With $\lambda_1,\ldots,\lambda_m$, the singular values of $K$ (i.e. the canonical correlations between $x$ and $y$), we get $\text{MI}(x,y) = -\frac{1}{2}\sum_i\log(1-\lambda_i^2)$. By comparison with the squared Rozeboom vector correlation $\rho^2_{xy} = 1-\prod_i(1-\lambda_i^2)$, we recognise that $\text{MI}(x,y) = -\frac{1}{2}\log(1-\rho^2_{xy})$. Thus, the vector correlation is directly linked to mutual information for jointly multivariate normally distributed variables.
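A short numerical check of the normal-case identities above, with made-up canonical correlations.

```python
# MI from canonical correlations and its link to the squared vector correlation.
import numpy as np

lam = np.array([0.8, 0.3])                  # canonical correlations (example values)
mi = -0.5 * np.sum(np.log(1 - lam ** 2))    # MI(x, y) for jointly normal vectors
rho2 = 1 - np.prod(1 - lam ** 2)            # squared (Rozeboom) vector correlation
print(np.isclose(mi, -0.5 * np.log(1 - rho2)))   # True
```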
MI for variable selection: the expected KL divergence between the conditional and the marginal distribution is equal to the mutual information between $x$ and $y$ ⇒ $E_{F_x}\left[D_{\text{KL}}(F_{y\mid x}\,\|\,F_y)\right] = \text{MI}(x,y)$. Thus MI measures the impact of conditioning on $x$. If the MI is small (close to zero), then $x$ is not useful in predicting $y$: since $\text{MI}(x,y) = 0$ implies $F_{y\mid x} = F_y$, conditioning on $x$ then does not change the distribution of $y$. Because of this link of MI with conditioning, the MI between response and predictor variables is often used for variable and feature selection in general models.
MI & correlation: MI can be computed for any distribution and model, and thus applies to both normal and non-normal models and to both linear and nonlinear relationships. Correlation only measures linear dependence and can therefore be misleading, as illustrated by the Anscombe data sets.
