
01 Multivariate Random Variables


1. Essentials in multivariate statistics

Multivariate models provide a means to learn dependencies and interactions among the components of a random vector, which in turn allows us to draw conclusions about the underlying mechanisms of interest.
Two main tasks of the course:
  • unsupervised learning (finding structure, clustering)
  • supervised learning (training from labelled data, followed by prediction)
Challenges:
  • complexity of model needs to be appropriate for problem and available data
  • high dimensions make estimation and inference difficult
  • computational issues

1.1 Mean of a random vector

The mean / expectation of a random vector x = (x_1, …, x_d)^T of dimension d is also a vector of dimension d: E(x) = μ = (μ_1, …, μ_d)^T with μ_k = E(x_k).

1.2 Variance of a random vector

Definition of variance for a univariate random variable: Var(x) = E[(x − μ)²] = σ²
Definition of variance for a multivariate random variable: Var(x) = E[(x − μ)(x − μ)^T] = Σ
The variance of a random vector is thus a matrix, the covariance matrix Σ, with entries Σ_ij = Cov(x_i, x_j).

1.3 Properties of the covariance matrix

  • Σ is real valued: Σ ∈ ℝ^(d×d)
  • Σ is symmetric: Σ = Σ^T
Number of separate entries: d(d+1)/2
Note:
  • For large dimension d the covariance matrix has many components ⇒ computationally expensive (both for storage and in handling) ⇒ very challenging to estimate in high dimension d
  • Matrix inversion requires O(d³) operations using standard algorithms such as Gauss-Jordan elimination. Hence, computing Σ⁻¹ is computationally expensive for large d (see the numpy sketch below)
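A minimal numpy sketch of these two costs, assuming a generic d × d covariance matrix (the names Sigma and Omega are placeholders, not notation from the notes):

```python
import numpy as np

d = 1000
n_free = d * (d + 1) // 2          # number of separate entries: 500500 for d = 1000

rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d + np.eye(d)    # a well-conditioned d x d covariance matrix
Omega = np.linalg.inv(Sigma)       # inversion costs O(d^3) operations
```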

1.4 Eigenvalue decomposition of Σ

A symmetric matrix with real entries has real eigenvalues and a complete set of orthogonal eigenvectors. Thus
Σ = U Λ U^T
where U is an orthogonal matrix containing the eigenvectors (as columns) and
Λ = diag(λ_1, …, λ_d) contains the eigenvalues.
Because for a non-zero non-random vector a:
Var(a^T x) = a^T Σ a ≥ 0
the covariance matrix is always positive semi-definite.
In fact, unless there is collinearity among the components, all eigenvalues will be strictly positive and Σ is positive definite.
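A small numpy check of this decomposition, assuming an arbitrary sample covariance matrix (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))          # 200 samples of a 3-dimensional vector
Sigma = np.cov(X, rowvar=False)            # real and symmetric

lambdas, U = np.linalg.eigh(Sigma)         # eigh exploits symmetry
print(np.all(lambdas > 0))                 # True: positive definite (no collinearity)
np.testing.assert_allclose(U @ np.diag(lambdas) @ U.T, Sigma, atol=1e-12)
```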

1.5 Quantities related to the covariance matrix

Correlation matrix
P = (ρ_ij) with ρ_ij = Σ_ij / √(Σ_ii Σ_jj)
P is a symmetric matrix with unit diagonal (ρ_ii = 1).
Variance-correlation decomposition:
Σ = V^(1/2) P V^(1/2)
where V = diag(σ_1², …, σ_d²) is a diagonal matrix containing the variances.
Thus P = V^(−1/2) Σ V^(−1/2) (the definition of correlation written in matrix notation)
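A short numpy sketch of the decomposition P = V^(−1/2) Σ V^(−1/2) (the helper name cov_to_cor is hypothetical):

```python
import numpy as np

def cov_to_cor(Sigma):
    """Correlation matrix P = V^(-1/2) Sigma V^(-1/2), V = diag of variances."""
    v_inv_sqrt = 1.0 / np.sqrt(np.diag(Sigma))
    return Sigma * np.outer(v_inv_sqrt, v_inv_sqrt)

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
print(cov_to_cor(Sigma))    # unit diagonal, off-diagonal 0.6 / sqrt(2)
```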

1.6 Precision matrix or concentration matrix

The inverse of the covariance matrix, Ω = Σ⁻¹, can be obtained via the spectral decomposition, followed by inverting the eigenvalues:
Ω = Σ⁻¹ = U Λ⁻¹ U^T = U diag(1/λ_1, …, 1/λ_d) U^T
All eigenvalues need to be positive so that Σ can be inverted (i.e. Σ needs to be positive definite).
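A minimal sketch of this spectral route to the precision matrix, assuming a positive definite input (the function name precision_matrix is my own):

```python
import numpy as np

def precision_matrix(Sigma):
    """Omega = U diag(1/lambda_i) U^T; requires all eigenvalues > 0."""
    lambdas, U = np.linalg.eigh(Sigma)
    if np.any(lambdas <= 0):
        raise ValueError("Sigma is not positive definite and cannot be inverted")
    return U @ np.diag(1.0 / lambdas) @ U.T
```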
Importance of Ω:
  • Many expressions contain it
  • it has close connection with graphical models
  • it is a natural parameter from an exponential family perspective

1.7 Total variation and generalised variance

Two commonly used measures that summarise the covariance matrix in a single scalar value:
  • total variation: tr(Σ) = ∑_i λ_i
  • generalised variance: det(Σ) = ∏_i λ_i
The generalised variance is also known as the volume of Σ.
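Both quantities are easy to read off from the eigenvalues; a small numpy check (example matrix chosen arbitrarily):

```python
import numpy as np

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
lambdas = np.linalg.eigvalsh(Sigma)

print(np.trace(Sigma), lambdas.sum())          # total variation
print(np.linalg.det(Sigma), lambdas.prod())    # generalised variance
```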

2. Multivariate normal distribution

2.1 Univariate normal distribution

Dimension d = 1, x ~ N(μ, σ²)
Density: f(x) = (2πσ²)^(−1/2) exp(−(x − μ)²/(2σ²))
Special case: standard normal with μ = 0 and σ² = 1
Differential entropy: H = ½ log(2πσ²) + ½ = ½ log(2πeσ²)
Cross-entropy between N(μ_1, σ_1²) and N(μ_2, σ_2²): H = ½ log(2πσ_2²) + (σ_1² + (μ_1 − μ_2)²)/(2σ_2²)
KL divergence: D_KL(N(μ_1, σ_1²) ‖ N(μ_2, σ_2²)) = ½ [log(σ_2²/σ_1²) + (σ_1² + (μ_1 − μ_2)²)/σ_2² − 1]
Maximum entropy characterisation:
the normal distribution is the unique distribution that has the highest (differential) entropy over all continuous distributions with support from −∞ to +∞ with a given mean and variance.
This makes the normal distribution important and useful. If we only know that a random variable has a certain mean and variance, and not much else, then using the normal distribution is a reasonable and well justified working assumption.
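A small Python sketch of the closed-form KL divergence above (the function name kl_normal is my own):

```python
import numpy as np

def kl_normal(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) in closed form."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

print(kl_normal(0.0, 1.0, 0.0, 1.0))   # 0: identical distributions
print(kl_normal(0.0, 1.0, 1.0, 2.0))   # > 0 otherwise
```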

2.2 Multivariate normal model

Dimension d, x ~ N_d(μ, Σ)
Density: f(x) = (2π)^(−d/2) det(Σ)^(−1/2) exp(−½ (x − μ)^T Σ⁻¹ (x − μ))
  • the density contains the precision matrix Ω = Σ⁻¹
  • the need to invert Σ implies that Σ must be positive definite (all λ_i > 0)
Special case: standard multivariate normal with μ = 0 and Σ = I,
which is equivalent to the product of d univariate standard normals
  • for d = 1, the multivariate normal reduces to the univariate normal
  • for diagonal Σ (Σ = diag(σ_1², …, σ_d²), no correlation), the MVN density is the product of d univariate normal densities
Differential entropy: H = ½ log det(2πeΣ) = (d/2) log(2π) + ½ log det(Σ) + d/2
Cross-entropy between N_d(μ_1, Σ_1) and N_d(μ_2, Σ_2): H = ½ [log det(2πΣ_2) + tr(Σ_2⁻¹Σ_1) + (μ_1 − μ_2)^T Σ_2⁻¹ (μ_1 − μ_2)]
KL divergence: D_KL(N_d(μ_1, Σ_1) ‖ N_d(μ_2, Σ_2)) = ½ [log(det(Σ_2)/det(Σ_1)) + tr(Σ_2⁻¹Σ_1) + (μ_1 − μ_2)^T Σ_2⁻¹ (μ_1 − μ_2) − d]
Shape of the multivariate normal density: the contours of constant density are ellipsoids (x − μ)^T Σ⁻¹ (x − μ) = const, with axes along the eigenvectors of Σ and axis lengths proportional to √λ_i.
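A quick scipy check that the multivariate normal density factorises for a diagonal covariance, as stated above (numbers chosen arbitrarily):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0.0, 1.0])
Sigma = np.diag([1.0, 4.0])                  # diagonal covariance, no correlation
x = np.array([0.5, 2.0])

joint = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
product = norm(0.0, 1.0).pdf(x[0]) * norm(1.0, 2.0).pdf(x[1])
np.testing.assert_allclose(joint, product)   # MVN factorises when Sigma is diagonal
```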

2.3 Three types of covariances

A covariance matrix can be parameterised in terms of: volume, shape, orientation
Σ = c · U A U^T, where A = diag(a_1, …, a_d) with det(A) = 1 and U is an orthogonal matrix.
The eigenvalues of Σ are λ_i = c·a_i
  • Volume: determined by a single parameter, the scale factor c
  • Shape: determined by A, with d − 1 free parameters (since det(A) = 1)
  • Orientation: given by the orthogonal matrix U, with d(d − 1)/2 free parameters
Type 1: spherical covariance Σ = c I, with spherical contour lines, 1 free parameter (the volume c)
Example: (contour plot omitted)
Type 2: diagonal covariance Σ = c A = diag(σ_1², …, σ_d²), with elliptical contour lines and the axes of the ellipse oriented parallel to the coordinate axes, d free parameters (volume and shape)
Example: (contour plot omitted)
Type 3: general unrestricted covariance Σ = c U A U^T, with elliptical contour lines, with the axes of the ellipse oriented according to the column vectors in U, d(d+1)/2 free parameters
Example: (contour plot omitted)
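A brief numpy sketch constructing the three types from the volume-shape-orientation building blocks (the symbols c, A, U follow the parameterisation above; the concrete numbers are made up):

```python
import numpy as np

c = 2.0                                       # volume
A = np.diag([2.0, 0.5])                       # shape, det(A) = 1
theta = np.pi / 6                             # orientation angle in 2D
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

spherical = c * np.eye(2)                     # type 1: 1 free parameter
diagonal = c * A                              # type 2: d free parameters
general = c * U @ A @ U.T                     # type 3: d(d+1)/2 free parameters
```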

2.4 Concentration of probability mass for small and large dimension

The density of the multivariate normal distribution has a bell shape with a single mode. Thus it looks as if all probability mass is always concentrated around its peak. This is true in small dimensions but incorrect in high dimensions.
Actually, only for dimensions up to around d = 10 is the probability mass concentrated in the bell around the centre; from about d = 30 onwards it has moved almost completely into the tails of the distribution.
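This can be seen by sampling from a standard multivariate normal and looking at the distance of the samples from the mode; a small simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 30, 100):
    x = rng.standard_normal((10000, d))    # samples from N_d(0, I)
    r = np.linalg.norm(x, axis=1)          # distance from the mode at the origin
    print(d, round(r.mean(), 2))           # typical distance grows roughly like sqrt(d)
```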

3. Estimation in large and small sample settings

3.1 Strategies for large sample estimation

Empirical estimators
For large n, the law of large numbers applies: the empirical distribution F̂_n converges to the true distribution F.
We would like to estimate a quantity θ = θ(F) that is a functional of F, e.g. the mean, the median or some other quantity.
The empirical estimate is obtained by replacing the unknown true distribution F with the observed empirical distribution F̂_n: θ̂ = θ(F̂_n), for example the empirical mean or the empirical covariance.
Maximum likelihood estimation
Likelihood = probability of observing the data given the model parameters
For large sample size, no estimator can be constructed that outperforms the MLE

3.2 Large sample estimates of mean and covariance

Empirical estimates:
μ̂ = (1/n) ∑_i x_i and Σ̂ = (1/n) ∑_i (x_i − μ̂)(x_i − μ̂)^T
Matrix notation:
Σ̂ = (1/n) X_c^T X_c
where X_c is the centred n × d data matrix with rows (x_i − μ̂)^T, thus each column of X_c has mean zero.
Variance estimates: the diagonal entries σ̂_k² = (1/n) ∑_i (x_ik − μ̂_k)²
Note the factor 1/n, not 1/(n − 1).
Maximum likelihood estimates:
MLE of the parameters μ and Σ of the multivariate normal distribution. The corresponding log-likelihood function is
l(μ, Σ) = −(nd/2) log(2π) − (n/2) log det(Σ) − ½ ∑_i (x_i − μ)^T Σ⁻¹ (x_i − μ)
Let Ω = Σ⁻¹, then
l(μ, Ω) = −(nd/2) log(2π) + (n/2) log det(Ω) − ½ ∑_i (x_i − μ)^T Ω (x_i − μ)
Since
∂l/∂μ = Ω ∑_i (x_i − μ)
Note that Ω is symmetric (and invertible). Setting this equal to zero, we get ∑_i (x_i − μ) = 0 and thus
μ̂_ML = (1/n) ∑_i x_i = x̄
Besides,
∂l/∂Ω = (n/2) Ω⁻¹ − ½ ∑_i (x_i − μ̂_ML)(x_i − μ̂_ML)^T
Set it equal to zero, thus
Σ̂_ML = (1/n) ∑_i (x_i − μ̂_ML)(x_i − μ̂_ML)^T
The MLEs are identical to the empirical estimates.
Note that the factor is still 1/n, not 1/(n − 1).
Distribution of the empirical / maximum likelihood estimates
With x_i ~ N_d(μ, Σ), one can find the exact distributions of the estimators.
  1. Distribution of the estimate of the mean: μ̂ ~ N_d(μ, Σ/n)
Since E(μ̂) = μ, the estimate μ̂ is unbiased.
  2. Distribution of the covariance estimate: n Σ̂_ML = ∑_i (x_i − μ̂)(x_i − μ̂)^T ~ W_d(Σ, n − 1)
Since E(Σ̂_ML) = ((n − 1)/n) Σ ≠ Σ, the estimate Σ̂_ML is biased; thus (n/(n − 1)) Σ̂_ML = (1/(n − 1)) ∑_i (x_i − μ̂)(x_i − μ̂)^T is unbiased.
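A compact numpy version of the empirical / ML estimates and the unbiased variant (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))       # n = 50 samples of a d = 3 vector
n = X.shape[0]

mu_hat = X.mean(axis=0)
Xc = X - mu_hat                        # centred data matrix
S_ml = Xc.T @ Xc / n                   # empirical / ML estimate, factor 1/n
S_unbiased = Xc.T @ Xc / (n - 1)       # unbiased, same as np.cov(X, rowvar=False)
```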

3.3 Problems with MLE in small sample settings and high dimensions

Data sets with d ≫ n, i.e. high dimension and small sample size, are now common in many fields like medicine and finance.
General problems of MLEs:
  1. MLEs are optimal only if the sample size is large compared to the number of parameters.
  2. If there is a choice between different models of different complexity, ML will always select the model with the largest number of parameters.
Modern statistical methods for high-dimensional and small-sample settings:
  • regularised estimators
  • shrinkage estimators
  • penalised MLE
  • Bayesian estimators
  • Empirical Bayes estimators
  • KL / entropy-based estimators

3.4 Estimation of covariance matrix in small sample settings

Problems with the MLE Σ̂_ML:
  • Σ has d(d+1)/2 parameters. Therefore, estimating it requires a lot of data.
  • If n < d, then Σ̂_ML is only positive semi-definite: it has vanishing eigenvalues (some λ_i = 0), so it is singular and cannot be inverted. It therefore cannot be used in the many expressions that require Σ⁻¹.
Making the MLE of Σ invertible
By adding a small number ε > 0 to the diagonal: Σ̃ = Σ̂_ML + ε I
we get an invertible matrix. However, while this simple regularisation results in an invertible matrix, the estimator itself has not improved over the MLE and will also be poorly conditioned (large condition number).
Simple Bayes-type regularised estimate of
Regularised estimator = convex combination of the MLE Σ̂_ML and a target T (e.g. the identity matrix): Σ̂_reg = λ T + (1 − λ) Σ̂_ML with λ ∈ [0, 1] (a small sketch is given at the end of this subsection)
Regularisation introduces bias and reduces variance, minimising overall MSE
The target (prior information) helps to infer Σ even in small samples.
One way to choose λ is to define it as the minimiser of the mean squared error, λ* = argmin_λ MSE(λ), where MSE = bias² + variance.
However, since we don't know the true Σ we cannot actually compute the MSE directly but have to estimate it. In practice:
  • By cross-validation (with some estimate used as target, and the estimate from held-out data taken as the "true" Σ)
  • By using some analytic approximation
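A minimal sketch of such a Bayes-type shrinkage estimator, assuming the identity as target (the function name and the choice of λ are illustrative only):

```python
import numpy as np

def shrink_covariance(S_ml, lam, target=None):
    """Convex combination of the ML estimate and a target (identity by default)."""
    d = S_ml.shape[0]
    T = np.eye(d) if target is None else target
    return lam * T + (1.0 - lam) * S_ml

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))              # n = 10 much smaller than d = 50
S_ml = np.cov(X, rowvar=False, bias=True)      # singular: rank at most n - 1
S_reg = shrink_covariance(S_ml, lam=0.2)       # full rank, invertible
print(np.linalg.matrix_rank(S_ml), np.linalg.matrix_rank(S_reg))
```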

4. Categorical and Multinomial distribution

The multivariate generalisations of the Bernoulli and binomial distributions.

4.1 Categorical distribution

The categorical distribution is a generalisation of the Bernoulli distribution and is correspondingly also known as the multinoulli distribution.
Assume there are K classes labelled 1, …, K. A discrete random variable with a state space consisting of these K classes has a categorical distribution Cat(π). The parameter vector π = (π_1, …, π_K)^T specifies the probabilities of each of the K classes, with ∑_k π_k = 1; hence there are K − 1 independent parameters in a categorical distribution.
One-hot coding: a numerical way to represent a draw from a categorical distribution, x = (x_1, …, x_K)^T with x_k = 1 if class k was drawn and all other components equal to 0.
  • The expectation of x is E(x) = π
  • The covariance matrix is Var(x) = diag(π) − π π^T
  • Component notation: Var(x_k) = π_k(1 − π_k) and Cov(x_j, x_k) = −π_j π_k for j ≠ k
The corresponding probability mass function (pmf) is
p(x | π) = ∏_k π_k^(x_k)
the log pmf is
log p(x | π) = ∑_k x_k log π_k
In order to be more explicit that the categorical distribution has K − 1 and not K parameters, we rewrite the log-density with π_K = 1 − ∑_{k<K} π_k and x_K = 1 − ∑_{k<K} x_k, thus
log p(x | π) = ∑_{k=1}^{K−1} x_k log π_k + (1 − ∑_{k=1}^{K−1} x_k) log(1 − ∑_{k=1}^{K−1} π_k)
Any of the K classes can be chosen as the reference class K.
For K = 2, the categorical distribution reduces to the Bernoulli distribution with π = (π_1, 1 − π_1).
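A small numpy sketch of one-hot coded categorical samples and their first two moments (probabilities chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])            # class probabilities, sum to 1
n = 100_000

classes = rng.choice(len(pi), size=n, p=pi)
X = np.eye(len(pi))[classes]              # one-hot coded samples, shape (n, K)

print(X.mean(axis=0))                     # approx pi            (E[x] = pi)
print(np.cov(X, rowvar=False))            # approx diag(pi) - pi pi^T
```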

4.2 Multinomial distribution

Repeating a Bernoulli experiment n times gives the binomial distribution. Similarly, repeating categorical draws n times gives the multinomial distribution.
Draw n times from the categorical distribution Cat(π) and count how often each class occurs: the vector of counts (n_1, …, n_K)^T with ∑_k n_k = n follows a multinomial distribution Mult(n, π).
Standardised to the unit interval: the relative frequencies n_k/n lie in [0, 1] and sum to 1.
distribute n balls into K bins
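A one-line numpy illustration of the ball-and-bin picture (numbers made up):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.multinomial(n=20, pvals=[0.2, 0.5, 0.3])   # distribute 20 balls into 3 bins
print(counts, counts / 20)                              # counts and relative frequencies
```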

4.3 Entropy and maximum likelihood analysis for the categorical distribution

KL divergence
With two categorical distributions Cat(π) and Cat(ρ), with corresponding probabilities π_k and ρ_k satisfying ∑_k π_k = 1 and ∑_k ρ_k = 1, we get:
D_KL(Cat(π) ‖ Cat(ρ)) = ∑_{k=1}^{K} π_k log(π_k/ρ_k)
To be explicit that there are only K − 1 parameters in a categorical distribution we can also write
D_KL = ∑_{k=1}^{K−1} π_k log(π_k/ρ_k) + π_K log(π_K/ρ_K)
with π_K = 1 − ∑_{k<K} π_k and ρ_K = 1 − ∑_{k<K} ρ_k.
Expected Fisher information
We first compute the Hessian matrix of the log-probability mass function, where the differentiation is with regard to the K − 1 free parameters π_1, …, π_{K−1}.
The diagonal entries of the Hessian matrix (with k = 1, …, K − 1) are:
∂² log p / ∂π_k² = −x_k/π_k² − x_K/π_K²
and its off-diagonal entries are (with j ≠ k)
∂² log p / ∂π_j ∂π_k = −x_K/π_K²
Since E(x_k) = π_k, the expected Fisher information matrix for a categorical distribution is
I(π) = E(−Hessian) = diag(1/π_1, …, 1/π_{K−1}) + (1/π_K) 1 1^T
For K = 2 and π = (π_1, 1 − π_1), this reduces to the expected Fisher information of a Bernoulli variable, I(π_1) = 1/(π_1(1 − π_1)).
Quadratic approximation of KL divergence
The expected Fisher information arises from a local quadratic approximation of the KL divergence:
D_KL(P_θ ‖ P_{θ+ε}) ≈ ½ ε^T I(θ) ε and D_KL(P_{θ+ε} ‖ P_θ) ≈ ½ ε^T I(θ) ε
Consider the KL divergence D_KL(Cat(π) ‖ Cat(ρ)) between the categorical distribution with probabilities π_k and the categorical distribution with probabilities ρ_k.
Keep π fixed and assume that ρ is a perturbed version of π, with ρ_k = π_k + ε_k. The perturbations satisfy ∑_k ε_k = 0 because the sum of the ρ_k equals the sum of the π_k equals 1. Thus, to second order,
D_KL(Cat(π) ‖ Cat(ρ)) ≈ ½ ∑_k ε_k²/π_k = ½ ∑_k (ρ_k − π_k)²/π_k
i.e. half the Pearson chi-squared divergence.
Similarly, if we keep ρ fixed and consider π as a perturbed version of ρ, we get
D_KL(Cat(π) ‖ Cat(ρ)) ≈ ½ ∑_k (π_k − ρ_k)²/ρ_k
i.e. half the Neyman chi-squared divergence. The Neyman divergence is also known as the reverse Pearson divergence, as it is the Pearson divergence with the roles of the two distributions exchanged.
MLE of the categorical distribution
Data: We observe n samples x_1, …, x_n (one-hot coded). The data matrix of dimension n × K is X = (x_1, …, x_n)^T. It contains each sample x_i^T as a row.
The log-likelihood is
l(π) = ∑_i ∑_k x_ik log π_k = ∑_{k=1}^{K−1} n_k log π_k + n_K log π_K
with class counts n_k = ∑_i x_ik, π_K = 1 − ∑_{k<K} π_k and n_K = n − ∑_{k<K} n_k.
Score function (gradient), for k = 1, …, K − 1:
S(π)_k = ∂l/∂π_k = n_k/π_k − n_K/π_K
Setting S(π) = 0 yields K − 1 equations
n_k/π_k = n_K/π_K
for k = 1, …, K − 1 the solution is π_k = n_k π_K/n_K; summing over all K classes gives π_K = n_K/n.
And thus
π̂_k = n_k/n for all k = 1, …, K
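A tiny numpy illustration of the MLE as relative class frequencies (toy data made up):

```python
import numpy as np

X = np.eye(3)[[0, 1, 1, 2, 1, 0]]    # n = 6 one-hot coded samples, K = 3 classes
n_k = X.sum(axis=0)                  # class counts: [2, 3, 1]
pi_hat = n_k / X.shape[0]            # MLE: relative frequencies [1/3, 1/2, 1/6]
print(pi_hat)
```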
Observed Fisher information
First compute the negative Hessian matrix of the log-likelihood function and then evaluate it at the MLE.
The diagonal entries of the negative Hessian matrix (with k = 1, …, K − 1) are
−∂²l/∂π_k² = n_k/π_k² + n_K/π_K²
and its off-diagonal entries are (with j ≠ k)
−∂²l/∂π_j∂π_k = n_K/π_K²
Thus, the observed Fisher information matrix at the MLE for a categorical distribution is
J_n(π̂) = n [diag(1/π̂_1, …, 1/π̂_{K−1}) + (1/π̂_K) 1 1^T] = n I(π̂)
For K = 2, this reduces to the observed Fisher information of a Bernoulli variable, J_n(π̂_1) = n/(π̂_1(1 − π̂_1)).
The inverse of the observed Fisher information is:
J_n(π̂)⁻¹ = (1/n) [diag(π̂_1, …, π̂_{K−1}) − π̃ π̃^T], with π̃ = (π̂_1, …, π̂_{K−1})^T
The inverse is derived via the Sherman-Morrison identity:
(A + u v^T)⁻¹ = A⁻¹ − (A⁻¹ u v^T A⁻¹)/(1 + v^T A⁻¹ u)
with
  • A = n diag(1/π̂_1, …, 1/π̂_{K−1}) and its inverse A⁻¹ = (1/n) diag(π̂_1, …, π̂_{K−1})
  • u = v = √(n/π̂_K) 1
Then A⁻¹u = (1/√(n π̂_K)) π̃ and 1 + v^T A⁻¹ u = 1 + (1 − π̂_K)/π̂_K = 1/π̂_K. With this
J_n(π̂)⁻¹ = (1/n) diag(π̂_1, …, π̂_{K−1}) − (1/n) π̃ π̃^T
For K = 2, the inverse observed Fisher information of the categorical distribution reduces to that of the Bernoulli distribution, J_n(π̂_1)⁻¹ = π̂_1(1 − π̂_1)/n.
The inverse observed Fisher information is useful, e.g. as the asymptotic variance of the maximum likelihood estimate.
Wald statistic for the categorical distribution
The squared Wald statistic is
t² = (π̂ − π_0)^T J_n(π̂) (π̂ − π_0)
With the observed counts n_k = n π̂_k, where ∑_k n_k = n, and the expected counts n π_0k under the null hypothesis π = π_0, we can write the squared Wald statistic as
t² = ∑_{k=1}^{K} (n_k − n π_0k)²/n_k
This is known as the Neyman chi-squared statistic and is asymptotically distributed as χ² with K − 1 degrees of freedom, because there are K − 1 free parameters in π.
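A small Python sketch of this statistic (the function name neyman_chi2 is my own; it assumes all observed counts are non-zero):

```python
import numpy as np

def neyman_chi2(counts, pi0):
    """Squared Wald statistic: sum over classes of (n_k - n*pi0_k)^2 / n_k."""
    counts = np.asarray(counts, dtype=float)       # observed counts, all > 0
    expected = counts.sum() * np.asarray(pi0)      # expected counts under H0
    return np.sum((counts - expected) ** 2 / counts)

print(neyman_chi2([18, 55, 27], [0.2, 0.5, 0.3]))  # compare to chi^2 with K - 1 = 2 df
```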

5. Further multivariate distributions

5.1 Dirichlet distribution

Univariate case - Beta distribution
x ~ Beta(α, β) on [0, 1], with density f(x) ∝ x^(α−1) (1 − x)^(β−1) and mean E(x) = α/(α + β).
Different shapes: depending on α and β the Beta density can be flat, unimodal, U-shaped or skewed (figure omitted).
Multivariate case - Dirichlet distribution
π ~ Dir(α_1, …, α_K), defined on the simplex (π_k ≥ 0, ∑_k π_k = 1), with density f(π) ∝ ∏_k π_k^(α_k − 1); for K = 2 it reduces to the Beta distribution.

5.2 Wishart distribution

Univariate case - Gamma distribution
Let z_1, …, z_m ~ N(0, σ²) be independent. Then x = ∑_i z_i² is distributed as:
x ~ Gam(α = m/2, θ = 2σ²)
where α is the shape and θ is the scale parameter of the gamma distribution.
The mean and variance of x are: E(x) = αθ = mσ² and Var(x) = αθ² = 2mσ⁴
Useful as the distribution of the sample variance:
Known mean μ: n σ̂² = ∑_i (x_i − μ)² ~ Gam(n/2, 2σ²)
Unknown mean (estimated by x̄): n σ̂²_ML = ∑_i (x_i − x̄)² ~ Gam((n − 1)/2, 2σ²)
Multivariate case - Wishart distribution
Let z_1, …, z_m ~ N_d(0, Σ) be independent. Then W = ∑_i z_i z_i^T (a random matrix) is distributed as:
W ~ W_d(Σ, m)
with mean and variances:
E(W) = m Σ and Var(W_ij) = m (Σ_ij² + Σ_ii Σ_jj)
Useful as the distribution of the sample covariance:
n Σ̂_ML = ∑_i (x_i − x̄)(x_i − x̄)^T ~ W_d(Σ, n − 1)
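A quick scipy check of the Wishart mean (parameters chosen arbitrarily):

```python
import numpy as np
from scipy.stats import wishart

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
n = 50

draws = wishart.rvs(df=n - 1, scale=Sigma, size=5000, random_state=0)
print(draws.mean(axis=0))        # close to (n - 1) * Sigma
print((n - 1) * Sigma)
```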

5.3 Inverse Wishart distribution

Univariate case - Inverse gamma distribution
Let y ~ Inv-Gam(α, β) with shape α and scale β.
Then y has mean E(y) = β/(α − 1) (for α > 1) and variance Var(y) = β²/((α − 1)²(α − 2)) (for α > 2)
Relationship to gamma distribution: if x ~ Gam(α, θ) then 1/x ~ Inv-Gam(α, β = 1/θ)
Multivariate case - Inverse Wishart distribution
Relationship to Wishart: if W ~ W_d(Σ, m) then W⁻¹ follows the inverse Wishart distribution, W⁻¹ ~ W_d⁻¹(Σ⁻¹, m)
 
