
05 Multivariate Dependencies


1. Linear association between two sets of random variables

Aim
We assume a joint correlation matrix
$$P = \begin{pmatrix} P_x & P_{xy} \\ P_{yx} & P_y \end{pmatrix}$$
with cross-correlation matrix $P_{xy} = P_{yx}^T$ and the within-group correlations $P_x$ and $P_y$. To characterise the total association between $\boldsymbol{x}$ and $\boldsymbol{y}$, we are looking for a scalar quantity measuring the divergence of the joint correlation matrix from the uncorrelated case $P_{xy} = 0$.
Special Cases
If $y$ is univariate, this measure should reduce to the squared multiple correlation or coefficient of determination
$$\rho^2_{y|\boldsymbol{x}} = P_{yx} P_x^{-1} P_{xy}$$
which describes the strength of the total linear association between the predictors $\boldsymbol{x}$ and the response $y$.
Note that if the marginal correlations vanish ($P_{xy} = 0$), then $\rho^2_{y|\boldsymbol{x}} = 0$. If the correlation between the predictors vanishes ($P_x = I$), then $\rho^2_{y|\boldsymbol{x}} = P_{yx} P_{xy} = \sum_i \rho_{x_i y}^2$, i.e. the sum of the squared marginal correlations.
If there is only a single predictor $x$ then $P_x = 1$ and $P_{xy} = \rho_{xy}$, and the squared multiple correlation reduces to the squared Pearson correlation $\rho^2_{y|x} = \rho_{xy}^2$.
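To make this concrete, here is a minimal numpy sketch (the function name and example correlation matrix are my own) computing the squared multiple correlation $P_{yx} P_x^{-1} P_{xy}$ from a joint correlation matrix:

```python
import numpy as np

def squared_multiple_correlation(P, idx_y):
    """Squared multiple correlation rho^2 = P_yx P_x^{-1} P_xy of a univariate
    response (at position idx_y) with the remaining predictors, from a joint
    correlation matrix P."""
    idx_x = [i for i in range(P.shape[0]) if i != idx_y]
    P_x  = P[np.ix_(idx_x, idx_x)]   # correlations among the predictors
    P_xy = P[idx_x, idx_y]           # marginal correlations with the response
    return P_xy @ np.linalg.solve(P_x, P_xy)

# Example: two predictors and the response in the last position
P = np.array([[1.0, 0.3, 0.5],
              [0.3, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
print(squared_multiple_correlation(P, idx_y=2))
```

With uncorrelated predictors ($P_x = I$) this returns the sum of the squared marginal correlations, matching the special case noted above.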

2. Canonical Correlation Analysis (CCA)

CCA aims to characterise the linear dependence between two random vectors $\boldsymbol{x}$ (dimension $p$) and $\boldsymbol{y}$ (dimension $q$) by a set of canonical correlations $\lambda_1, \dots, \lambda_m$ with $m = \min(p, q)$.
CCA works by simultaneously whitening the two random vectors, $\tilde{\boldsymbol{x}} = W_x \boldsymbol{x}$ and $\tilde{\boldsymbol{y}} = W_y \boldsymbol{y}$, where the whitening matrices $W_x$ and $W_y$ are chosen in such a way that the cross-correlation matrix between the resulting whitened variables becomes diagonal, and the elements on the diagonal correspond to the canonical correlations.
Cross-correlation between $\tilde{\boldsymbol{x}}$ and $\tilde{\boldsymbol{y}}$:
$$\operatorname{Cor}(\tilde{\boldsymbol{x}}, \tilde{\boldsymbol{y}}) = Q_x K Q_y^T$$
with
$$K = P_x^{-1/2} P_{xy} P_y^{-1/2},$$
where $Q_x$ and $Q_y$ are the orthogonal matrices representing the rotational freedom left in the choice of the whitening matrices $W_x$ and $W_y$.
We can choose suitable orthogonal matrices $Q_x$ and $Q_y$ so that
$$\operatorname{Cor}(\tilde{\boldsymbol{x}}, \tilde{\boldsymbol{y}}) = Q_x K Q_y^T = \Lambda$$
where $\Lambda$ is (rectangular) diagonal, the diagonal elements $\lambda_1, \dots, \lambda_m$ are the canonical correlations, and $m = \min(p, q)$.
To make the cross-correlation matrix diagonal, we can use the singular value decomposition (SVD) of the matrix $K$:
$$K = U \Lambda V^T$$
where $\Lambda$ is the diagonal matrix containing the singular values of $K$. Setting $Q_x = U^T$ and $Q_y = V^T$ yields the required orthogonal matrices and thus the desired whitening matrices $W_x$ and $W_y$.
As a result, $\operatorname{Cor}(\tilde{\boldsymbol{x}}, \tilde{\boldsymbol{y}}) = U^T K V = \Lambda$, i.e. the singular values of $K$ are the desired canonical correlations. Note that the signs of corresponding columns in $U$ and $V$ are not identified. Traditionally, in an SVD the signs are chosen such that the singular values are positive. However, if we instead impose positive-diagonality on $Q_x$ and $Q_y$ (and thus on the whitening matrices), then the canonical correlations on the diagonal of the cross-correlation matrix may take on both positive and negative values.
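The construction translates almost directly into code. Below is a minimal numpy sketch (function names and the example correlation matrix are my own) that computes the canonical correlations as the singular values of $K = P_x^{-1/2} P_{xy} P_y^{-1/2}$, using the conventional non-negative sign choice for the singular values:

```python
import numpy as np

def inv_sqrtm(A):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def canonical_correlations(P, p):
    """Canonical correlations from a joint correlation matrix P, where x is the
    first p variables; computed as the singular values of K = P_x^{-1/2} P_xy P_y^{-1/2}."""
    P_x, P_y, P_xy = P[:p, :p], P[p:, p:], P[:p, p:]
    K = inv_sqrtm(P_x) @ P_xy @ inv_sqrtm(P_y)
    return np.linalg.svd(K, compute_uv=False)   # lambda_1 >= ... >= lambda_m >= 0

# Example: p = 2, q = 2
P = np.array([[1.0, 0.1, 0.4, 0.2],
              [0.1, 1.0, 0.1, 0.3],
              [0.4, 0.1, 1.0, 0.1],
              [0.2, 0.3, 0.1, 1.0]])
print(canonical_correlations(P, p=2))
```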

3. Vector correlation and RV coefficient

3.1 Vector alienation coefficient

The vector alienation coefficient is defined as
$$a(\boldsymbol{x}, \boldsymbol{y}) = \frac{\det(P)}{\det(P_x)\det(P_y)}$$
With $K = P_x^{-1/2} P_{xy} P_y^{-1/2}$, the vector alienation coefficient can be written as
$$a(\boldsymbol{x}, \boldsymbol{y}) = \det(I - K K^T) = \prod_{i=1}^{m} (1 - \lambda_i^2)$$
where the $\lambda_i$ are the singular values of $K$, i.e. the canonical correlations for the pair $\boldsymbol{x}$ and $\boldsymbol{y}$. The vector alienation coefficient is thus a summary statistic of the canonical correlations.
If $P_{xy} = 0$ then $K = 0$ and all $\lambda_i = 0$, thus the vector alienation coefficient $a(\boldsymbol{x}, \boldsymbol{y}) = 1$.

3.2 Rozeboom vector correlation

The squared Rozeboom vector correlation is the complement of the vector alienation coefficient, defined as
$$\rho^2_{\text{Roz}}(\boldsymbol{x}, \boldsymbol{y}) = 1 - a(\boldsymbol{x}, \boldsymbol{y}) = 1 - \prod_{i=1}^{m}(1 - \lambda_i^2)$$
If $P_{xy} = 0$ then $a(\boldsymbol{x}, \boldsymbol{y}) = 1$ and hence $\rho^2_{\text{Roz}}(\boldsymbol{x}, \boldsymbol{y}) = 0$.
If either $p = 1$ or $q = 1$, the squared vector correlation reduces to the corresponding squared multiple correlation, which in turn, for $p = 1$ and $q = 1$, becomes the squared Pearson correlation.
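As a small sketch (function names are my own), both summary statistics follow directly from the vector of canonical correlations, for example as returned by the SVD-based function sketched in the CCA section:

```python
import numpy as np

def vector_alienation(lam):
    """Vector alienation coefficient a(x, y) = prod_i (1 - lambda_i^2)."""
    return np.prod(1.0 - np.asarray(lam) ** 2)

def rozeboom_correlation(lam):
    """Squared Rozeboom vector correlation 1 - a(x, y)."""
    return 1.0 - vector_alienation(lam)

lam = np.array([0.8, 0.3])          # example canonical correlations
print(vector_alienation(lam))       # (1 - 0.64) * (1 - 0.09) = 0.3276
print(rozeboom_correlation(lam))    # 0.6724
```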

3.3 RV coefficient

Another common approach to measure the association between two random vectors is the RV coefficient, defined as
$$RV(\boldsymbol{x}, \boldsymbol{y}) = \frac{\operatorname{tr}(\Sigma_{xy}\Sigma_{yx})}{\sqrt{\operatorname{tr}(\Sigma_x^2)\operatorname{tr}(\Sigma_y^2)}}$$
The main advantage of the RV coefficient is that it is easier to compute than the Rozeboom vector correlation, as it uses the matrix trace rather than the matrix determinant.
For $p = q = 1$ the RV coefficient reduces to the squared Pearson correlation. However, the RV coefficient does not reduce to the squared multiple correlation coefficient for $q = 1$ and $p > 1$, and therefore the RV coefficient cannot be considered a coherent generalisation of the Pearson and multiple correlation to the case when $\boldsymbol{x}$ and $\boldsymbol{y}$ are random vectors.
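A minimal numpy sketch of the RV coefficient as defined above (the function name and the block layout of the covariance matrix are my own):

```python
import numpy as np

def rv_coefficient(S, p):
    """RV coefficient from a joint covariance matrix S (x = first p variables):
    tr(S_xy S_yx) / sqrt(tr(S_x S_x) * tr(S_y S_y))."""
    S_x, S_y, S_xy = S[:p, :p], S[p:, p:], S[:p, p:]
    num = np.trace(S_xy @ S_xy.T)
    den = np.sqrt(np.trace(S_x @ S_x) * np.trace(S_y @ S_y))
    return num / den
```

For $p = q = 1$ this returns $\sigma_{xy}^2 / (\sigma_x^2 \sigma_y^2) = \rho^2$, in line with the reduction to the squared Pearson correlation noted above.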

3.4 Limits of linear models and correlation

Correlation measures only linear dependence
For example, $y = x^2$ is a deterministic function of $x$, yet for $x$ distributed symmetrically around zero the correlation between $x$ and $y$ is almost zero.
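A quick numerical check of this claim (simulation settings chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)      # symmetric around zero
y = x ** 2                        # deterministic, but nonlinear, function of x
print(np.corrcoef(x, y)[0, 1])    # close to 0 despite perfect dependence
```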
Anscombe data sets
Using correlation, and more generally linear models, blindly can easily hide the underlying complexity of the analysed data. A classic example is Anscombe's quartet, shown below.
[Figure: Anscombe's quartet — four scatter plots of data sets with identical linear summary statistics]
Although the relationship between x and y is very different in the four cases, they share exactly the same linear summary statistics (means, variances, correlation, and fitted regression line).
Thus, in actual data analysis, it is always a good idea to inspect the data visually to get a first impression of whether using a linear model makes sense.

4. Mutual information as generalisation of correlation

Mutual information (MI) is a more general way to measure multivariate association than the vector correlation: MI covers not only linear but also non-linear associations.

4.1 Definition

The KL divergence is defined as
$$D_{\text{KL}}(F \,\|\, G) = \operatorname{E}_F\!\left[\log\frac{f(\boldsymbol{x})}{g(\boldsymbol{x})}\right]$$
where $F$ is the reference distribution and $G$ is an approximating distribution, with $f$ and $g$ being the corresponding density functions.
The mutual information (MI) between two random variables $\boldsymbol{x}$ and $\boldsymbol{y}$ is defined as the KL divergence between the corresponding joint distribution and the product distribution:
$$\operatorname{MI}(\boldsymbol{x}, \boldsymbol{y}) = D_{\text{KL}}(F_{x,y} \,\|\, F_x F_y)$$
Thus, MI measures how well the joint distribution can be approximated by the product distribution (which would be the appropriate joint distribution if $\boldsymbol{x}$ and $\boldsymbol{y}$ were independent). $\operatorname{MI}(\boldsymbol{x}, \boldsymbol{y}) = 0$ implies that the joint distribution and the product distribution are the same. Hence the two random variables $\boldsymbol{x}$ and $\boldsymbol{y}$ are independent if the mutual information vanishes.
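To make the definition concrete, here is a small numpy sketch for a discrete joint distribution (the probability table is made up for illustration), where MI is evaluated as the KL divergence between the joint and the product of the marginals:

```python
import numpy as np

# Made-up joint probability table p(x, y) for two binary variables
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x (column vector)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y (row vector)

# MI = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ) = KL(joint || product)
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)   # > 0: the table is not the product of its marginals
```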

4.2 MI between two normal scalar variables

The KL divergence between two multivariate normal distributions $F = N_d(\boldsymbol{\mu}_1, \Sigma_1)$ and $G = N_d(\boldsymbol{\mu}_2, \Sigma_2)$ is
$$D_{\text{KL}}(F \,\|\, G) = \frac{1}{2}\left[(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \Sigma_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \operatorname{tr}(\Sigma_2^{-1}\Sigma_1) - \log\det(\Sigma_2^{-1}\Sigma_1) - d\right]$$
This allows us to compute the mutual information between two univariate random variables $x$ and $y$ that are correlated and assumed to be jointly bivariate normal. Let $\boldsymbol{z} = (x, y)^T$, then $\boldsymbol{\mu} = (\mu_x, \mu_y)^T$ and the covariance matrix is
$$\Sigma = \begin{pmatrix}\sigma_x^2 & \rho\,\sigma_x\sigma_y \\ \rho\,\sigma_x\sigma_y & \sigma_y^2\end{pmatrix}$$
where $\rho = \operatorname{Cor}(x, y)$. If $x$ and $y$ are independent then $\rho = 0$ and
$$\Sigma_{\text{indep}} = \begin{pmatrix}\sigma_x^2 & 0 \\ 0 & \sigma_y^2\end{pmatrix}$$
The product
$$\Sigma_{\text{indep}}^{-1}\Sigma = \begin{pmatrix}1 & \rho\,\sigma_y/\sigma_x \\ \rho\,\sigma_x/\sigma_y & 1\end{pmatrix}$$
has trace $\operatorname{tr}(\Sigma_{\text{indep}}^{-1}\Sigma) = 2$ and determinant $\det(\Sigma_{\text{indep}}^{-1}\Sigma) = 1 - \rho^2$.
With this, the MI between $x$ and $y$ can be computed as
$$\operatorname{MI}(x, y) = D_{\text{KL}}\!\left(N(\boldsymbol{\mu}, \Sigma) \,\|\, N(\boldsymbol{\mu}, \Sigma_{\text{indep}})\right) = \frac{1}{2}\left[2 - \log(1 - \rho^2) - 2\right] = -\frac{1}{2}\log(1 - \rho^2)$$
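This result is easy to verify numerically. The sketch below (example values chosen arbitrarily) evaluates the general normal KL formula and compares it with $-\tfrac{1}{2}\log(1-\rho^2)$:

```python
import numpy as np

def kl_normal(mu1, S1, mu2, S2):
    """KL divergence D_KL( N(mu1, S1) || N(mu2, S2) )."""
    d = len(mu1)
    S2inv = np.linalg.inv(S2)
    diff = np.asarray(mu2) - np.asarray(mu1)
    return 0.5 * (diff @ S2inv @ diff + np.trace(S2inv @ S1)
                  - np.log(np.linalg.det(S2inv @ S1)) - d)

rho, sx, sy = 0.7, 1.5, 0.8
S       = np.array([[sx**2, rho * sx * sy],
                    [rho * sx * sy, sy**2]])
S_indep = np.diag([sx**2, sy**2])
mu = np.zeros(2)

print(kl_normal(mu, S, mu, S_indep))   # MI via the general KL formula
print(-0.5 * np.log(1 - rho**2))       # identical: -1/2 log(1 - rho^2)
```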

4.3 MI between two normally distributed random vectors

Let $\boldsymbol{z} = (\boldsymbol{x}^T, \boldsymbol{y}^T)^T$ with dimension $d = p + q$. The joint multivariate normal distribution is characterised by the mean $\boldsymbol{\mu} = (\boldsymbol{\mu}_x^T, \boldsymbol{\mu}_y^T)^T$ and the covariance matrix
$$\Sigma = \begin{pmatrix}\Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y\end{pmatrix}$$
If $\boldsymbol{x}$ and $\boldsymbol{y}$ are independent then $\Sigma_{xy} = 0$ and
$$\Sigma_{\text{indep}} = \begin{pmatrix}\Sigma_x & 0 \\ 0 & \Sigma_y\end{pmatrix}$$
The product
$$\Sigma_{\text{indep}}^{-1}\Sigma = \begin{pmatrix}I_p & \Sigma_x^{-1}\Sigma_{xy} \\ \Sigma_y^{-1}\Sigma_{yx} & I_q\end{pmatrix}$$
has trace $\operatorname{tr}(\Sigma_{\text{indep}}^{-1}\Sigma) = d = p + q$ and determinant
$$\det(\Sigma_{\text{indep}}^{-1}\Sigma) = \frac{\det(\Sigma)}{\det(\Sigma_x)\det(\Sigma_y)} = \det(I - K K^T)$$
with $K = \Sigma_x^{-1/2}\Sigma_{xy}\Sigma_y^{-1/2}$. With $\lambda_1, \dots, \lambda_m$, the singular values of $K$ (i.e. the canonical correlations between $\boldsymbol{x}$ and $\boldsymbol{y}$), we get
$$\det(\Sigma_{\text{indep}}^{-1}\Sigma) = \prod_{i=1}^{m}(1 - \lambda_i^2)$$
The mutual information between $\boldsymbol{x}$ and $\boldsymbol{y}$ is then
$$\operatorname{MI}(\boldsymbol{x}, \boldsymbol{y}) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_{\text{indep}}^{-1}\Sigma) - \log\det(\Sigma_{\text{indep}}^{-1}\Sigma) - d\right] = -\frac{1}{2}\sum_{i=1}^{m}\log(1 - \lambda_i^2)$$
By comparison with the squared Rozeboom vector correlation $\rho^2_{\text{Roz}}(\boldsymbol{x}, \boldsymbol{y}) = 1 - \prod_{i=1}^{m}(1 - \lambda_i^2)$, we recognise that
$$\operatorname{MI}(\boldsymbol{x}, \boldsymbol{y}) = -\frac{1}{2}\log\left(1 - \rho^2_{\text{Roz}}(\boldsymbol{x}, \boldsymbol{y})\right)$$
Thus, the Rozeboom vector correlation is directly linked to mutual information for jointly multivariate normally distributed variables.
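The identity can also be checked numerically. The following sketch (the block indexing convention and example covariance matrix are my own) computes the MI of a jointly normal pair of random vectors both from the log-determinant form and from the canonical correlations:

```python
import numpy as np

def inv_sqrtm(A):
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# Made-up joint covariance of z = (x, y) with p = 2, q = 2
p = 2
S = np.array([[1.0, 0.1, 0.4, 0.2],
              [0.1, 1.0, 0.1, 0.3],
              [0.4, 0.1, 1.0, 0.1],
              [0.2, 0.3, 0.1, 1.0]])
S_x, S_y, S_xy = S[:p, :p], S[p:, p:], S[:p, p:]

# MI from the log-determinant form: -1/2 log( det(S) / (det(S_x) det(S_y)) )
mi_logdet = -0.5 * np.log(np.linalg.det(S) / (np.linalg.det(S_x) * np.linalg.det(S_y)))

# MI from the canonical correlations (singular values of K)
lam = np.linalg.svd(inv_sqrtm(S_x) @ S_xy @ inv_sqrtm(S_y), compute_uv=False)
mi_cca = -0.5 * np.sum(np.log(1.0 - lam ** 2))

print(mi_logdet, mi_cca)   # agree up to numerical error
```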

4.4 MI for variable selection

Intriguingly, the expected KL divergence between the conditional and the marginal distribution,
$$\operatorname{E}_{F_x}\!\left[D_{\text{KL}}(F_{y|x} \,\|\, F_y)\right],$$
is equal to the mutual information between $\boldsymbol{x}$ and $\boldsymbol{y}$. Thus MI measures the impact of conditioning on $\boldsymbol{x}$. If the MI is small (close to zero), then $\boldsymbol{x}$ is not useful in predicting $\boldsymbol{y}$. Verification:
since
$$f(\boldsymbol{y}|\boldsymbol{x}) = \frac{f(\boldsymbol{x}, \boldsymbol{y})}{f(\boldsymbol{x})},$$
thus
$$\operatorname{E}_{F_x}\!\left[D_{\text{KL}}(F_{y|x} \,\|\, F_y)\right] = \int f(\boldsymbol{x}) \int f(\boldsymbol{y}|\boldsymbol{x}) \log\frac{f(\boldsymbol{y}|\boldsymbol{x})}{f(\boldsymbol{y})}\,d\boldsymbol{y}\,d\boldsymbol{x} = \int\!\!\int f(\boldsymbol{x}, \boldsymbol{y}) \log\frac{f(\boldsymbol{x}, \boldsymbol{y})}{f(\boldsymbol{x})f(\boldsymbol{y})}\,d\boldsymbol{y}\,d\boldsymbol{x} = \operatorname{MI}(\boldsymbol{x}, \boldsymbol{y}).$$
Because of this link of MI with conditioning, the MI between response and predictor variables is often used for variable and feature selection in general models.
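As a toy illustration of MI-based screening (the data-generating model and the Gaussian plug-in estimate $-\tfrac{1}{2}\log(1-r_i^2)$ are my own choices; under joint normality this is equivalent to ranking predictors by squared correlation with the response):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 3))                               # three candidate predictors
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)    # third predictor carries no signal

# Gaussian plug-in estimate of the MI of each predictor with the response:
# MI_i = -1/2 log(1 - r_i^2), with r_i the sample correlation
r  = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
mi = -0.5 * np.log(1.0 - r ** 2)
print(np.argsort(mi)[::-1])   # predictors ranked by MI; the uninformative one comes last
```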

Summary

In principle, MI can be computed for any distribution and model and thus applies to both normal and non-normal models, and to both linear and nonlinear relationships.