Motivation: Transformations are very important since they either turn simple distributions into more complex distributions or allow complex models to be simplified. In machine learning, invertible mappings or transformations of probability distributions are known as "normalising flows".
TOC
1. Linear Transformations
1.1 Location-scale transformation
1.2 Squared multiple correlation
1.3 Invertible location-scale transformation
1.4 Transformation of a density under an invertible location-scale transformation
2. Nonlinear transformations
2.1 General transformation
2.2 Delta method
2.3 Transformation of a pdf under a general invertible transformation
3. General whitening transformations
3.1 Overview
3.2 Whitening transformation and whitening constraint
3.3 Parameterisation of whitening matrix
3.4 Cross-covariance and cross-correlation for general whitening transformations
3.5 Inverse whitening transformation and loadings
3.6 Summaries of Phi and Psi resulting from Whitening transformations
4. Natural whitening procedures
4.1 ZCA whitening
4.2 ZCA-Cor Whitening
4.3 PCA whitening
4.4 PCA-cor whitening
4.5 Cholesky whitening
4.6 Comparison of whitening procedures
Summary
5. Principal Component Analysis (PCA)
5.1 PCA transformation
5.2 Application to data
5.3 Iris flower data example
5.4 PCA correlation loadings
5.5 PCA correlation loadings plot
1. Linear Transformations
1.1 Location-scale transformation
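Also known as an affine transformation:
$$\boldsymbol y = \boldsymbol a + \boldsymbol B \boldsymbol x$$
where:
- $\boldsymbol x$: $d \times 1$ random vector
- $\boldsymbol a$: $m \times 1$ vector, location parameter
- $\boldsymbol B$: $m \times d$ matrix, scale parameter, $m \geq 1$
- $\boldsymbol y$: $m \times 1$ random vector
Mean and variance of the original vector $\boldsymbol x$:
$$\text{E}(\boldsymbol x) = \boldsymbol \mu_x, \qquad \text{Var}(\boldsymbol x) = \boldsymbol \Sigma_x$$
Mean and variance of the transformed random vector $\boldsymbol y$:
$$\text{E}(\boldsymbol y) = \boldsymbol a + \boldsymbol B \boldsymbol \mu_x, \qquad \text{Var}(\boldsymbol y) = \boldsymbol B \boldsymbol \Sigma_x \boldsymbol B^T$$
Cross-covariance between $\boldsymbol x$ and $\boldsymbol y$ (a $d \times m$ matrix):
$$\text{Cov}(\boldsymbol x, \boldsymbol y) = \boldsymbol \Sigma_x \boldsymbol B^T$$
Cross-correlation between $\boldsymbol x$ and $\boldsymbol y$:
$$\text{Cor}(\boldsymbol x, \boldsymbol y) = \boldsymbol V_x^{-1/2} \boldsymbol \Sigma_x \boldsymbol B^T \boldsymbol V_y^{-1/2}$$
where $\boldsymbol V_x = \text{Diag}(\boldsymbol \Sigma_x)$ and $\boldsymbol V_y = \text{Diag}(\boldsymbol B \boldsymbol \Sigma_x \boldsymbol B^T)$ are diagonal matrices containing the variances for the components of $\boldsymbol x$ and $\boldsymbol y$. The dimensions of the matrix $\text{Cor}(\boldsymbol x, \boldsymbol y)$ are also $d \times m$.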
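Examples
- Univariate case ($d=1$, $m=1$): $y = a + b x$ with $\text{E}(y) = a + b \mu$ and $\text{Var}(y) = b^2 \sigma^2$
- Sum of two random univariate variables: $y = x_1 + x_2$, i.e. $\boldsymbol a = 0$ and $\boldsymbol B = (1, 1)$
- $\text{E}(x_1 + x_2) = \mu_1 + \mu_2$, $\text{Var}(x_1 + x_2) = \sigma_1^2 + \sigma_2^2 + 2 \sigma_{12}$
note that $\sigma_{12} = \text{Cov}(x_1, x_2)$, so the variance of the sum equals the sum of the variances only if $x_1$ and $x_2$ are uncorrelated.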
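The moment rules for a location-scale transformation, $\text{E}(\boldsymbol y) = \boldsymbol a + \boldsymbol B \boldsymbol \mu_x$, $\text{Var}(\boldsymbol y) = \boldsymbol B \boldsymbol \Sigma_x \boldsymbol B^T$ and $\text{Cov}(\boldsymbol x, \boldsymbol y) = \boldsymbol \Sigma_x \boldsymbol B^T$, can be checked numerically. The following is a minimal NumPy sketch for the sum of two variables; the values of `mu_x` and `Sigma_x` are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical moments of x = (x1, x2); not taken from the text.
mu_x = np.array([1.0, 2.0])
Sigma_x = np.array([[2.0, 0.5],
                    [0.5, 1.0]])

# Sum of two variables: y = x1 + x2, i.e. a = 0 and B = (1, 1).
a = np.array([0.0])
B = np.array([[1.0, 1.0]])

mean_y = a + B @ mu_x        # E(y) = a + B mu_x
var_y = B @ Sigma_x @ B.T    # Var(y) = B Sigma_x B^T
cov_xy = Sigma_x @ B.T       # Cov(x, y) = Sigma_x B^T, a d x m matrix

print(mean_y)   # [3.]
print(var_y)    # [[4.]]  = 2 + 1 + 2 * 0.5
```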
1.2 Squared multiple correlation
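Squared multiple correlation $\text{Cor}(y, \boldsymbol x)^2$ is a scalar measure summarising the linear association between a scalar response variable $y$ and a set of predictors $\boldsymbol x = (x_1, \ldots, x_d)^T$. It is defined as
$$\text{Cor}(y, \boldsymbol x)^2 = \boldsymbol \Sigma_{y \boldsymbol x} \boldsymbol \Sigma_{\boldsymbol x \boldsymbol x}^{-1} \boldsymbol \Sigma_{\boldsymbol x y} / \sigma_y^2$$
If $y$ can be perfectly linearly predicted by $\boldsymbol x$ then $\text{Cor}(y, \boldsymbol x)^2 = 1$.
The empirical estimate of $\text{Cor}(y, \boldsymbol x)^2$ is the $R^2$ coefficient.
Squared multiple correlation for affine transformation
Since we linearly transform $\boldsymbol x$ into $\boldsymbol y$ with no additional error involved we expect that for each component $y_i$ in $\boldsymbol y$, we have $\text{Cor}(y_i, \boldsymbol x)^2 = 1$. This can be shown directly. Writing $\boldsymbol b_i^T$ for the $i$-th row of $\boldsymbol B$, so that $\text{Cov}(y_i, \boldsymbol x) = \boldsymbol b_i^T \boldsymbol \Sigma_x$ and $\text{Var}(y_i) = \boldsymbol b_i^T \boldsymbol \Sigma_x \boldsymbol b_i$, we compute:
$$\text{Cor}(y_i, \boldsymbol x)^2 = \frac{\boldsymbol b_i^T \boldsymbol \Sigma_x \boldsymbol \Sigma_x^{-1} \boldsymbol \Sigma_x \boldsymbol b_i}{\boldsymbol b_i^T \boldsymbol \Sigma_x \boldsymbol b_i} = 1$$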
1.3 Invertible location-scale transformation
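If $m = d$ (square $\boldsymbol B$) and $\det(\boldsymbol B) \neq 0$ then the affine transformation is invertible.
Forward transformation:
$$\boldsymbol y = \boldsymbol a + \boldsymbol B \boldsymbol x$$
Back transformation:
$$\boldsymbol x = \boldsymbol B^{-1} (\boldsymbol y - \boldsymbol a)$$
Invertible transformations thus provide a one-to-one map between $\boldsymbol x$ and $\boldsymbol y$.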
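Mahalanobis transform
Assume $\text{E}(\boldsymbol x) = \boldsymbol \mu$ and a positive definite covariance matrix $\text{Var}(\boldsymbol x) = \boldsymbol \Sigma$ with $\det(\boldsymbol \Sigma) > 0$.
The Mahalanobis transformation is given by:
$$\boldsymbol y = \boldsymbol \Sigma^{-1/2} (\boldsymbol x - \boldsymbol \mu)$$
This corresponds to an affine transformation with $\boldsymbol a = -\boldsymbol \Sigma^{-1/2} \boldsymbol \mu$ and $\boldsymbol B = \boldsymbol \Sigma^{-1/2}$.
The inverse principal matrix square root $\boldsymbol \Sigma^{-1/2}$ can be computed by eigendecomposition: if $\boldsymbol \Sigma = \boldsymbol U \boldsymbol \Lambda \boldsymbol U^T$ then $\boldsymbol \Sigma^{-1/2} = \boldsymbol U \boldsymbol \Lambda^{-1/2} \boldsymbol U^T$.
The mean and variance of $\boldsymbol y$ become:
$$\text{E}(\boldsymbol y) = \boldsymbol 0$$
and
$$\text{Var}(\boldsymbol y) = \boldsymbol \Sigma^{-1/2} \boldsymbol \Sigma \boldsymbol \Sigma^{-1/2} = \boldsymbol I_d$$
The Mahalanobis transform performs three functions:
- Centering ($-\boldsymbol \mu$)
- Standardisation ($\text{Var}(y_i) = 1$)
- Decorrelation ($\text{Cor}(y_i, y_j) = 0$ for $i \neq j$)
In the univariate case ($d=1$), the coefficients reduce to $a = -\mu/\sigma$ and $B = 1/\sigma$ and the Mahalanobis transform becomes
$$y = \frac{x - \mu}{\sigma}$$
which is centering + standardisation.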
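The eigendecomposition route to the inverse principal square root can be sketched in NumPy as follows. The function names `inv_sqrt` and `mahalanobis_transform` and the example values of `mu` and `Sigma` are assumptions made for illustration, not part of the notes.

```python
import numpy as np

def inv_sqrt(Sigma):
    """Inverse principal square root of a symmetric positive definite matrix."""
    lam, U = np.linalg.eigh(Sigma)            # Sigma = U diag(lam) U^T
    return U @ np.diag(lam ** -0.5) @ U.T

def mahalanobis_transform(X, mu, Sigma):
    """Apply y = Sigma^{-1/2} (x - mu) to every row x of X."""
    return (X - mu) @ inv_sqrt(Sigma).T

# Hypothetical mean and covariance, chosen only for illustration.
mu = np.array([1.0, -1.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

W = inv_sqrt(Sigma)
var_y = W @ Sigma @ W.T    # should equal the identity up to rounding
print(np.round(var_y, 10))
```

Using `eigh` rather than a general eigensolver relies on the covariance matrix being symmetric, which also guarantees real eigenvalues and orthogonal eigenvectors.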
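Inverse Mahalanobis transformation
The inverse of the Mahalanobis transform is given by
$$\boldsymbol y = \boldsymbol \mu + \boldsymbol \Sigma^{1/2} \boldsymbol x$$
As the Mahalanobis transform is a whitening transform, the inverse Mahalanobis transform is sometimes called the Mahalanobis colouring transformation. The coefficients in the affine transformation are $\boldsymbol a = \boldsymbol \mu$ and $\boldsymbol B = \boldsymbol \Sigma^{1/2}$.
Starting with $\text{E}(\boldsymbol x) = \boldsymbol 0$ and $\text{Var}(\boldsymbol x) = \boldsymbol I_d$ the mean and variance of the transformed variable are
$$\text{E}(\boldsymbol y) = \boldsymbol \mu, \qquad \text{Var}(\boldsymbol y) = \boldsymbol \Sigma^{1/2} \boldsymbol I_d \boldsymbol \Sigma^{1/2} = \boldsymbol \Sigma$$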
1.4 Transformation of a density under an invertible location-scale transformation
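Assume $\boldsymbol x \sim F_{\boldsymbol x}$ with density $f_{\boldsymbol x}(\boldsymbol x)$.
After the linear transformation $\boldsymbol y = \boldsymbol a + \boldsymbol B \boldsymbol x$ we get $\boldsymbol y \sim F_{\boldsymbol y}$ with density
$$f_{\boldsymbol y}(\boldsymbol y) = |\det(\boldsymbol B)|^{-1} f_{\boldsymbol x}\left(\boldsymbol B^{-1}(\boldsymbol y - \boldsymbol a)\right)$$
Example
Transformation of standard normal with inverse Mahalanobis transform
Assume $\boldsymbol x$ is multivariate standard normal, $\boldsymbol x \sim N_d(\boldsymbol 0, \boldsymbol I_d)$, with density
$$f_{\boldsymbol x}(\boldsymbol x) = (2\pi)^{-d/2} \exp\left(-\frac{1}{2} \boldsymbol x^T \boldsymbol x\right)$$
Then the density after applying the inverse Mahalanobis transform $\boldsymbol y = \boldsymbol \mu + \boldsymbol \Sigma^{1/2} \boldsymbol x$ is
$$f_{\boldsymbol y}(\boldsymbol y) = \det(\boldsymbol \Sigma)^{-1/2} (2\pi)^{-d/2} \exp\left(-\frac{1}{2} (\boldsymbol y - \boldsymbol \mu)^T \boldsymbol \Sigma^{-1} (\boldsymbol y - \boldsymbol \mu)\right)$$
$\Rightarrow$ $\boldsymbol y$ has multivariate normal density $N_d(\boldsymbol \mu, \boldsymbol \Sigma)$.
Application: e.g. random number generation: draw from $N_d(\boldsymbol 0, \boldsymbol I_d)$, then convert to multivariate normal by this transformation.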
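The random number generation application can be sketched as follows: standard normal draws are coloured with $\boldsymbol y = \boldsymbol \mu + \boldsymbol \Sigma^{1/2} \boldsymbol x$. The seed, sample size and the values of `mu` and `Sigma` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility

# Hypothetical target mean and covariance.
mu = np.array([1.0, -1.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

# Principal square root Sigma^{1/2} via eigendecomposition.
lam, U = np.linalg.eigh(Sigma)
Sigma_sqrt = U @ np.diag(np.sqrt(lam)) @ U.T

n = 200_000
Z = rng.standard_normal((n, 2))   # rows are draws from N(0, I)
Y = mu + Z @ Sigma_sqrt.T         # rows are draws from N(mu, Sigma)

sample_mean = Y.mean(axis=0)
sample_cov = np.cov(Y, rowvar=False)
print(np.round(sample_mean, 2))
print(np.round(sample_cov, 2))
```

Any other matrix $\boldsymbol B$ with $\boldsymbol B \boldsymbol B^T = \boldsymbol \Sigma$ (for instance a Cholesky factor) would produce draws with the same distribution; the principal square root is used here because it matches the inverse Mahalanobis transform.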
2. Nonlinear transformations
2.1 General transformation
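$$\boldsymbol y = \boldsymbol h(\boldsymbol x)$$
with $\boldsymbol h$ an arbitrary vector-valued function. In the linear case, that is $\boldsymbol h(\boldsymbol x) = \boldsymbol a + \boldsymbol B \boldsymbol x$.
2.2 Delta method
In general, for a transformation $\boldsymbol y = \boldsymbol h(\boldsymbol x)$ the exact mean and variance of the transformed variable cannot be obtained analytically. In particular, $\text{E}(\boldsymbol h(\boldsymbol x)) = \boldsymbol h(\text{E}(\boldsymbol x))$ does not hold in general, and there is no general closed-form rule for $\text{Var}(\boldsymbol h(\boldsymbol x))$.
However, we can find a linear approximation and then compute its mean and variance. This approximation method is called the "Delta method".
Linearisation of $\boldsymbol h(\boldsymbol x)$ is achieved by a Taylor series approximation of first order of $\boldsymbol h(\boldsymbol x)$ around $\boldsymbol x_0$:
$$\boldsymbol h(\boldsymbol x) \approx \boldsymbol h(\boldsymbol x_0) + D\boldsymbol h(\boldsymbol x_0) (\boldsymbol x - \boldsymbol x_0)$$
If $h(\boldsymbol x)$ is scalar-valued then the gradient is given by the vector of partial derivatives
$$\nabla h(\boldsymbol x) = \left(\frac{\partial h(\boldsymbol x)}{\partial x_1}, \ldots, \frac{\partial h(\boldsymbol x)}{\partial x_d}\right)^T$$
where $\nabla = \left(\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_d}\right)^T$ is the nabla operator.
The Jacobian matrix is the generalisation of the gradient if $\boldsymbol h(\boldsymbol x)$ is vector-valued:
$$D\boldsymbol h(\boldsymbol x) = \left(\frac{\partial h_i(\boldsymbol x)}{\partial x_j}\right)$$
with one row $\nabla h_i(\boldsymbol x)^T$ per component $h_i$.
With $\text{E}(\boldsymbol x) = \boldsymbol \mu$ and $\text{Var}(\boldsymbol x) = \boldsymbol \Sigma$, the linear approximation around $\boldsymbol x_0 = \boldsymbol \mu$ has $\boldsymbol a = \boldsymbol h(\boldsymbol \mu) - D\boldsymbol h(\boldsymbol \mu) \boldsymbol \mu$ and $\boldsymbol B = D\boldsymbol h(\boldsymbol \mu)$, which leads directly to the multivariate Delta method:
$$\text{E}(\boldsymbol y) \approx \boldsymbol h(\boldsymbol \mu), \qquad \text{Var}(\boldsymbol y) \approx D\boldsymbol h(\boldsymbol \mu) \, \boldsymbol \Sigma \, D\boldsymbol h(\boldsymbol \mu)^T$$
The univariate Delta method is a special case:
$$\text{E}(y) \approx h(\mu), \qquad \text{Var}(y) \approx \sigma^2 h'(\mu)^2$$
Note that the Delta approximation breaks down if the approximate $\text{Var}(\boldsymbol y)$ is singular, for example if the first derivative (or gradient or Jacobian matrix) at $\boldsymbol \mu$ is zero.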
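Example: Variance of the odds ratio
The proportion $\hat p = n_1/n$ resulting from $n$ repeats of a Bernoulli experiment has expectation $\text{E}(\hat p) = p$ and variance $\text{Var}(\hat p) = p(1-p)/n$. The approximate mean and variance of the corresponding odds ratio $\widehat{OR} = \hat p/(1-\hat p)$ are obtained as follows.
Since $h(x) = x/(1-x)$ and $h'(x) = 1/(1-x)^2$, we get $\text{E}(\widehat{OR}) \approx \frac{p}{1-p}$ and $\text{Var}(\widehat{OR}) \approx \frac{p}{n (1-p)^3}$.
Example: Log-transform as variance stabilisation
Assume $x$ has some mean $\text{E}(x) = \mu$ and variance $\text{Var}(x) = \sigma^2 \mu^2$, i.e. the standard deviation $\text{SD}(x)$ is proportional to the mean $\mu$. The mean and variance of the log-transformed variable $\log(x)$ are obtained from $h(x) = \log(x)$ and $h'(x) = 1/x$.
Using the Delta method: $\text{E}(\log(x)) \approx \log(\mu)$ and $\text{Var}(\log(x)) \approx \sigma^2 \mu^2 / \mu^2 = \sigma^2$.
Thus, after applying the log-transform the variance no longer depends on the mean.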
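Both examples follow the same recipe, the univariate Delta method $\text{E}(h(x)) \approx h(\mu)$ and $\text{Var}(h(x)) \approx h'(\mu)^2 \, \text{Var}(x)$. A minimal sketch in plain Python is given below; the helper name `delta_method` and the numerical values of `p`, `n`, `mu` and `sigma` are assumptions chosen for illustration.

```python
import math

def delta_method(h, h_prime, mean, var):
    """First-order (Delta method) approximation of E(h(x)) and Var(h(x))."""
    return h(mean), h_prime(mean) ** 2 * var

# Odds p / (1 - p) of an estimated proportion (illustrative values).
p, n = 0.5, 100
odds_mean, odds_var = delta_method(
    lambda x: x / (1.0 - x),
    lambda x: 1.0 / (1.0 - x) ** 2,
    mean=p,
    var=p * (1.0 - p) / n,
)
odds_var_closed_form = p / (n * (1.0 - p) ** 3)

# Log-transform when Var(x) = sigma^2 mu^2 (illustrative values).
mu, sigma = 10.0, 0.2
log_mean, log_var = delta_method(
    math.log,
    lambda x: 1.0 / x,
    mean=mu,
    var=sigma ** 2 * mu ** 2,
)

print(odds_mean, odds_var, odds_var_closed_form)
print(log_mean, log_var, sigma ** 2)
```

In the second case the approximate variance matches `sigma ** 2` whatever value of `mu` is used, which is the variance-stabilising property.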
2.3 Transformation of a pdf under a general invertible transformation
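Assume $\boldsymbol h(\boldsymbol x)$ is invertible: $\boldsymbol h^{-1}(\boldsymbol y) = \boldsymbol x$, and $\boldsymbol x \sim F_{\boldsymbol x}$ with probability density function $f_{\boldsymbol x}(\boldsymbol x)$.
The density of the transformed random vector $\boldsymbol y$ is then given by
$$f_{\boldsymbol y}(\boldsymbol y) = |\det\left(D\boldsymbol h^{-1}(\boldsymbol y)\right)| \, f_{\boldsymbol x}\left(\boldsymbol h^{-1}(\boldsymbol y)\right)$$
where $D\boldsymbol h^{-1}(\boldsymbol y)$ is the Jacobian matrix of the inverse transformation.
Special cases:
- Univariate version: $f_y(y) = \left|\frac{d h^{-1}(y)}{dy}\right| \, f_x\left(h^{-1}(y)\right)$
- Linear transformation $\boldsymbol h(\boldsymbol x) = \boldsymbol a + \boldsymbol B \boldsymbol x$, with $\boldsymbol h^{-1}(\boldsymbol y) = \boldsymbol B^{-1}(\boldsymbol y - \boldsymbol a)$ and $D\boldsymbol h^{-1}(\boldsymbol y) = \boldsymbol B^{-1}$: $f_{\boldsymbol y}(\boldsymbol y) = |\det(\boldsymbol B)|^{-1} f_{\boldsymbol x}\left(\boldsymbol B^{-1}(\boldsymbol y - \boldsymbol a)\right)$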
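As a small illustration of the univariate change-of-variables rule $f_y(y) = |d h^{-1}(y)/dy| \, f_x(h^{-1}(y))$, the sketch below transforms a standard normal variable with $h(x) = \exp(x)$ and compares the result with the log-normal density written out directly. The choice of $h$ and the function names are assumptions made for this example only.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def transformed_pdf(y, f_x, h_inv, h_inv_prime):
    """Density of y = h(x): |d h^{-1}(y) / dy| * f_x(h^{-1}(y))."""
    return abs(h_inv_prime(y)) * f_x(h_inv(y))

def lognormal_pdf(y):
    """Standard log-normal density, written out directly for comparison."""
    return math.exp(-0.5 * math.log(y) ** 2) / (y * math.sqrt(2.0 * math.pi))

# y = exp(x), so h^{-1}(y) = log(y) with derivative 1 / y.
for y in (0.5, 1.0, 2.0):
    value = transformed_pdf(y, normal_pdf, math.log, lambda t: 1.0 / t)
    print(y, value, lognormal_pdf(y))
```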
3. General whitening transformations
3.1 Overview
Whitening transformations are a special and widely used class of invertible location-scale transformations.
Terminology: whitening refers to the fact that after the transformation the covariance matrix is spherical, isotropic, white ($\text{Var}(\boldsymbol z) = \boldsymbol I_d$).
Whitening is useful in preprocessing, as it allows multivariate models to be turned into uncorrelated univariate models (via the decorrelation property). Some whitening transformations reduce the dimension in an optimal way (via the compression property).
The Mahalanobis transform is a specific example of a whitening transformation. It is also known as "zero-phase component analysis", or ZCA transform for short.
In latent variable models, whitening procedures link observed (correlated) variables and latent variables (which typically are uncorrelated and standardised): whitening maps the observed $\boldsymbol x$ to the latent $\boldsymbol z$, and colouring maps $\boldsymbol z$ back to $\boldsymbol x$.
3.2 Whitening transformation and whitening constraint
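Random vector $\boldsymbol x \sim F_{\boldsymbol x}$, not necessarily from a multivariate normal.
$\boldsymbol x$ has mean $\text{E}(\boldsymbol x) = \boldsymbol \mu$ and a positive definite (invertible) covariance matrix $\text{Var}(\boldsymbol x) = \boldsymbol \Sigma$.
The covariance can be split into positive variances $\boldsymbol V$ and a positive definite invertible correlation matrix $\boldsymbol P$ so that $\boldsymbol \Sigma = \boldsymbol V^{1/2} \boldsymbol P \boldsymbol V^{1/2}$.
Whitening transformation:
$$\boldsymbol z = \boldsymbol W \boldsymbol x$$
Objective: choose $\boldsymbol W$ so that $\text{Var}(\boldsymbol z) = \boldsymbol I_d$.
For Mahalanobis/ZCA whitening we already know that $\boldsymbol W^{\text{ZCA}} = \boldsymbol \Sigma^{-1/2}$.
In general, the whitening matrix $\boldsymbol W$ needs to satisfy a constraint:
$$\text{Var}(\boldsymbol z) = \boldsymbol I_d \;\Longrightarrow\; \boldsymbol W \boldsymbol \Sigma \boldsymbol W^T = \boldsymbol I_d \;\Longrightarrow\; \boldsymbol W \boldsymbol \Sigma \boldsymbol W^T \boldsymbol W = \boldsymbol W \;\Longrightarrow\; \boldsymbol W^T \boldsymbol W = \boldsymbol \Sigma^{-1}$$
Clearly, the ZCA whitening matrix satisfies this constraint:
$$(\boldsymbol W^{\text{ZCA}})^T \boldsymbol W^{\text{ZCA}} = \boldsymbol \Sigma^{-1/2} \boldsymbol \Sigma^{-1/2} = \boldsymbol \Sigma^{-1}$$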
3.3 Parameterisation of whitening matrix
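Covariance-based parameterisation of whitening matrix:
A general way to specify a valid whitening matrix is
$$\boldsymbol W = \boldsymbol Q_1 \boldsymbol \Sigma^{-1/2}$$
where $\boldsymbol Q_1$ is an orthogonal matrix.
Recall that an orthogonal matrix $\boldsymbol Q$ has the property that $\boldsymbol Q^{-1} = \boldsymbol Q^T$ and as a consequence $\boldsymbol Q^T \boldsymbol Q = \boldsymbol Q \boldsymbol Q^T = \boldsymbol I$, thus
$$\boldsymbol W^T \boldsymbol W = \boldsymbol \Sigma^{-1/2} \boldsymbol Q_1^T \boldsymbol Q_1 \boldsymbol \Sigma^{-1/2} = \boldsymbol \Sigma^{-1}$$
The converse is also true: any whitening matrix, i.e. any $\boldsymbol W$ satisfying the whitening constraint, can be written in the above form as $\boldsymbol Q_1 = \boldsymbol W \boldsymbol \Sigma^{1/2}$ is orthogonal by construction.
$\Rightarrow$ Instead of choosing $\boldsymbol W$, we choose the orthogonal matrix $\boldsymbol Q_1$.
Note:
- recall that orthogonal matrices geometrically represent rotations (plus reflections)
- thus there are infinitely many whitening procedures, because there are infinitely many rotations, and we need to find ways to choose among whitening procedures.
- For the Mahalanobis/ZCA transformation $\boldsymbol Q_1^{\text{ZCA}} = \boldsymbol I$
- whitening can be interpreted as a Mahalanobis transformation followed by a further rotation-reflection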
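The claim that every orthogonal matrix yields a valid whitening matrix of the form $\boldsymbol W = \boldsymbol Q_1 \boldsymbol \Sigma^{-1/2}$ can be checked numerically. In this sketch the covariance matrix, the seed and the helper names `inv_sqrt` and `random_orthogonal` are assumptions chosen for illustration; the orthogonal matrices are obtained from a QR decomposition of a Gaussian matrix.

```python
import numpy as np

def inv_sqrt(Sigma):
    """Inverse principal square root via eigendecomposition."""
    lam, U = np.linalg.eigh(Sigma)
    return U @ np.diag(lam ** -0.5) @ U.T

def random_orthogonal(d, rng):
    """An orthogonal d x d matrix from the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

rng = np.random.default_rng(0)

# Hypothetical positive definite covariance matrix.
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 1.0, 0.3],
                  [0.5, 0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

checks = []
for _ in range(3):
    Q1 = random_orthogonal(3, rng)
    W = Q1 @ inv_sqrt(Sigma)
    # Whitening constraint W^T W = Sigma^{-1}, equivalently W Sigma W^T = I.
    checks.append(np.allclose(W.T @ W, Sigma_inv)
                  and np.allclose(W @ Sigma @ W.T, np.eye(3)))

print(checks)   # each entry is expected to be True
```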
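Correlation-based parameterisation of whitening matrix:
We can also express $\boldsymbol W$ in terms of the corresponding correlation matrix $\boldsymbol P = \boldsymbol V^{-1/2} \boldsymbol \Sigma \boldsymbol V^{-1/2}$ where $\boldsymbol V$ is the diagonal matrix containing the variances.
Specifically, we can specify the whitening matrix as:
$$\boldsymbol W = \boldsymbol Q_2 \boldsymbol P^{-1/2} \boldsymbol V^{-1/2}$$
It is easy to verify that this also satisfies the whitening constraint:
$$\boldsymbol W^T \boldsymbol W = \boldsymbol V^{-1/2} \boldsymbol P^{-1/2} \boldsymbol Q_2^T \boldsymbol Q_2 \boldsymbol P^{-1/2} \boldsymbol V^{-1/2} = \boldsymbol V^{-1/2} \boldsymbol P^{-1} \boldsymbol V^{-1/2} = \boldsymbol \Sigma^{-1}$$
Conversely, any whitening matrix $\boldsymbol W$ can also be written in this form as $\boldsymbol Q_2 = \boldsymbol W \boldsymbol V^{1/2} \boldsymbol P^{1/2}$ is orthogonal by construction.
- Another interpretation of whitening: first standardising ($\boldsymbol V^{-1/2}$), then decorrelation ($\boldsymbol P^{-1/2}$), followed by rotation-reflection ($\boldsymbol Q_2$)
- for the Mahalanobis/ZCA transformation $\boldsymbol Q_2^{\text{ZCA}} = \boldsymbol \Sigma^{-1/2} \boldsymbol V^{1/2} \boldsymbol P^{1/2}$
Both forms to write $\boldsymbol W$, using $\boldsymbol Q_1$ and $\boldsymbol Q_2$, are equally valid. But for the same $\boldsymbol W$ we have $\boldsymbol Q_1 \neq \boldsymbol Q_2$; they are two different orthogonal matrices. Likewise $\boldsymbol \Sigma^{-1/2} \neq \boldsymbol P^{-1/2} \boldsymbol V^{-1/2}$,
even though $\boldsymbol \Sigma^{-1} = \boldsymbol V^{-1/2} \boldsymbol P^{-1} \boldsymbol V^{-1/2}$.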
3.4 Cross-covariance and cross-correlation for general whitening transformations
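A useful criterion to characterise and to distinguish among whitening transformations is the cross-covariance and cross-correlation matrix between the original variable $\boldsymbol x$ and the whitened variable $\boldsymbol z$.
- Cross-covariance
between $\boldsymbol x$ and $\boldsymbol z$:
$$\boldsymbol \Phi = \text{Cov}(\boldsymbol x, \boldsymbol z) = \boldsymbol \Sigma \boldsymbol W^T = \boldsymbol \Sigma^{1/2} \boldsymbol Q_1^T$$
In component notation we write $\boldsymbol \Phi = (\phi_{ij})$ where the row index $i$ refers to $\boldsymbol x$ and the column index $j$ to $\boldsymbol z$.
Cross-covariance is linked with $\boldsymbol Q_1$. Thus, choosing the cross-covariance determines $\boldsymbol Q_1$. Note that $\boldsymbol \Phi \boldsymbol \Phi^T = \boldsymbol \Sigma$.
The whitening matrix expressed in terms of cross-covariance is $\boldsymbol W = \boldsymbol \Phi^T \boldsymbol \Sigma^{-1}$, so as required $\boldsymbol W^T \boldsymbol W = \boldsymbol \Sigma^{-1} \boldsymbol \Phi \boldsymbol \Phi^T \boldsymbol \Sigma^{-1} = \boldsymbol \Sigma^{-1}$. Furthermore, $\boldsymbol \Phi$ is the inverse of the whitening matrix, as $\boldsymbol W^{-1} = \left(\boldsymbol Q_1 \boldsymbol \Sigma^{-1/2}\right)^{-1} = \boldsymbol \Sigma^{1/2} \boldsymbol Q_1^T = \boldsymbol \Phi$.
- Cross-correlation
between $\boldsymbol x$ and $\boldsymbol z$:
$$\boldsymbol \Psi = \text{Cor}(\boldsymbol x, \boldsymbol z) = \boldsymbol V^{-1/2} \boldsymbol \Phi = \boldsymbol P^{1/2} \boldsymbol Q_2^T$$
In component notation we write $\boldsymbol \Psi = (\psi_{ij})$ where the row index $i$ refers to $\boldsymbol x$ and the column index $j$ refers to $\boldsymbol z$.
Cross-correlation is linked with $\boldsymbol Q_2$. Hence, choosing the cross-correlation determines $\boldsymbol Q_2$. The whitening matrix expressed in terms of cross-correlation is $\boldsymbol W = \boldsymbol \Psi^T \boldsymbol P^{-1} \boldsymbol V^{-1/2}$.
Note that the factorisations of the cross-covariance $\boldsymbol \Phi$ and the cross-correlation $\boldsymbol \Psi$ into the product of a positive definite symmetric matrix and an orthogonal matrix are examples of a polar decomposition.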
3.5 Inverse whitening transformation and loadings
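Inverse transformation
Since $\boldsymbol z = \boldsymbol W \boldsymbol x$, the reverse transformation going from the whitened to the original variable is $\boldsymbol x = \boldsymbol W^{-1} \boldsymbol z$. This can also be expressed in terms of cross-covariance and cross-correlation. With $\boldsymbol W^{-1} = \boldsymbol \Phi$ we get:
$$\boldsymbol x = \boldsymbol \Phi \boldsymbol z$$
Furthermore, since $\boldsymbol \Psi = \boldsymbol V^{-1/2} \boldsymbol \Phi$ we have $\boldsymbol W^{-1} = \boldsymbol V^{1/2} \boldsymbol \Psi$ and hence
$$\boldsymbol V^{-1/2} \boldsymbol x = \boldsymbol \Psi \boldsymbol z$$
The reverse whitening transformation is also known as colouring transformation (the inverse Mahalanobis transform is one example).
Definition of loadings:
Loadings are the coefficients of the linear transformation from the latent variable back to the observed variable. If the variables are standardised to unit variance then the loadings are also called correlation loadings.
Hence, the cross-covariance matrix $\boldsymbol \Phi$ plays the role of loadings linking the latent variable $\boldsymbol z$ with the original $\boldsymbol x$. Similarly, the cross-correlation matrix $\boldsymbol \Psi$ contains the correlation loadings linking the (already standardised) latent variable $\boldsymbol z$ with the standardised $\boldsymbol x$.
Multiple correlation coefficients from $\boldsymbol z$ back to $\boldsymbol x$
Note that the components of $\boldsymbol z$ are all uncorrelated, with $\text{Var}(\boldsymbol z) = \boldsymbol I_d$. The squared multiple correlation coefficient between each $x_i$ and all $z_1, \ldots, z_d$ is therefore just the sum of the corresponding squared correlations $\psi_{ij}^2$:
$$\text{Cor}(x_i, \boldsymbol z)^2 = \sum_{j=1}^d \psi_{ij}^2$$
Since for a general linear one-to-one transformation (including whitening as a special case) there is no error, the squared multiple correlation must be 1. We can confirm this by computing the row sums of squares of the cross-correlation matrix $\boldsymbol \Psi$ in matrix notation
$$\text{Diag}(\boldsymbol \Psi \boldsymbol \Psi^T) = \text{Diag}(\boldsymbol P^{1/2} \boldsymbol Q_2^T \boldsymbol Q_2 \boldsymbol P^{1/2}) = \text{Diag}(\boldsymbol P) = (1, \ldots, 1)^T$$
for which it is clear that the choice of $\boldsymbol Q_2$ is not relevant.
Similarly, the row sums of squares of the cross-covariance matrix $\boldsymbol \Phi$ equal the variances of the original variables, regardless of $\boldsymbol Q_1$:
$$\sum_{j=1}^d \phi_{ij}^2 = \text{Var}(x_i)$$
or in matrix notation:
$$\text{Diag}(\boldsymbol \Phi \boldsymbol \Phi^T) = \text{Diag}(\boldsymbol \Sigma)$$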
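The invariance of the row sums of squares can be illustrated numerically: with $\boldsymbol \Phi = \boldsymbol \Sigma^{1/2} \boldsymbol Q_1^T$ and $\boldsymbol \Psi = \boldsymbol V^{-1/2} \boldsymbol \Phi$, the rows of $\boldsymbol \Phi$ always have squared norms equal to the variances and the rows of $\boldsymbol \Psi$ always have squared norm 1, whatever rotation is used. The covariance matrix, the angles and the helper name `loadings` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical covariance matrix with variances 4 and 1.
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
lam, U = np.linalg.eigh(Sigma)
Sigma_sqrt = U @ np.diag(np.sqrt(lam)) @ U.T
V_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))

def loadings(theta):
    """Phi and Psi for W = Q1 Sigma^{-1/2}, with Q1 a rotation by the angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    Q1 = np.array([[c, -s],
                   [s, c]])
    Phi = Sigma_sqrt @ Q1.T      # cross-covariance Cov(x, z)
    Psi = V_inv_sqrt @ Phi       # cross-correlation Cor(x, z)
    return Phi, Psi

for theta in (0.0, 0.7, 2.0):    # arbitrary angles; theta = 0 corresponds to ZCA
    Phi, Psi = loadings(theta)
    # Row sums of squares: the variances for Phi, and 1 for Psi.
    print(theta, (Phi ** 2).sum(axis=1), (Psi ** 2).sum(axis=1))
```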
3.6 Summaries of Phi and Psi resulting from Whitening transformations
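A simple summary of a matrix is its trace. For the cross-covariance matrix $\boldsymbol \Phi$ the trace is the sum of all covariances between corresponding elements in $\boldsymbol x$ and $\boldsymbol z$:
$$\text{Tr}(\boldsymbol \Phi) = \sum_{i=1}^d \text{Cov}(x_i, z_i) = \text{Tr}(\boldsymbol \Sigma^{1/2} \boldsymbol Q_1^T)$$
For the cross-correlation matrix $\boldsymbol \Psi$ the trace is the sum of all correlations between corresponding elements in $\boldsymbol x$ and $\boldsymbol z$:
$$\text{Tr}(\boldsymbol \Psi) = \sum_{i=1}^d \text{Cor}(x_i, z_i) = \text{Tr}(\boldsymbol P^{1/2} \boldsymbol Q_2^T)$$
In both cases the value of the trace depends on $\boldsymbol Q_1$ and $\boldsymbol Q_2$, respectively, and there is a unique choice such that the trace is maximised.
To maximise $\text{Tr}(\boldsymbol \Phi)$, we conduct the following steps:
- Apply the eigendecomposition $\boldsymbol \Sigma = \boldsymbol U \boldsymbol \Lambda \boldsymbol U^T$. Note that $\boldsymbol \Lambda$ is diagonal with positive eigenvalues $\lambda_i > 0$ as $\boldsymbol \Sigma$ is positive definite and $\boldsymbol U$ is an orthogonal matrix.
- The objective function becomes
$$\text{Tr}(\boldsymbol \Phi) = \text{Tr}(\boldsymbol Q_1 \boldsymbol \Sigma^{1/2}) = \text{Tr}(\boldsymbol Q_1 \boldsymbol U \boldsymbol \Lambda^{1/2} \boldsymbol U^T) = \text{Tr}(\boldsymbol \Lambda^{1/2} \boldsymbol U^T \boldsymbol Q_1 \boldsymbol U) = \text{Tr}(\boldsymbol \Lambda^{1/2} \boldsymbol B) = \sum_{i=1}^d \lambda_i^{1/2} b_{ii}$$
Note that the product of two orthogonal matrices is itself an orthogonal matrix. Therefore, $\boldsymbol B = \boldsymbol U^T \boldsymbol Q_1 \boldsymbol U$ is an orthogonal matrix and $\boldsymbol Q_1 = \boldsymbol U \boldsymbol B \boldsymbol U^T$.
- As $\lambda_i > 0$ and all $b_{ii} \in [-1, 1]$ the objective function is maximised for $b_{ii} = 1$, i.e. $\boldsymbol B = \boldsymbol I$.
- In turn, this implies that $\text{Tr}(\boldsymbol \Phi)$ is maximised for $\boldsymbol Q_1 = \boldsymbol I$.
Similarly, to maximise $\text{Tr}(\boldsymbol \Psi)$ we:
- decompose $\boldsymbol P = \boldsymbol G \boldsymbol \Theta \boldsymbol G^T$ and then, following the above,
- find that $\text{Tr}(\boldsymbol \Psi)$ is maximised for $\boldsymbol Q_2 = \boldsymbol I$.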
Squared Frobenius norm and total variation
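Another way to summarise and dissect the association between $\boldsymbol x$ and the corresponding whitened $\boldsymbol z$ is the squared Frobenius norm and the total variation based on $\boldsymbol \Phi$ and $\boldsymbol \Psi$.
The squared Frobenius (Euclidean) norm is the sum of squared elements of a matrix: $\|\boldsymbol A\|_F^2 = \sum_{i,j} a_{ij}^2 = \text{Tr}(\boldsymbol A^T \boldsymbol A)$.
If we consider the squared Frobenius norm of the cross-covariance matrix, i.e. the sum of squared covariances between $\boldsymbol x$ and $\boldsymbol z$,
$$\|\boldsymbol \Phi\|_F^2 = \text{Tr}(\boldsymbol \Phi^T \boldsymbol \Phi) = \text{Tr}(\boldsymbol Q_1 \boldsymbol \Sigma \boldsymbol Q_1^T) = \text{Tr}(\boldsymbol \Sigma)$$
this equals the total variation of $\boldsymbol \Sigma$ and it does not depend on $\boldsymbol Q_1$. Likewise, computing the squared Frobenius norm of the cross-correlation matrix, i.e. the sum of squared correlations between $\boldsymbol x$ and $\boldsymbol z$,
$$\|\boldsymbol \Psi\|_F^2 = \text{Tr}(\boldsymbol \Psi^T \boldsymbol \Psi) = \text{Tr}(\boldsymbol Q_2 \boldsymbol P \boldsymbol Q_2^T) = \text{Tr}(\boldsymbol P) = d$$
yields the total variation of $\boldsymbol P$ which also does not depend on $\boldsymbol Q_2$. Note this is because the squared Frobenius norm is invariant against rotations and reflections.
Proportion of total variation
Now compute the contribution of each whitened component $z_j$ to the total variation. The sum of squared covariances of each $z_j$ with all $x_1, \ldots, x_d$ is
$$h_j = \sum_{i=1}^d \text{Cov}(x_i, z_j)^2 = \sum_{i=1}^d \phi_{ij}^2$$
with $\sum_{j=1}^d h_j = \text{Tr}(\boldsymbol \Sigma)$ the total variation. In vector notation the contributions are written as the column sums of squares of $\boldsymbol \Phi$
$$\boldsymbol h = (h_1, \ldots, h_d)^T = \text{Diag}(\boldsymbol \Phi^T \boldsymbol \Phi) = \text{Diag}(\boldsymbol Q_1 \boldsymbol \Sigma \boldsymbol Q_1^T)$$
The relative contribution of $z_j$ versus the total variation is
$$\frac{h_j}{\text{Tr}(\boldsymbol \Sigma)}$$
Crucially, in contrast to total variation, the contributions $h_j$ depend on the choice of $\boldsymbol Q_1$.
Similarly, the sum of squared correlations of each $z_j$ with all $x_1, \ldots, x_d$ is
$$k_j = \sum_{i=1}^d \text{Cor}(x_i, z_j)^2 = \sum_{i=1}^d \psi_{ij}^2$$
with $\sum_{j=1}^d k_j = \text{Tr}(\boldsymbol P) = d$. In vector notation this corresponds to the column sums of squares of $\boldsymbol \Psi$
$$\boldsymbol k = (k_1, \ldots, k_d)^T = \text{Diag}(\boldsymbol \Psi^T \boldsymbol \Psi) = \text{Diag}(\boldsymbol Q_2 \boldsymbol P \boldsymbol Q_2^T)$$
The relative contribution of $z_j$ with regard to the total variation of the correlation is
$$\frac{k_j}{\text{Tr}(\boldsymbol P)} = \frac{k_j}{d}$$
As above, the contributions $k_j$ depend on the choice of $\boldsymbol Q_2$.
Maximising the proportion of total variation
It is possible to choose a unique whitening transformation such that the contributions are maximised, i.e. that the sum of the $m$ largest contributions of $h_j$ and $k_j$ is as large as possible.
Specifically, we note that $\boldsymbol \Phi^T \boldsymbol \Phi$ and $\boldsymbol \Psi^T \boldsymbol \Psi$ are symmetric real matrices. For this type of matrix we know from Schur's theorem that the eigenvalues
$$\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d$$
majorise the diagonal elements $p_1 \geq p_2 \geq \ldots \geq p_d$. More precisely,
$$\sum_{i=1}^m \lambda_i \geq \sum_{i=1}^m p_i$$
i.e. the sum of the largest $m$ eigenvalues is larger than or equal to the sum of the $m$ largest diagonal elements. The maximum (and equality) is only achieved if the matrix is diagonal, as in this case the diagonal elements are equal to the eigenvalues.
Therefore, the optimal solution to the problem of maximising the relative contributions is obtained by computing the eigendecompositions $\boldsymbol \Sigma = \boldsymbol U \boldsymbol \Lambda \boldsymbol U^T$ and $\boldsymbol P = \boldsymbol G \boldsymbol \Theta \boldsymbol G^T$ and diagonalising $\boldsymbol \Phi^T \boldsymbol \Phi = \boldsymbol Q_1 \boldsymbol \Sigma \boldsymbol Q_1^T$ and $\boldsymbol \Psi^T \boldsymbol \Psi = \boldsymbol Q_2 \boldsymbol P \boldsymbol Q_2^T$ by setting $\boldsymbol Q_1 = \boldsymbol U^T$ and $\boldsymbol Q_2 = \boldsymbol G^T$, respectively. This yields for the maximised contributions
$$\boldsymbol h = \text{Diag}(\boldsymbol \Lambda) = (\lambda_1, \ldots, \lambda_d)^T$$
and
$$\boldsymbol k = \text{Diag}(\boldsymbol \Theta) = (\theta_1, \ldots, \theta_d)^T$$
with eigenvalues $\lambda_i$ and $\theta_i$ arranged in decreasing order.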
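The majorisation statement can be illustrated numerically by comparing the contributions $\boldsymbol h = \text{Diag}(\boldsymbol Q_1 \boldsymbol \Sigma \boldsymbol Q_1^T)$ for $\boldsymbol Q_1 = \boldsymbol I$ and for $\boldsymbol Q_1 = \boldsymbol U^T$. The covariance matrix and the helper name `contributions` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical positive definite covariance matrix.
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 1.0, 0.3],
                  [0.5, 0.3, 2.0]])

# Eigendecomposition Sigma = U Lambda U^T with eigenvalues in decreasing order.
lam, U = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

def contributions(Q1):
    """h = Diag(Q1 Sigma Q1^T), the column sums of squares of Phi."""
    return np.diag(Q1 @ Sigma @ Q1.T)

h_identity = contributions(np.eye(3))   # Q1 = I: the variances
h_eigen = contributions(U.T)            # Q1 = U^T: the eigenvalues

# Cumulative sums of the sorted contributions: the eigenvalue-based
# contributions are never smaller (majorisation).
cum_identity = np.cumsum(np.sort(h_identity)[::-1])
cum_eigen = np.cumsum(h_eigen)
print(np.round(cum_identity, 3))
print(np.round(cum_eigen, 3))
```

Both cumulative sums end at the same value, the total variation $\text{Tr}(\boldsymbol \Sigma)$; only the distribution across components differs.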
4. Natural whitening procedures
Motivation: introduce several strategies to select an optimal whitening procedure.
Specifically, we discuss the following whitening transformations:
- Mahalanobis whitening, also known as ZCA (zero-phase component analysis) whitening in machine learning
- ZCA-cor whitening (based on correlation)
- PCA whitening (based on covariance)
- PCA-cor whitening (based on correlation)
- Cholesky whitening
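Notation: $\boldsymbol x_c = \boldsymbol x - \boldsymbol \mu_x$ and $\boldsymbol z_c = \boldsymbol z - \boldsymbol \mu_z$ denote the mean-centered variables.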
4.1 ZCA whitening
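Aim:
remove correlations and standardise but otherwise make sure that the whitened vector $\boldsymbol z$ does not differ too much from the original vector $\boldsymbol x$. Specifically, each latent component $z_i$ should be as close as possible to the corresponding original variable $x_i$:
$$z_1 \leftrightarrow x_1, \quad z_2 \leftrightarrow x_2, \quad \ldots, \quad z_d \leftrightarrow x_d$$
One possible way to implement this is to compute the expected squared difference between the two centered random vectors $\boldsymbol z_c$ and $\boldsymbol x_c$.
ZCA objective function:
minimise $\text{E}\left((\boldsymbol z_c - \boldsymbol x_c)^T (\boldsymbol z_c - \boldsymbol x_c)\right)$ to find an optimal whitening procedure.
The ZCA objective function can be simplified as follows:
$$\begin{aligned} \text{E}\left((\boldsymbol z_c - \boldsymbol x_c)^T (\boldsymbol z_c - \boldsymbol x_c)\right) &= \text{Tr}(\text{E}(\boldsymbol z_c \boldsymbol z_c^T)) - 2 \text{Tr}(\text{E}(\boldsymbol x_c \boldsymbol z_c^T)) + \text{Tr}(\text{E}(\boldsymbol x_c \boldsymbol x_c^T)) \\ &= d - 2 \text{Tr}(\boldsymbol \Phi) + \text{Tr}(\boldsymbol \Sigma) \end{aligned}$$
The same objective function can be obtained by putting a diagonal constraint on the cross-covariance $\boldsymbol \Phi$. Specifically, we are looking for the $\boldsymbol \Phi$ that is closest to the diagonal matrix $\boldsymbol I$ by minimising
$$\|\boldsymbol \Phi - \boldsymbol I\|_F^2 = \|\boldsymbol \Phi\|_F^2 - 2 \text{Tr}(\boldsymbol \Phi) + \|\boldsymbol I\|_F^2 = \text{Tr}(\boldsymbol \Sigma) - 2 \text{Tr}(\boldsymbol \Phi) + d$$
This will force the off-diagonal elements of $\boldsymbol \Phi$ to be close to zero and thus leads to sparsity in the cross-covariance matrix.
The only term in the above that depends on the whitening transformation is $-2 \text{Tr}(\boldsymbol \Phi)$ as $\boldsymbol \Phi$ is a function of $\boldsymbol Q_1$. Therefore we can use the following alternative objective:
ZCA equivalent objective:
maximise $\text{Tr}(\boldsymbol \Phi) = \text{Tr}(\boldsymbol Q_1 \boldsymbol \Sigma^{1/2})$ to find the optimal $\boldsymbol Q_1$
Solution:
From the earlier discussion we know that the optimal matrix is $\boldsymbol Q_1^{\text{ZCA}} = \boldsymbol I$. The corresponding whitening matrix for ZCA is therefore $\boldsymbol W^{\text{ZCA}} = \boldsymbol \Sigma^{-1/2}$. The cross-covariance matrix is $\boldsymbol \Phi^{\text{ZCA}} = \boldsymbol \Sigma^{1/2}$ and the cross-correlation matrix is $\boldsymbol \Psi^{\text{ZCA}} = \boldsymbol V^{-1/2} \boldsymbol \Sigma^{1/2}$.
Note that $\boldsymbol \Sigma^{1/2}$ is a symmetric positive definite matrix, hence its diagonal elements are all positive. As a result, the diagonals of $\boldsymbol \Phi^{\text{ZCA}}$ and $\boldsymbol \Psi^{\text{ZCA}}$ are positive, i.e. $\text{Cov}(x_i, z_i) > 0$ and $\text{Cor}(x_i, z_i) > 0$. Hence, for ZCA two corresponding components $x_i$ and $z_i$ are always positively correlated.
Proportion of total variation:
For ZCA with $\boldsymbol Q_1 = \boldsymbol I$ we find that $\boldsymbol h = \text{Diag}(\boldsymbol \Sigma) = (\sigma_1^2, \ldots, \sigma_d^2)^T$ with $\sum_{i=1}^d h_i = \text{Tr}(\boldsymbol \Sigma)$. Hence for ZCA the proportion of total variation contributed by the latent component $z_i$ is the ratio $\sigma_i^2 / \sum_{j=1}^d \sigma_j^2$.
Summary:
- ZCA/Mahalanobis transform is the unique transformation that minimises the expected total squared component-wise difference between $\boldsymbol x_c$ and $\boldsymbol z_c$.
- In ZCA corresponding components in the whitened and original variables are always positively correlated. This facilitates the interpretation of the whitened variables.
- Use ZCA aka Mahalanobis whitening if we want to "just" remove correlations.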
4.2 ZCA-Cor Whitening
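Aim:
same as above but remove the scale in $\boldsymbol x$ first before comparing to $\boldsymbol z$.
ZCA-cor objective function:
minimise $\text{E}\left((\boldsymbol z_c - \boldsymbol V^{-1/2} \boldsymbol x_c)^T (\boldsymbol z_c - \boldsymbol V^{-1/2} \boldsymbol x_c)\right)$ to find an optimal whitening procedure.
This can be simplified as follows:
$$\begin{aligned} \text{E}\left((\boldsymbol z_c - \boldsymbol V^{-1/2} \boldsymbol x_c)^T (\boldsymbol z_c - \boldsymbol V^{-1/2} \boldsymbol x_c)\right) &= \text{Tr}(\boldsymbol I) - 2 \text{Tr}(\text{E}(\boldsymbol V^{-1/2} \boldsymbol x_c \boldsymbol z_c^T)) + \text{Tr}(\text{E}(\boldsymbol V^{-1/2} \boldsymbol x_c \boldsymbol x_c^T \boldsymbol V^{-1/2})) \\ &= d - 2 \text{Tr}(\boldsymbol \Psi) + \text{Tr}(\boldsymbol P) \\ &= 2d - 2 \text{Tr}(\boldsymbol \Psi) \end{aligned}$$
The same objective function can also be obtained by putting a diagonal constraint on the cross-correlation $\boldsymbol \Psi$. Specifically, we are looking for the $\boldsymbol \Psi$ that is closest to the diagonal matrix $\boldsymbol I$ by minimising
$$\|\boldsymbol \Psi - \boldsymbol I\|_F^2 = \|\boldsymbol \Psi\|_F^2 - 2 \text{Tr}(\boldsymbol \Psi) + \|\boldsymbol I\|_F^2 = 2d - 2 \text{Tr}(\boldsymbol \Psi)$$
This will force the off-diagonal elements of $\boldsymbol \Psi$ to be close to zero and thus leads to sparsity in the cross-correlation matrix. The only term in the above that depends on the whitening transformation is $-2 \text{Tr}(\boldsymbol \Psi)$ as $\boldsymbol \Psi$ is a function of $\boldsymbol Q_2$. Thus we can use the following alternative objective instead:
ZCA-cor equivalent objective:
maximise $\text{Tr}(\boldsymbol \Psi) = \text{Tr}(\boldsymbol Q_2 \boldsymbol P^{1/2})$ to find the optimal $\boldsymbol Q_2$
Solution:
same as above for ZCA but using correlation instead of covariance.
From the earlier discussion we know that the optimal matrix is $\boldsymbol Q_2^{\text{ZCA-cor}} = \boldsymbol I$. The corresponding whitening matrix for ZCA-cor is therefore $\boldsymbol W^{\text{ZCA-cor}} = \boldsymbol P^{-1/2} \boldsymbol V^{-1/2}$, the cross-covariance matrix is $\boldsymbol \Phi^{\text{ZCA-cor}} = \boldsymbol V^{1/2} \boldsymbol P^{1/2}$ and the cross-correlation matrix is $\boldsymbol \Psi^{\text{ZCA-cor}} = \boldsymbol P^{1/2}$.
For the ZCA-cor transformation we also have $\text{Cov}(x_i, z_i) > 0$ and $\text{Cor}(x_i, z_i) > 0$ so that two corresponding components $x_i$ and $z_i$ are always positively correlated.
Proportion of total variation:
For ZCA-cor with $\boldsymbol Q_2 = \boldsymbol I$ we find that $\boldsymbol k = \text{Diag}(\boldsymbol P) = (1, \ldots, 1)^T$ with all $k_i = 1$. Thus, in ZCA-cor each whitened component $z_i$ contributes equally to the total variation $\text{Tr}(\boldsymbol P) = d$, with relative proportion $1/d$.
Summary:
- ZCA-cor whitening is the unique whitening transformation maximising the total correlation between corresponding elements in $\boldsymbol x$ and $\boldsymbol z$
- ZCA-cor leads to interpretable $\boldsymbol z$ because each individual element in $\boldsymbol z$ is (typically strongly) positively correlated with the corresponding element in the original $\boldsymbol x$
- As ZCA-cor is explicitly constructed to maximise the total pairwise correlations it achieves higher total correlation than ZCA
- If $\boldsymbol x$ is standardised to $\text{Var}(x_i) = 1$ then ZCA and ZCA-cor are identical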
4.3 PCA whitening
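Aim:
remove correlations and at the same time compress information into a few latent variables. Specifically, we would like the first latent component $z_1$ to be maximally linked with all variables in $\boldsymbol x$, followed by the second component $z_2$ and so on:
$$\begin{array}{l} z_1 \rightarrow x_1, x_2, \ldots, x_d \\ z_2 \rightarrow x_1, x_2, \ldots, x_d \\ \quad \vdots \\ z_d \rightarrow x_1, x_2, \ldots, x_d \end{array}$$
One way to measure the total association of the latent component $z_j$ with all the original $x_1, \ldots, x_d$ is the sum of the corresponding squared covariances
$$h_j = \sum_{i=1}^d \text{Cov}(x_i, z_j)^2 = \sum_{i=1}^d \phi_{ij}^2$$
or equivalently the column sum of squares of $\boldsymbol \Phi$
$$\boldsymbol h = (h_1, \ldots, h_d)^T = \text{Diag}(\boldsymbol \Phi^T \boldsymbol \Phi) = \text{Diag}(\boldsymbol Q_1 \boldsymbol \Sigma \boldsymbol Q_1^T)$$
Each $h_j$ is the contribution of $z_j$ to $\text{Tr}(\boldsymbol Q_1 \boldsymbol \Sigma \boldsymbol Q_1^T) = \text{Tr}(\boldsymbol \Sigma)$, i.e. to the total variation based on $\boldsymbol \Sigma$. As $\text{Tr}(\boldsymbol \Sigma)$ is constant this implies that there are only $d-1$ independent $h_j$.
In PCA whitening we wish to concentrate most of the contributions to the total variation based on $\boldsymbol \Sigma$ in a small number of latent components.
PCA whitening objective function:
find an optimal $\boldsymbol Q_1$ so that the resulting set $h_1 \geq h_2 \geq \ldots \geq h_d$ in $\boldsymbol h$ majorises any other set of relative contributions.
Solution:
Following the earlier discussion we apply Schur's theorem and find the optimal solution by diagonalising $\boldsymbol \Phi^T \boldsymbol \Phi$ through the eigendecomposition $\boldsymbol \Sigma = \boldsymbol U \boldsymbol \Lambda \boldsymbol U^T$. Hence, the optimal value for the $\boldsymbol Q_1$ matrix is
$$\boldsymbol Q_1^{\text{PCA}} = \boldsymbol U^T$$
However, recall that $\boldsymbol U$ is not uniquely defined: we are free to change the column signs. The corresponding whitening matrix is
$$\boldsymbol W^{\text{PCA}} = \boldsymbol U^T \boldsymbol \Sigma^{-1/2} = \boldsymbol \Lambda^{-1/2} \boldsymbol U^T$$
the cross-covariance matrix is
$$\boldsymbol \Phi^{\text{PCA}} = \boldsymbol U \boldsymbol \Lambda^{1/2}$$
and the cross-correlation matrix is
$$\boldsymbol \Psi^{\text{PCA}} = \boldsymbol V^{-1/2} \boldsymbol U \boldsymbol \Lambda^{1/2}$$
Identifiability:
Note that all of the above (i.e. $\boldsymbol Q_1^{\text{PCA}}$, $\boldsymbol W^{\text{PCA}}$, $\boldsymbol \Phi^{\text{PCA}}$, $\boldsymbol \Psi^{\text{PCA}}$) is not unique due to the sign ambiguity in the columns of $\boldsymbol U$.
Therefore, for identifiability reasons we may wish to impose a further constraint on $\boldsymbol Q_1^{\text{PCA}}$ or equivalently $\boldsymbol \Phi^{\text{PCA}}$. A useful condition is to require (for the given ordering of the original variables) that $\boldsymbol U$ has a positive diagonal or equivalently that $\boldsymbol \Phi^{\text{PCA}}$ has a positive diagonal. This implies that $\text{Diag}(\boldsymbol \Phi^{\text{PCA}}) > 0$ and $\text{Diag}(\boldsymbol \Psi^{\text{PCA}}) > 0$, hence all pairs $x_i$ and $z_i$ are positively correlated.
It is particularly important to pay attention to the sign ambiguity when comparing different computer implementations of PCA whitening (and the related PCA approach).
Note that the actual objective of PCA whitening is not affected by the sign ambiguity since the column signs of $\boldsymbol \Phi^{\text{PCA}}$ do not matter.
Proportion of total variation:
In PCA whitening the contribution of each latent component $z_i$ to the total variation based on the covariance $\boldsymbol \Sigma$ is $h_i = \lambda_i$. The fraction $\lambda_i / \sum_{j=1}^d \lambda_j$ is the relative contribution of each element in $\boldsymbol z$ to explain the total variation.
Thus, low ranking components $z_i$ with small $h_i = \lambda_i$ may be discarded. In this way PCA whitening achieves both compression and dimension reduction.
Summary:
- PCA whitening is a whitening transformation that maximises compression with the sum of squared cross-covariances as underlying optimality criterion.
- There are sign ambiguities in the PCA whitened variables which are inherited from the sign ambiguities in eigenvectors.
- If a positive-diagonal condition on the orthogonal matrices is imposed then these sign ambiguities are fully resolved and corresponding components $z_i$ and $x_i$ are always positively correlated.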
4.4 PCA-cor whitening
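Aim:
same as for PCA whitening but remove the scale in $\boldsymbol x$ first. This means we use squared correlations rather than squared covariances to measure compression, i.e.
$$k_j = \sum_{i=1}^d \text{Cor}(x_i, z_j)^2 = \sum_{i=1}^d \psi_{ij}^2$$
or in vector notation the column sum of squares of $\boldsymbol \Psi$
$$\boldsymbol k = (k_1, \ldots, k_d)^T = \text{Diag}(\boldsymbol \Psi^T \boldsymbol \Psi) = \text{Diag}(\boldsymbol Q_2 \boldsymbol P \boldsymbol Q_2^T)$$
Each $k_j$ is the contribution of $z_j$ to $\text{Tr}(\boldsymbol Q_2 \boldsymbol P \boldsymbol Q_2^T) = \text{Tr}(\boldsymbol P) = d$, i.e. the total variation based on $\boldsymbol P$. As $\text{Tr}(\boldsymbol P) = d$ is constant this implies that there are only $d-1$ independent $k_j$.
In PCA-cor whitening we wish to concentrate most of the contributions to the total variation based on $\boldsymbol P$ in a small number of latent components.
PCA-cor whitening objective function:
find an optimal $\boldsymbol Q_2$ so that the resulting set $k_1 \geq k_2 \geq \ldots \geq k_d$ in $\boldsymbol k$ majorises any other set of relative contributions.
Solution:
Following the earlier discussion we apply Schur's theorem and find the optimal solution by diagonalising $\boldsymbol \Psi^T \boldsymbol \Psi$ through the eigendecomposition $\boldsymbol P = \boldsymbol G \boldsymbol \Theta \boldsymbol G^T$. Hence, the optimal value for the $\boldsymbol Q_2$ matrix is
$$\boldsymbol Q_2^{\text{PCA-cor}} = \boldsymbol G^T$$
Again $\boldsymbol G$ is not uniquely defined: we are free to change the signs of the columns. The corresponding whitening matrix is
$$\boldsymbol W^{\text{PCA-cor}} = \boldsymbol G^T \boldsymbol P^{-1/2} \boldsymbol V^{-1/2} = \boldsymbol \Theta^{-1/2} \boldsymbol G^T \boldsymbol V^{-1/2}$$
and the cross-covariance matrix is
$$\boldsymbol \Phi^{\text{PCA-cor}} = \boldsymbol V^{1/2} \boldsymbol G \boldsymbol \Theta^{1/2}$$
and the cross-correlation matrix is
$$\boldsymbol \Psi^{\text{PCA-cor}} = \boldsymbol G \boldsymbol \Theta^{1/2}$$
Identifiability:
As with PCA whitening, there are sign ambiguities in the above because the column signs of $\boldsymbol G$ can be freely chosen. We can impose further constraints on $\boldsymbol Q_2^{\text{PCA-cor}}$ or equivalently on $\boldsymbol \Psi^{\text{PCA-cor}}$ to get identifiability.
A useful condition is to require that the diagonal elements of $\boldsymbol G$ are all positive or equivalently that $\boldsymbol \Psi^{\text{PCA-cor}}$ has a positive diagonal. This implies that $\text{Diag}(\boldsymbol \Psi^{\text{PCA-cor}}) > 0$ and $\text{Diag}(\boldsymbol \Phi^{\text{PCA-cor}}) > 0$.
Note that the actual objective of PCA-cor whitening is not affected by the sign ambiguity since the column signs of $\boldsymbol \Psi^{\text{PCA-cor}}$ do not matter.
Proportion of total variation:
In PCA-cor whitening the contribution of each latent component $z_i$ to the total variation based on the correlation $\boldsymbol P$ is $k_i = \theta_i$. The fraction $\theta_i / \sum_{j=1}^d \theta_j = \theta_i / d$ is the relative contribution of each element in $\boldsymbol z$ to explain the total variation.
Summary:
- PCA-cor whitening is a whitening transformation that maximises compression with the sum of squared cross-correlations as underlying optimality criterion.
- There are sign ambiguities in the PCA-cor whitened variables which are inherited from the sign ambiguities in the eigenvectors.
- If a positive-diagonal condition on the orthogonal matrices is imposed then these sign ambiguities are fully resolved and corresponding components $z_i$ and $x_i$ are always positively correlated.
- If $\boldsymbol x$ is standardised to $\text{Var}(x_i) = 1$, then PCA and PCA-cor whitening are identical.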
4.5 Cholesky whitening
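Aim:
Find a whitening transformation such that the cross-covariance $\boldsymbol \Phi$ and cross-correlation $\boldsymbol \Psi$ have lower triangular structure. Specifically, we wish that the first original variable $x_1$ is linked with the first latent variable $z_1$ only, the second original variable $x_2$ is linked to $z_1$ and $z_2$ only, and so on, and the last variable $x_d$ is linked with all latent variables $z_1, \ldots, z_d$:
$$\begin{array}{l} x_1 \leftarrow z_1 \\ x_2 \leftarrow z_1, z_2 \\ \quad \vdots \\ x_d \leftarrow z_1, z_2, \ldots, z_d \end{array}$$
Thus, Cholesky whitening imposes a structural constraint on the loadings, where the non-zero coefficients are all in the lower half whereas in the upper half the coefficients all vanish.
Cholesky matrix decomposition:
The Cholesky decomposition $\boldsymbol A = \boldsymbol L \boldsymbol L^T$ of a square matrix $\boldsymbol A$ requires a positive definite $\boldsymbol A$ and is unique. $\boldsymbol L$ is a lower triangular matrix with positive diagonal elements. Its inverse $\boldsymbol L^{-1}$ is also lower triangular with positive diagonal elements. If $\boldsymbol D$ is a diagonal matrix with positive elements then $\boldsymbol D \boldsymbol L$ is also a lower triangular matrix with a positive diagonal and the Cholesky factor for the matrix $\boldsymbol D \boldsymbol A \boldsymbol D$.
Apply a Cholesky decomposition to $\boldsymbol \Sigma = \boldsymbol L \boldsymbol L^T$.
The resulting whitening matrix is
$$\boldsymbol W^{\text{Chol}} = \boldsymbol L^{-1}$$
By construction, $\boldsymbol W^{\text{Chol}}$ is a lower triangular matrix with positive diagonal. The whitening constraint is satisfied as
$$(\boldsymbol W^{\text{Chol}})^T \boldsymbol W^{\text{Chol}} = (\boldsymbol L^{-1})^T \boldsymbol L^{-1} = (\boldsymbol L \boldsymbol L^T)^{-1} = \boldsymbol \Sigma^{-1}$$
The cross-covariance matrix is the inverse of the whitening matrix:
$$\boldsymbol \Phi^{\text{Chol}} = \boldsymbol L$$
and the cross-correlation matrix is:
$$\boldsymbol \Psi^{\text{Chol}} = \boldsymbol V^{-1/2} \boldsymbol L$$
Both $\boldsymbol \Phi^{\text{Chol}}$ and $\boldsymbol \Psi^{\text{Chol}}$ are lower triangular matrices with positive diagonal elements. Hence two corresponding components $z_i$ and $x_i$ are always positively correlated.
The corresponding orthogonal matrices are
$$\boldsymbol Q_1^{\text{Chol}} = (\boldsymbol \Phi^{\text{Chol}})^T \boldsymbol \Sigma^{-1/2} = \boldsymbol L^T \boldsymbol \Sigma^{-1/2}$$
and
$$\boldsymbol Q_2^{\text{Chol}} = (\boldsymbol \Psi^{\text{Chol}})^T \boldsymbol P^{-1/2} = \boldsymbol L^T \boldsymbol V^{-1/2} \boldsymbol P^{-1/2}$$
If we apply the Cholesky matrix decomposition to the correlation instead of the covariance, we still get the same whitening transform.
The Cholesky factor for $\boldsymbol P = \boldsymbol V^{-1/2} \boldsymbol \Sigma \boldsymbol V^{-1/2}$ is $\boldsymbol V^{-1/2} \boldsymbol L$. The corresponding whitening matrix is
$$(\boldsymbol V^{-1/2} \boldsymbol L)^{-1} \boldsymbol V^{-1/2} = \boldsymbol L^{-1} = \boldsymbol W^{\text{Chol}}$$
This is also intuitively clear as the covariance and correlation loadings are closely linked; in particular they share the same triangular shape.
Dependence on the input order:
Cholesky whitening depends on the ordering of input variables. Each ordering of the original variables will yield a different triangular constraint and thus a different Cholesky whitening transform. For example, by inverting the ordering to $x_d, x_{d-1}, \ldots, x_1$, we effectively enforce an upper triangular shape.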
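A minimal NumPy sketch of Cholesky whitening, using the convention $\boldsymbol \Sigma = \boldsymbol L \boldsymbol L^T$ and $\boldsymbol W = \boldsymbol L^{-1}$, is given below. The covariance matrix is an assumption chosen for illustration. The sketch also computes the whitening matrix obtained from the Cholesky factor of the correlation matrix, for comparison.

```python
import numpy as np

# Hypothetical positive definite covariance matrix.
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 1.0, 0.3],
                  [0.5, 0.3, 2.0]])

L = np.linalg.cholesky(Sigma)    # lower triangular, Sigma = L L^T
W_chol = np.linalg.inv(L)        # whitening matrix, also lower triangular

V_inv_sqrt = np.diag(np.diag(Sigma) ** -0.5)
Phi = L                          # cross-covariance Cov(x, z)
Psi = V_inv_sqrt @ L             # cross-correlation Cor(x, z)

# Same construction starting from the correlation matrix P.
P = V_inv_sqrt @ Sigma @ V_inv_sqrt
L_P = np.linalg.cholesky(P)
W_from_P = np.linalg.inv(L_P) @ V_inv_sqrt

print(np.round(W_chol, 3))
print(np.round(Psi, 3))          # first row is (1, 0, 0): x_1 depends on z_1 only
```

For larger problems a triangular solve would normally be preferred over forming the inverse explicitly; the explicit inverse is used here only to keep the correspondence with the formulas visible.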
4.6 Comparison of whitening procedures
Simulated data
Apply ZCA, PCA and Cholesky whitening to simulated bivariate normal data with correlated components.
As expected, in ZCA both cross-correlations (Cross-Cor1 between $x_1$ and $z_1$, and Cross-Cor2 between $x_2$ and $z_2$) are strong. This is not the case for PCA and Cholesky whitening.
Note that for Cholesky whitening the first component $z_1$ is perfectly positively correlated with the original component $x_1$.
Iris Flowers
The data set has dimension $d=4$ (so the whitened vector also has dimension 4) and sample size $n=150$; there are 3 species.
Apply the above whitening transforms to this data and then sort the whitened components by their relative contribution to the total variation. (For Cholesky whitening, use the input order for the shape constraint.)
As expected, the two PCA whitening approaches compress the data most. On the other end of the spectrum, the ZCA whitening methods are the two least compressing approaches. Cholesky whitening is a compromise between ZCA and PCA in terms of compression.
Similar results are obtained based on correlation loadings (note that ZCA-cor gives equal weight to each latent variable).
Summary
| Method | Usage |
|---|---|
| ZCA, ZCA-cor | Pure decorrelation, maintain similarity to original data set, interpretability |
| PCA, PCA-cor | Compression, find effective dimension, reduce dimensionality, feature identification |
| Cholesky | Triangular shaped W, Phi, Psi, sparsity |
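The five procedures can be put side by side in a short NumPy sketch. It computes each whitening matrix from a covariance matrix and reports the relative contributions of the whitened components, i.e. the normalised column sums of squares of $\boldsymbol \Phi = \boldsymbol W^{-1}$ and $\boldsymbol \Psi = \boldsymbol V^{-1/2} \boldsymbol \Phi$. The function names, the method labels and the example covariance matrix are assumptions made for illustration, not an existing library interface; eigenvector signs are fixed by requiring a non-negative diagonal.

```python
import numpy as np

def sym_power(A, p):
    """A^p for a symmetric positive definite matrix A."""
    lam, U = np.linalg.eigh(A)
    return U @ np.diag(lam ** p) @ U.T

def sorted_eig(A):
    """Eigenvalues in decreasing order; column signs of the eigenvector
    matrix are flipped so that its diagonal is non-negative."""
    lam, U = np.linalg.eigh(A)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    signs = np.where(np.diag(U) < 0, -1.0, 1.0)
    return lam, U * signs        # multiplies column j by signs[j]

def whitening_matrix(Sigma, method):
    v_inv_sqrt = np.diag(np.diag(Sigma) ** -0.5)
    P = v_inv_sqrt @ Sigma @ v_inv_sqrt
    if method == "ZCA":
        return sym_power(Sigma, -0.5)
    if method == "ZCA-cor":
        return sym_power(P, -0.5) @ v_inv_sqrt
    if method == "PCA":
        lam, U = sorted_eig(Sigma)
        return np.diag(lam ** -0.5) @ U.T
    if method == "PCA-cor":
        theta, G = sorted_eig(P)
        return np.diag(theta ** -0.5) @ G.T @ v_inv_sqrt
    if method == "Cholesky":
        return np.linalg.inv(np.linalg.cholesky(Sigma))
    raise ValueError(f"unknown method: {method}")

def contributions(Sigma, W):
    """Column sums of squares of Phi = W^{-1} and of Psi = V^{-1/2} Phi."""
    Phi = np.linalg.inv(W)
    Psi = np.diag(np.diag(Sigma) ** -0.5) @ Phi
    return (Phi ** 2).sum(axis=0), (Psi ** 2).sum(axis=0)

# Hypothetical positive definite covariance matrix.
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 1.0, 0.3],
                  [0.5, 0.3, 2.0]])

for method in ("ZCA", "ZCA-cor", "PCA", "PCA-cor", "Cholesky"):
    W = whitening_matrix(Sigma, method)
    h, k = contributions(Sigma, W)
    print(method, np.round(h / h.sum(), 3), np.round(k / k.sum(), 3))
```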
Other related methods
- Factor models: essentially whitening plus an additional error term, factors have rotational freedom just like in whitening.
- Partial Least Squares (PLS): similar to Principal Components Analysis (PCA) but in a regression setting (with the choice of latent variables depending on the response).
- Nonlinear dimension reduction methods such as SNE, tSNE, UMAP.
5. Principal Component Analysis (PCA)
5.1 PCA transformation
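Assume a random vector $\boldsymbol x$ with $\text{Var}(\boldsymbol x) = \boldsymbol \Sigma = \boldsymbol U \boldsymbol \Lambda \boldsymbol U^T$. PCA is a particular orthogonal transformation of the original $\boldsymbol x$ such that the resulting components are orthogonal:
$$\boldsymbol t^{\text{PCA}} = \boldsymbol U^T \boldsymbol x$$
where $\boldsymbol t^{\text{PCA}}$ satisfies:
$$\text{Var}(\boldsymbol t^{\text{PCA}}) = \boldsymbol U^T \boldsymbol \Sigma \boldsymbol U = \boldsymbol \Lambda$$
Note that while principal components are orthogonal they do not have unit variance ($\text{Var}(\boldsymbol t^{\text{PCA}}) \neq \boldsymbol I$); instead the variances of the principal components $t_i$ equal the eigenvalues $\lambda_i$.
Thus PCA itself is not a whitening procedure but it is very closely linked to PCA whitening, which is obtained by standardising the principal components:
$$\boldsymbol z^{\text{PCA}} = \boldsymbol \Lambda^{-1/2} \boldsymbol t^{\text{PCA}} = \boldsymbol \Lambda^{-1/2} \boldsymbol U^T \boldsymbol x$$
Compression properties:
The total variation is $\text{Tr}(\text{Var}(\boldsymbol t^{\text{PCA}})) = \text{Tr}(\boldsymbol \Lambda) = \sum_{j=1}^d \lambda_j$. With principal components the fraction $\lambda_i / \sum_{j=1}^d \lambda_j$ can be interpreted as the proportion of variation contributed by each component in $\boldsymbol t^{\text{PCA}}$ to the total variation. Thus, low ranking components in $\boldsymbol t^{\text{PCA}}$ with low variation may be discarded, leading to a reduction in dimension.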
5.2 Application to data
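Written in terms of a data matrix $\boldsymbol X$ ($n \times d$) instead of a random vector $\boldsymbol x$, PCA becomes:
$$\boldsymbol T = \boldsymbol X \boldsymbol U$$
There are two ways to obtain $\boldsymbol U$:
- Estimate the covariance matrix, e.g. by $\hat{\boldsymbol \Sigma} = \frac{1}{n} \boldsymbol X_c^T \boldsymbol X_c$ where $\boldsymbol X_c$ is the column-centred data matrix; then apply the eigenvalue decomposition on $\hat{\boldsymbol \Sigma}$ to get $\boldsymbol U$.
- Compute the singular value decomposition of $\boldsymbol X_c = \boldsymbol V \boldsymbol D \boldsymbol U^T$. As $\hat{\boldsymbol \Sigma} = \frac{1}{n} \boldsymbol X_c^T \boldsymbol X_c = \boldsymbol U \left(\frac{1}{n} \boldsymbol D^2\right) \boldsymbol U^T$ we can just use $\boldsymbol U$ from the SVD of $\boldsymbol X_c$ and there is no need to compute the covariance.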
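The two routes can be compared on simulated data. The sketch below is a minimal NumPy illustration; the seed, the sample size and the mixing matrix `A` used to generate correlated data are assumptions chosen for this example. The eigenvector matrices from the two routes agree up to the column signs.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed, only for reproducibility

# Simulated correlated data (hypothetical mixing matrix A).
n, d = 500, 3
A = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.3, 0.2, 0.5]])
X = rng.standard_normal((n, d)) @ A.T
Xc = X - X.mean(axis=0)          # column-centred data matrix

# Route 1: eigendecomposition of the estimated covariance matrix.
Sigma_hat = Xc.T @ Xc / n
lam, U_eig = np.linalg.eigh(Sigma_hat)
order = np.argsort(lam)[::-1]    # eigenvalues in decreasing order
lam, U_eig = lam[order], U_eig[:, order]

# Route 2: singular value decomposition Xc = V D U^T.
_, s, Ut = np.linalg.svd(Xc, full_matrices=False)
U_svd = Ut.T
lam_svd = s ** 2 / n             # eigenvalues of Sigma_hat from singular values

T = Xc @ U_svd                   # principal components of the centred data

print(np.round(lam, 4))
print(np.round(lam_svd, 4))
```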
5.3 Iris flower data example
First standardise the data, then compute PCA components and plot the proportion of total variation contributed by each component. The plot shows that only two PCA components are needed to achieve 95% of the total variation:
A scatter plot of the first two principal components is also informative:
The plot shows the grouping among the 150 flowers, corresponding to the species, and that these groups can be characterised by the principal components.
5.4 PCA correlation loadings
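For a general whitening transformation the cross-correlation $\boldsymbol \Psi = \text{Cor}(\boldsymbol x, \boldsymbol z)$ plays the role of correlation loadings in the inverse transformation:
$$\boldsymbol V^{-1/2} \boldsymbol x = \boldsymbol \Psi \boldsymbol z$$
i.e. they are the coefficients linking the whitened variable $\boldsymbol z$ with the standardised original variable $\boldsymbol x$. This relationship therefore also holds for PCA whitening with $\boldsymbol z^{\text{PCA}} = \boldsymbol \Lambda^{-1/2} \boldsymbol U^T \boldsymbol x$ and $\boldsymbol \Psi^{\text{PCA}} = \boldsymbol V^{-1/2} \boldsymbol U \boldsymbol \Lambda^{1/2}$.
Even though classical PCA is not a whitening approach because $\text{Var}(\boldsymbol t^{\text{PCA}}) \neq \boldsymbol I$, we can still compute cross-correlations between $\boldsymbol x$ and the principal components $\boldsymbol t^{\text{PCA}}$, resulting in
$$\text{Cor}(\boldsymbol x, \boldsymbol t^{\text{PCA}}) = \boldsymbol V^{-1/2} \boldsymbol U \boldsymbol \Lambda^{1/2} = \boldsymbol \Psi^{\text{PCA}}$$
Note these are the same as the cross-correlations for PCA whitening since $\boldsymbol t^{\text{PCA}}$ and $\boldsymbol z^{\text{PCA}}$ only differ in scale.
The inverse PCA transformation is
$$\boldsymbol x = \boldsymbol U \boldsymbol t^{\text{PCA}}$$
In terms of standardised PCA components and standardised original components it becomes
$$\boldsymbol V^{-1/2} \boldsymbol x = \boldsymbol \Psi^{\text{PCA}} \boldsymbol \Lambda^{-1/2} \boldsymbol t^{\text{PCA}}$$
Thus the cross-correlation matrix $\boldsymbol \Psi^{\text{PCA}}$ plays the role of correlation loadings also in classical PCA, i.e. they are the coefficients linking the standardised PCA components with the standardised original components.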
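The identity $\text{Cor}(\boldsymbol x, \boldsymbol t^{\text{PCA}}) = \boldsymbol V^{-1/2} \boldsymbol U \boldsymbol \Lambda^{1/2}$ also holds for the empirical quantities and can be checked on simulated data. In the sketch below the seed, sample size and mixing matrix `A` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)   # fixed seed, only for reproducibility

# Simulated correlated data (hypothetical mixing matrix A).
n, d = 400, 3
A = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.3, 0.2, 0.5]])
X = rng.standard_normal((n, d)) @ A.T
Xc = X - X.mean(axis=0)

# Eigendecomposition of the estimated covariance, eigenvalues decreasing.
Sigma_hat = Xc.T @ Xc / n
lam, U = np.linalg.eigh(Sigma_hat)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

T = Xc @ U                                       # principal components
V_inv_sqrt = np.diag(np.diag(Sigma_hat) ** -0.5)
Psi = V_inv_sqrt @ U @ np.diag(np.sqrt(lam))     # correlation loadings

# Empirical correlations between original variables and principal components.
cor_xt = np.corrcoef(np.hstack([Xc, T]), rowvar=False)[:d, d:]

print(np.round(Psi, 3))
print(np.round(cor_xt, 3))
```

The rows of `Psi` have squared norm 1, so in a correlation loadings plot of all components each original variable lies on the unit sphere; in a plot of only the first two components the points lie inside the unit circle.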
5.5 PCA correlation loadings plot
In PCA and PCA-cor whitening as well as in classical PCA, the aim is compression, i.e. find the latent variables such that most of the total variation is contributed by a small number of components.
The below plot is the correlation loadings plot showing the cross-correlation between the first two PCA components and all four variables of the iris flower data set.