
03 Unsupervised Learning and Clustering


1. Hierarchical clustering

Hierarchical clustering includes two categories:
  • divisive or recursive partitioning algorithms
    • grow the tree from the root downwards
    • first determine the two main clusters, then recursively refine these clusters further
  • agglomerative algorithms
    • grow the tree from the leaves upwards
    • successively form partitions by first joining individual objects together, then recursively joining groups of items, until everything is merged into a single cluster

1.1 Agglomerative hierarchical clustering algorithms

Initialisation:
Compute a dissimilarity / distance matrix between all pairs of objects, where "objects" are single data points at this stage but later may also be sets of data points.
Iterative procedure:
  1. Identify the pair of objects with the smallest distance. These two objects are then merged together into one set. Create an internal node in the tree to represent this set.
  2. Update the distance matrix by computing the distances between the new set and all other objects. If the new set contains all data points, the procedure terminates. The final node created is the root node.
Definition of distance (between two data points $x = (x_1, \dots, x_d)$ and $y = (y_1, \dots, y_d)$):
  • Euclidean distance: $d(x, y) = \sqrt{\sum_{j=1}^d (x_j - y_j)^2}$
  • Squared Euclidean distance: $d(x, y) = \sum_{j=1}^d (x_j - y_j)^2$
  • Manhattan distance: $d(x, y) = \sum_{j=1}^d |x_j - y_j|$
  • Maximum norm: $d(x, y) = \max_j |x_j - y_j|$
As for the distance between two sets $A$ and $B$:
  • Complete linkage (max distance): $d(A, B) = \max_{x \in A,\, y \in B} d(x, y)$
  • Single linkage (min distance): $d(A, B) = \min_{x \in A,\, y \in B} d(x, y)$
  • Average linkage (avg. distance): $d(A, B) = \frac{1}{|A||B|} \sum_{x \in A} \sum_{y \in B} d(x, y)$
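As a concrete illustration, here is a minimal sketch (not part of the original notes) that runs agglomerative clustering on a made-up toy data set using SciPy's `scipy.cluster.hierarchy` functions, which implement the linkage rules listed above.

```python
# Minimal sketch: agglomerative clustering with the linkage rules above.
# Assumes numpy and scipy are available; the data set is a made-up toy example.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),   # toy cluster around (0, 0)
               rng.normal(3.0, 0.3, (10, 2))])  # toy cluster around (3, 3)

d = pdist(X, metric="euclidean")                # pairwise distances (initial dissimilarity matrix)
for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)               # merge tree built from the leaves upwards
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
    print(method, labels)
```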

1.2 Ward's clustering method

Key idea: In each iteration, the two sets $A$ and $B$ are merged that lead to the smallest increase in the within-group variation.
The within-group sum of squares for a group $G$ with mean $\bar{x}_G$ is $SS(G) = \sum_{x \in G} \|x - \bar{x}_G\|^2$.
However, we typically only have pairwise distances available and do not know the group means, so in practice the merging cost is computed from the pairwise squared Euclidean distances (e.g. via the Lance–Williams update formula).
The distance between two sets $A$ and $B$ is the increase in the within-group sum of squares caused by merging them: $d(A, B) = SS(A \cup B) - SS(A) - SS(B) = \frac{|A||B|}{|A| + |B|} \|\bar{x}_A - \bar{x}_B\|^2$.
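The following small numerical check (our own illustration, with made-up data) verifies the identity above: the increase in within-group sum of squares caused by merging two sets equals $\frac{|A||B|}{|A|+|B|}\|\bar{x}_A - \bar{x}_B\|^2$.

```python
# Numerical check of Ward's merging cost on made-up data.
import numpy as np

def ss(G):
    """Within-group sum of squares of the rows of G."""
    return np.sum((G - G.mean(axis=0)) ** 2)

rng = np.random.default_rng(1)
A = rng.normal(0.0, 1.0, (5, 2))
B = rng.normal(2.0, 1.0, (8, 2))

increase = ss(np.vstack([A, B])) - ss(A) - ss(B)
ward = len(A) * len(B) / (len(A) + len(B)) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
print(increase, ward)   # the two values agree up to floating-point error
```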

2. K-means clustering

2.1 Algorithm

We assume that there are $K$ groups.
Initialisation:
At the start of the algorithm, the $n$ observations are randomly allocated with equal probability to one of the $K$ groups. The resulting assignment is $z = (z_1, \dots, z_n)$, with each $z_i \in \{1, \dots, K\}$. $C_k = \{i : z_i = k\}$ is the set of indices of the data points in cluster $k$, and $n_k = |C_k|$ is the number of samples in cluster $k$.
Iterative refinement:
  1. Estimate the group means by $\hat{\mu}_k = \frac{1}{n_k} \sum_{i \in C_k} x_i$.
  2. Update the group allocations $z_1, \dots, z_n$. Specifically, assign each data point $x_i$ to the group with the nearest mean $\hat{\mu}_k$, where the distance is measured in terms of the Euclidean norm: $z_i = \arg\min_k \|x_i - \hat{\mu}_k\|$.
Steps 1 and 2 are repeated until the algorithm converges (i.e. until the group allocations no longer change) or until a specified upper limit of iterations is reached.
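Below is a compact sketch of the two-step iteration just described (numpy only; the function name `kmeans` and the toy data are our own and not from the notes; empty clusters are not handled, for brevity).

```python
# Minimal K-means sketch following the algorithm above.
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=len(X))                  # random initial allocation
    for _ in range(max_iter):
        # Step 1: estimate the group means
        mu = np.array([X[z == k].mean(axis=0) for k in range(K)])
        # Step 2: re-allocate each point to the group with the nearest mean
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z_new = dist.argmin(axis=1)
        if np.array_equal(z_new, z):                  # converged: allocations unchanged
            break
        z = z_new
    return z, mu

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(4.0, 0.5, (50, 2))])
labels, centres = kmeans(X, K=2)
```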

2.2 Number of Clusters

Once the $K$-means algorithm has run, we can assess the homogeneity and heterogeneity of the resulting clusters:
  • total within-group sum of squares, or total unexplained sum of squares: $SSW(K) = \sum_{k=1}^K \sum_{i \in C_k} \|x_i - \hat{\mu}_k\|^2$
    • SSW decreases with $K$ and is zero when $K = n$. The $K$-means algorithm tries to minimise this quantity, but it will typically only find a local minimum rather than the global one.
  • between-group sum of squares, or explained sum of squares: $SSB(K) = \sum_{k=1}^K n_k \|\hat{\mu}_k - \bar{x}\|^2$
    • where $\bar{x}$ is the global mean of the samples. SSB increases with the number of clusters $K$. When $K = n$, it becomes equal to the total sum of squares SST.
  • total sum of squares: $SST = \sum_{i=1}^n \|x_i - \bar{x}\|^2$
    • By construction $SST = SSW(K) + SSB(K)$ for any $K$ (i.e. SST is a constant independent of $K$).
Define:
  • $SST$ as the total variation
  • $SSB(K)$ as the total explained variation
  • $SSW(K)$ as the total unexplained variation
In order to decide on the optimal number of clusters, we run $K$-means for different settings of $K$ and then choose the smallest $K$ for which the explained variation is not significantly worse compared to a clustering with a substantially larger $K$. An example is shown below.
[Figure: explained and unexplained variation plotted against the number of clusters $K$]
Thus, the number of clusters at which the curves level off is appropriate, since the explained variation does not significantly improve (and the unexplained variation does not significantly decrease) with a further increase in the number of clusters.
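A possible way to produce such a plot (a sketch assuming scikit-learn; `KMeans.inertia_` is the fitted SSW) is to compute the explained variation $SSB(K)/SST = 1 - SSW(K)/SST$ for a range of $K$ and look for the point where the curve flattens:

```python
# Sketch: explained variation as a function of the number of clusters K.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.4, (50, 2)) for m in (0.0, 3.0, 6.0)])  # toy data with 3 groups

sst = np.sum((X - X.mean(axis=0)) ** 2)               # total sum of squares
for K in range(1, 8):
    ssw = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).inertia_
    print(K, 1 - ssw / sst)                           # explained variation SSB/SST
```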

2.3 K-medoids aka PAM

A closely related clustering method is $K$-medoids or PAM ("Partitioning Around Medoids"). It works exactly like $K$-means, only that
  • Instead of the estimated group means $\hat{\mu}_k$, one member of each group is selected as its representative (the so-called "medoid").
  • Instead of squared Euclidean distance, other dissimilarity measures are also allowed.
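A simplified sketch of this idea is shown below (a Voronoi-style alternating scheme rather than the full PAM swap search; the function name and toy dissimilarity matrix are our own). It only needs a dissimilarity matrix D as input, so any dissimilarity measure can be used.

```python
# Simplified K-medoids sketch: alternate between assigning points to the
# nearest medoid and picking, within each cluster, the member with the
# smallest total dissimilarity to the rest of the cluster.
# (Empty clusters are not handled, for brevity.)
import numpy as np

def k_medoids(D, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=K, replace=False)     # initial medoids
    for _ in range(max_iter):
        z = D[:, medoids].argmin(axis=1)                     # assign to nearest medoid
        new = np.array([np.where(z == k)[0][D[np.ix_(z == k, z == k)].sum(axis=1).argmin()]
                        for k in range(K)])                  # best representative per cluster
        if np.array_equal(new, medoids):
            break
        medoids = new
    return z, medoids

# toy usage with a Manhattan dissimilarity matrix
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(4.0, 0.5, (30, 2))])
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
labels, medoids = k_medoids(D, K=2)
```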

3. Mixture Models

3.1 Finite mixture model

  • $K$ groups / classes / categories, with $K$ known in advance.
  • Probability of class $k$: $\Pr(z = k) = \pi_k$, with $\sum_{k=1}^K \pi_k = 1$.
  • Each class $k$ has its own pdf with parameters $\theta_k$. Density of class $k$: $f_k(x \mid \theta_k)$.
  • The conditional mean and variance for each class are $\mathrm{E}(x \mid z = k) = \mu_k$ and $\mathrm{Var}(x \mid z = k) = \Sigma_k$.
  • The resulting mixture density for the observed variable $x$ is $f(x) = \sum_{k=1}^K \pi_k f_k(x \mid \theta_k)$.
A very often used model is the Gaussian mixture model (GMM): $f(x) = \sum_{k=1}^K \pi_k N(x \mid \mu_k, \Sigma_k)$.
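For instance, a two-component GMM density can be evaluated as follows (a sketch with made-up parameter values, using `scipy.stats.multivariate_normal`):

```python
# Sketch: evaluating the mixture density f(x) = sum_k pi_k N(x | mu_k, Sigma_k).
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.3, 0.7])                          # class probabilities (sum to 1)
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # class means
Sigma = [np.eye(2), 0.5 * np.eye(2)]               # class covariances

def mixture_pdf(x):
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
               for k in range(len(pi)))

print(mixture_pdf(np.array([1.0, 1.0])))
```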

3.2 Total mean, Total variance and Total Variation

Using the law of total expectation, we obtain the mean of the mixture density as follows: $\mathrm{E}(x) = \mathrm{E}_z[\mathrm{E}(x \mid z)] = \sum_{k=1}^K \pi_k \mu_k =: \bar{\mu}$
Using the law of total variance, we compute the marginal variance: $\mathrm{Var}(x) = \mathrm{E}_z[\mathrm{Var}(x \mid z)] + \mathrm{Var}_z[\mathrm{E}(x \mid z)] = \sum_{k=1}^K \pi_k \Sigma_k + \sum_{k=1}^K \pi_k (\mu_k - \bar{\mu})(\mu_k - \bar{\mu})^T$
The total variance thus decomposes into the explained (between-group) variance $\sum_k \pi_k (\mu_k - \bar{\mu})(\mu_k - \bar{\mu})^T$ and the unexplained (within-group) variance $\sum_k \pi_k \Sigma_k$.
Total variation
The total variation is the trace of the covariance matrix: $\mathrm{Tr}(\mathrm{Var}(x)) = \mathrm{Tr}\big(\sum_k \pi_k \Sigma_k\big) + \mathrm{Tr}\big(\sum_k \pi_k (\mu_k - \bar{\mu})(\mu_k - \bar{\mu})^T\big)$
If the covariances are replaced by their empirical estimates, we obtain the decomposition of the total variation familiar from $K$-means: $SST = SSW(K) + SSB(K)$.
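The decomposition can be checked numerically; the sketch below (made-up parameters) compares the analytic mixture mean and variance with Monte-Carlo estimates from simulated data.

```python
# Numerical check of the law of total expectation / total variance for a GMM.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [3.0, 1.0]])
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

# analytic mean, within-group and between-group variance
mean = (pi[:, None] * mu).sum(axis=0)
within = (pi[:, None, None] * Sigma).sum(axis=0)
between = sum(pi[k] * np.outer(mu[k] - mean, mu[k] - mean) for k in range(len(pi)))

# Monte-Carlo estimates from simulated mixture data
counts = rng.multinomial(50_000, pi)
x = np.vstack([rng.multivariate_normal(mu[k], Sigma[k], size=counts[k]) for k in range(len(pi))])
print(mean, x.mean(axis=0))
print(within + between, np.cov(x, rowvar=False))
```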

3.3 Sampling and Predicting using a mixture model

Assume we know the component densities $f_k(x \mid \theta_k)$ and the class probabilities $\pi_1, \dots, \pi_K$, so that the mixture model is $f(x) = \sum_{k=1}^K \pi_k f_k(x \mid \theta_k)$. The sampling procedure is done in two steps:
  1. Draw the latent class indicator $z$ from a categorical distribution with parameter $\pi = (\pi_1, \dots, \pi_K)$.
    1. Here, the vector $z = (z_1, \dots, z_K)$ indicates a hard 0/1 group allocation, with all components equal to 0 except for a single entry equal to 1.
  2. Subsequently, sample $x$ from the component selected in step 1: $x \mid z_k = 1 \sim f_k(x \mid \theta_k)$.
This two-stage sampling approach is also known as a hierarchical generative model for a mixture distribution.
Assume that the mixture model is known (either in advance or after fitting it to data). Bayes' theorem allows us to predict the probability that an observation $x$ falls into group $k$: $\Pr(z_k = 1 \mid x) = \frac{\pi_k f_k(x \mid \theta_k)}{\sum_{l=1}^K \pi_l f_l(x \mid \theta_l)}$
These posterior probabilities provide a so-called soft assignment of the sample to all $K$ classes, rather than a 0/1 hard assignment to a specific class.
To obtain a hard clustering and to infer the most probable latent state, we select the class with the highest posterior probability: $\hat{k} = \arg\max_k \Pr(z_k = 1 \mid x)$
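The two-stage sampling scheme and the Bayes posterior can be sketched as follows (made-up two-component GMM parameters):

```python
# Sketch: hierarchical sampling from a mixture and soft/hard prediction via Bayes' theorem.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])
mu = [np.zeros(2), np.array([3.0, 3.0])]
Sigma = [np.eye(2), np.eye(2)]

# step 1: draw the latent class from a categorical distribution
k = rng.choice(len(pi), p=pi)
# step 2: sample x from the selected component
x = rng.multivariate_normal(mu[k], Sigma[k])

# prediction: posterior class probabilities given x (soft assignment)
lik = np.array([multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j]) for j in range(len(pi))])
posterior = pi * lik / np.sum(pi * lik)
hard = posterior.argmax()                      # hard assignment: most probable class
print(k, posterior, hard)
```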

3.4 Fitting the mixture model to observed data

Choosing the number of classes
Since GMMs operate in a likelihood framework, we can use penalised-likelihood model selection criteria to choose among different models. The most popular choices are AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion), defined as $\mathrm{AIC} = -2 \log \hat{L} + 2d$ and $\mathrm{BIC} = -2 \log \hat{L} + d \log n$,
where $d$ is the number of model parameters (which grows with the number of groups $K$), $n$ is the sample size, and $\hat{L}$ is the maximised value of the likelihood function.
In order to choose a suitable model, we evaluate different models with different $K$ and then choose the model that minimises the AIC or BIC.
Another way of choosing the optimal number of clusters is by cross-validation.
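A sketch of the AIC/BIC selection procedure using scikit-learn's `GaussianMixture` (which provides `aic()` and `bic()` methods) on made-up data:

```python
# Sketch: choosing the number of mixture components by minimising BIC (or AIC).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])  # toy data, 2 groups

for K in range(1, 6):
    gmm = GaussianMixture(n_components=K, random_state=0).fit(X)
    print(K, gmm.bic(X), gmm.aic(X))          # choose the K with the smallest value
```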
Maximum Likelihood Estimation
The direct way to fit a mixture model by maximum likelihood is to maximise the observed-data log-likelihood function with regard to the parameters $\theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K)$: $\log L(\theta \mid x_1, \dots, x_n) = \sum_{i=1}^n \log f(x_i \mid \theta) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f_k(x_i \mid \theta_k)$,
where the observed-data likelihood can be computed from the complete-data likelihood function by marginalising over the latent allocations $z$.
The form of the observed-data log-likelihood function prevents analytic simplifications (note the sum inside the logarithm), and the maximum likelihood estimate can thus be difficult to compute directly.
By contrast, if the latent allocations $z_{ik}$ were known, we could work with the complete-data log-likelihood $\log L(\theta \mid x, z) = \sum_{i=1}^n \sum_{k=1}^K z_{ik} \big( \log \pi_k + \log f_k(x_i \mid \theta_k) \big)$, which is much easier to maximise.
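For completeness, the observed-data log-likelihood can be evaluated numerically for given parameters; the sketch below (made-up parameter values) uses the log-sum-exp trick for numerical stability.

```python
# Sketch: observed-data log-likelihood of a GMM, sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

pi = np.array([0.5, 0.5])
mu = [np.zeros(2), np.array([4.0, 0.0])]
Sigma = [np.eye(2), np.eye(2)]

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (200, 2))                        # some data

# log(pi_k) + log N(x_i | mu_k, Sigma_k) for every sample i and component k
log_terms = np.column_stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
                             for k in range(len(pi))])
loglik = logsumexp(log_terms, axis=1).sum()
print(loglik)
```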

3.5 EM Algorithm

The EM algorithm is a general technique to handle latent variables.
Initialisation
  • Start with a guess of the parameters $\hat{\theta}^{(0)}$, then continue with Part A of the "E" step.
  • Alternatively, start with a guess of the soft allocations $p_{ik}$ for each sample $x_i$, collected in the matrix $P$, then continue with Part B of the "E" step.
E (expectation) step
  • Part A: Use Bayes' theorem to compute the new probabilities $p_{ik}$ of allocation to class $k$ for all samples $x_i$: $p_{ik} = \Pr(z_{ik} = 1 \mid x_i, \hat{\theta}) = \frac{\hat{\pi}_k f_k(x_i \mid \hat{\theta}_k)}{\sum_{l=1}^K \hat{\pi}_l f_l(x_i \mid \hat{\theta}_l)}$
    • Note that to obtain $p_{ik}$, the current estimate $\hat{\theta}$ of the parameters of the mixture model is required.
  • Part B: Construct the expected complete-data log-likelihood function for $\theta$ using the soft allocations collected in the matrix $P$: $Q(\theta) = \sum_{i=1}^n \sum_{k=1}^K p_{ik} \big( \log \pi_k + \log f_k(x_i \mid \theta_k) \big)$
    • Note that in the case that the soft allocations turn into hard 0/1 allocations, $Q(\theta)$ becomes equivalent to the complete-data log-likelihood.
M (Maximisation) step
Maximise the expected complete-data log-likelihood to update the estimates of the mixture model parameters: $\hat{\theta} = \arg\max_\theta Q(\theta)$
Continue with the E step and iterate until the series of parameter estimates $\hat{\theta}^{(1)}, \hat{\theta}^{(2)}, \dots$ has converged.
Note that to avoid singularities in the expected log-likelihood function, we may need to adopt regularisation (i.e. penalised maximum likelihood or Bayesian learning) for estimating the parameters in the M-step.
EM algorithm for GMM
The EM algorithm for the GMM can be expressed analytically.
E-step:
Update the soft allocations: $p_{ik} = \frac{\hat{\pi}_k N(x_i \mid \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{l=1}^K \hat{\pi}_l N(x_i \mid \hat{\mu}_l, \hat{\Sigma}_l)}$
M-step:
The number of samples assigned to class $k$ in the current step is $n_k = \sum_{i=1}^n p_{ik}$.
Note that this is not necessarily an integer because of the soft allocation of samples to groups. The updated estimate of the group probabilities is $\hat{\pi}_k = \frac{n_k}{n}$.
The updated estimate of the mean is $\hat{\mu}_k = \frac{1}{n_k} \sum_{i=1}^n p_{ik} x_i$,
and the updated covariance estimate is $\hat{\Sigma}_k = \frac{1}{n_k} \sum_{i=1}^n p_{ik} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$.
Note that if $P$ contains hard allocations, meaning that for any sample $x_i$ only one class has weight 1 and all others weight 0, then all the estimators above reduce to the usual empirical estimators.
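Putting the E and M step formulas together, a compact numpy/scipy sketch of the EM iteration for a GMM could look as follows (function and variable names are our own; a small ridge is added to the covariances to avoid the singularities mentioned above).

```python
# Sketch of the EM algorithm for a GMM, following the update formulas above.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0, ridge=1e-6):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=n)                              # random hard allocation to initialise
    pi = np.full(K, 1.0 / K)
    mu = np.array([X[z == k].mean(axis=0) for k in range(K)])
    Sigma = np.array([np.cov(X[z == k], rowvar=False) + ridge * np.eye(d) for k in range(K)])
    for _ in range(n_iter):
        # E step: soft allocations p_ik (responsibilities)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(K)])
        P = dens / dens.sum(axis=1, keepdims=True)
        # M step: update n_k, pi_k, mu_k and Sigma_k
        nk = P.sum(axis=0)                                   # effective number of samples per class
        pi = nk / n
        mu = (P.T @ X) / nk[:, None]
        Sigma = np.array([((P[:, k, None] * (X - mu[k])).T @ (X - mu[k])) / nk[k] + ridge * np.eye(d)
                          for k in range(K)])
    return pi, mu, Sigma, P

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
pi_hat, mu_hat, Sigma_hat, P = em_gmm(X, K=2)
```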

3.6 Connection with K-means clustering method

Assume a simplified model where the probabilities of all classes are equal (i.e. $\pi_k = 1/K$) and where the covariances all have the same spherical form $\Sigma_k = \sigma^2 I$. Thus, the covariance does not depend on the group, there is no correlation between the variables, and the variance of all variables is the same.
In the E step, using this mixture model, the soft assignment for the class allocation becomes $p_{ik} \propto \exp\big(-\tfrac{1}{2\sigma^2} \|x_i - \hat{\mu}_k\|^2\big) \cdot \text{const}$,
where const does not depend on $k$. This can be turned into a hard class allocation by $z_i = \arg\max_k p_{ik} = \arg\min_k \|x_i - \hat{\mu}_k\|^2$,
which is exactly the $K$-means rule to allocate samples to groups.
In the M step, we compute the parameters of the model. If the class allocations are hard, the expected complete-data log-likelihood becomes the complete-data log-likelihood, and the MLE of the group mean is the average of the samples in that group: $\hat{\mu}_k = \frac{1}{n_k} \sum_{i \in C_k} x_i$.
Thus, $K$-means can be viewed as an EM-type algorithm that provides a hard classification based on a simple restricted Gaussian mixture model.

3.7 Invariant States of GMM

Although the EM algorithm is guaranteed to converge monotonically to a local optimum of the observed-data log-likelihood, there are also situations in which there is no convergence at all, because the EM algorithm remains in an invariant state.
An example of such an invariant state for the GMM is the uniform initialisation of the latent variables, $p_{ik} = 1/K$, where $K$ is the number of classes.
With this, in the M step we get $n_k = n/K$ and the parameter estimates $\hat{\pi}_k = 1/K$, $\hat{\mu}_k = \bar{x}$ (the global sample mean) and $\hat{\Sigma}_k = \hat{\Sigma}$ (the overall empirical covariance).
None of these estimates actually depends on the group $k$. Thus, in the E step, when the next soft allocations are determined, this leads again to $p_{ik} = 1/K$.
After one cycle of the EM algorithm, we arrive at the same soft allocation that we started with, and the algorithm is trapped in an invariant state. Therefore, uniform initialisation should clearly be avoided.
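A quick numerical illustration of this (made-up data, uniform initialisation $p_{ik} = 1/K$) shows that one full EM cycle returns exactly the same uniform allocations:

```python
# Demonstration: uniform soft allocations are an invariant state of EM for a GMM.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
n, d = X.shape
K = 2
P = np.full((n, K), 1.0 / K)                   # uniform initialisation of the soft allocations

# M step: every component receives identical parameters
nk = P.sum(axis=0)                             # = n / K for every k
pi = nk / n                                    # = 1 / K
mu = (P.T @ X) / nk[:, None]                   # every row equals the global sample mean
Sigma = np.array([((P[:, k, None] * (X - mu[k])).T @ (X - mu[k])) / nk[k] for k in range(K)])

# E step: the new soft allocations are uniform again
dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)])
P_new = dens / dens.sum(axis=1, keepdims=True)
print(np.allclose(P_new, 1.0 / K))             # True: the algorithm is trapped
```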
