Motivation
Discriminative Models
- Conditional distribution
- Supervised learning: assign correct labels to data
Examples:
- Naive Bayes
- Hidden Markov model
- Support Vector Machine
- Multilayer perceptron
- CNN, RNN, etc.
Generative Models
- Joint distribution
- Unsupervised learning: discover the hidden structure in the data
Examples:
- Restricted Boltzmann Machine, RBM
- Deep Belief Network, DBN
- Deep Boltzmann Machine, DBM
- Denoising Autoencoders
- Generative adversarial nets
Intro to Generative Model
Statistical Estimation
Suppose we have a dataset of observations $\mathbf{X}$. The observations have been generated according to some unknown distribution $p_{\text{data}}$. We use a generative model $p_{\text{model}}$ to mimic $p_{\text{data}}$. If we achieve this goal, we can sample from $p_{\text{model}}$ to generate observations that appear to have been drawn from $p_{\text{data}}$.
$p_{\text{model}}$ should satisfy:
- Rule 1: It can generate examples that appear to have been drawn from $p_{\text{data}}$
- Rule 2: It can generate examples that are suitably different from the observations in $\mathbf{X}$; the model should not simply reproduce things it has already seen.
Sample space: the set of all values an observation can take. For example, for images with pixel values between 0 and 255, the sample space is $[0, 255]^{d}$, where $d$ is the number of pixels. This is a high-dimensional problem (curse of dimensionality; non-parametric methods won't work). However, only a very small subset of this space corresponds to a reasonable image.
A probability density function defined on the sample space is denoted as $p(\mathbf{x})$. Parametric modeling: a family of density functions $p_{\theta}(\mathbf{x})$ with unknown but estimable parameters $\theta$.
Maximum Likelihood Estimate (MLE)
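The likelihood function is $\mathcal{L}(\theta \mid \mathbf{x}) = p_{\theta}(\mathbf{x})$. If a whole dataset $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ contains $n$ independent observations, we have
$$\mathcal{L}(\theta \mid \mathbf{X}) = \prod_{i=1}^{n} p_{\theta}(\mathbf{x}_i)$$
For analytical simplification, we consider the log-likelihood
$$\ell(\theta \mid \mathbf{X}) = \sum_{i=1}^{n} \log p_{\theta}(\mathbf{x}_i)$$
The maximum likelihood estimate (MLE) is found by
$$\hat{\theta} = \arg\max_{\theta} \ell(\theta \mid \mathbf{X})$$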
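As a concrete illustration of maximum likelihood, here is a minimal sketch for a one-dimensional Gaussian family. The Gaussian choice, the synthetic data, and the function names are assumptions made for the example, not part of the notes. For this family the maximizer of the log-likelihood has a closed form (sample mean and the variance computed with divisor $n$).

```python
import math
import random


def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. data under N(mu, sigma^2)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared_error / (2 * sigma ** 2)


def gaussian_mle(data):
    """Closed-form maximizer of the Gaussian log-likelihood."""
    n = len(data)
    mu_hat = sum(data) / n
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # divides by n, not n - 1
    return mu_hat, math.sqrt(var_hat)


# Synthetic data (assumption): 1000 draws from N(2.0, 1.5^2).
rng = random.Random(0)
data = [rng.gauss(2.0, 1.5) for _ in range(1000)]

mu_hat, sigma_hat = gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_hat, sigma_hat)
print(mu_hat, sigma_hat, best)
```

Moving either parameter away from the estimate can only lower the log-likelihood, which is what "maximum" means here.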
To get a satisfactory estimate, the assumed model family $p_{\theta}$ must be appropriate for the data. Otherwise, the estimate can be heavily biased.
Naive Bayes
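Consider an example: we have 100 pictures (fashion items), each described by its pixels, so one picture can be written as a high-dimensional vector
$$\mathbf{x} = (x_1, x_2, \dots, x_d)$$
Naive Bayes simply assumes that the pixels are independent of each other,
$$p(\mathbf{x}) = \prod_{j=1}^{d} p(x_j)$$
We can estimate each marginal $p(x_j)$ separately with a non-parametric method, for example the empirical frequency of each pixel value.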
This method usually does not work well for generating fashion pictures, because it ignores the correlations between pixels, especially between neighboring pixels.
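The effect of ignoring correlations can be seen on a toy case. The sketch below assumes binary "images" with only two pixels that are always equal in the training data; the dataset and the helper names are hypothetical. Each pixel marginal is estimated by its empirical frequency and the pixels are then sampled independently.

```python
import random


def fit_pixel_marginals(images):
    """Empirical probability that each pixel equals 1 (binary pixels assumed)."""
    n = len(images)
    d = len(images[0])
    return [sum(img[j] for img in images) / n for j in range(d)]


def sample_image(marginals, rng):
    """Draw every pixel independently, as the Naive Bayes assumption dictates."""
    return tuple(1 if rng.random() < p else 0 for p in marginals)


# Toy data (assumption): two pixels that are perfectly correlated.
images = [(0, 0)] * 50 + [(1, 1)] * 50
marginals = fit_pixel_marginals(images)

rng = random.Random(0)
samples = [sample_image(marginals, rng) for _ in range(2000)]

# Fraction of generated images whose two pixels disagree.
mismatched = sum(1 for s in samples if s[0] != s[1]) / len(samples)
print(marginals, mismatched)
```

The marginals are reproduced exactly, yet roughly half of the generated images have pixels that disagree, a pattern that never occurs in the training data.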
The challenge: how to model the dependence between pixels or, more generally, between features. This is where deep learning comes in.
Representation Learning
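We can think of the features $\mathbf{x} = (x_1, \dots, x_d)$ as being generated by hidden (latent) variables with the structure
$$x_j = f_j(z_1, \dots, z_k), \quad j = 1, \dots, d$$
with $\mathbf{z} = (z_1, \dots, z_k)$, where $z_1, \dots, z_k$ are independent of each other.
All features are generated from a function of the independent random variables $\mathbf{z}$, with
$$\mathbf{x} = f(\mathbf{z})$$
We can think of $\mathbf{z}$ as common factors driving the dependence of the observed features.
If we interpret $f$ as a linear function, we obtain a factor model
$$\mathbf{x} = W\mathbf{z}$$
We can also introduce some independent noise or randomness beyond the common hidden factors
$$\mathbf{x} = f(\mathbf{z}) + \boldsymbol{\epsilon}$$
where $\boldsymbol{\epsilon}$ is independent of $\mathbf{z}$.
Usually, $k \ll d$. Alternatively, we can think of $\mathbf{z}$ as a lower-dimensional latent representation of $\mathbf{x}$. For example, we can map each biscuit in the figure below to a point in 2-dimensional space, where one coordinate represents the diameter and the other represents the height.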
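A small simulation can make the factor-model idea concrete. The sketch below assumes a single standard-normal latent factor, Gaussian noise, and a hypothetical loading matrix `W` with three observed features; none of these specifics come from the notes. It shows that features driven by one common factor end up strongly correlated even though the factor and the noise are drawn independently.

```python
import random


def sample_factor_model(W, noise_std, rng):
    """Draw x = W z + eps with z ~ N(0, I) and eps ~ N(0, noise_std^2 I) (assumed)."""
    k = len(W[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    return [sum(w * zi for w, zi in zip(row, z)) + rng.gauss(0.0, noise_std) for row in W]


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical loadings: d = 3 observed features, k = 1 latent factor.
W = [[1.0], [0.8], [-0.5]]

rng = random.Random(0)
xs = [sample_factor_model(W, 0.1, rng) for _ in range(5000)]
x1 = [x[0] for x in xs]
x2 = [x[1] for x in xs]
x3 = [x[2] for x in xs]

corr_12 = correlation(x1, x2)  # same sign of loading: strongly positive
corr_13 = correlation(x1, x3)  # opposite sign of loading: strongly negative
print(corr_12, corr_13)
```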
Autoencoder
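Consider a high-dimensional observation, say an image, denoted as $\mathbf{x}$.
- Encoder: we build a representation map $\mathbf{x} \mapsto \mathbf{z}$, denoted as $\mathbf{z} = g_{\phi}(\mathbf{x})$; we can use a neural network to estimate the mapping
- Decoder: $\mathbf{z} \mapsto \hat{\mathbf{x}}$, denoted as $\hat{\mathbf{x}} = f_{\theta}(\mathbf{z})$
To train the model, we construct a reconstruction loss function, for example the squared error
$$L(\theta, \phi) = \sum_{i=1}^{n} \left\| \mathbf{x}_i - f_{\theta}(g_{\phi}(\mathbf{x}_i)) \right\|^2$$
Usually, the dimension of $\mathbf{z}$ is much lower than the dimension of $\mathbf{x}$.
The idea is appealing. However, if we just project each $\mathbf{x}$ to a single point in the lower-dimensional space, the projections are sparse and discontinuous, which makes the latent space unstable to sample from.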
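The encoder/decoder/loss structure can be sketched with the smallest possible case: a linear autoencoder that compresses 2-dimensional points to a 1-dimensional code. Everything specific here (linear maps instead of neural networks, the synthetic data, the learning rate, the function names) is an assumption chosen to keep the example short; it is trained by plain gradient descent on the mean squared reconstruction error.

```python
import random


def encode(a, x):
    """Encoder: linear map from a 2-d point to a scalar code z."""
    return sum(ai * xi for ai, xi in zip(a, x))


def decode(b, z):
    """Decoder: linear map from the scalar code back to 2-d."""
    return [bj * z for bj in b]


def mean_loss(a, b, data):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        x_hat = decode(b, encode(a, x))
        total += sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x))
    return total / len(data)


def train(data, steps=500, lr=0.05):
    """Full-batch gradient descent on the mean reconstruction loss."""
    a, b = [0.1, 0.1], [0.1, 0.1]
    n = len(data)
    for _ in range(steps):
        grad_a, grad_b = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z = encode(a, x)
            residual = [bj * z - xj for bj, xj in zip(b, x)]
            dz = sum(2 * r * bj for r, bj in zip(residual, b))  # d loss / d z
            for j in range(2):
                grad_b[j] += 2 * residual[j] * z / n
                grad_a[j] += dz * x[j] / n
        a = [ai - lr * g for ai, g in zip(a, grad_a)]
        b = [bj - lr * g for bj, g in zip(b, grad_b)]
    return a, b


# Synthetic data (assumption): points close to the line x2 = 2 * x1.
rng = random.Random(0)
data = []
for _ in range(200):
    x1 = rng.uniform(-1.0, 1.0)
    data.append([x1, 2.0 * x1 + rng.gauss(0.0, 0.05)])

initial_loss = mean_loss([0.1, 0.1], [0.1, 0.1], data)
a, b = train(data)
final_loss = mean_loss(a, b, data)
print(initial_loss, final_loss)
```

Because the data lie almost on a line, a single latent coordinate is enough to reconstruct them with small error. Real autoencoders replace the two linear maps with neural networks.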
Variational Autoencoder
Unlike the autoencoder, the Variational Autoencoder adds a second loss term (a KL divergence) on top of the reconstruction loss that captures the difference between the original and the generated observation.
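Kullback-Leibler (KL) divergence: a measure of how much one probability distribution differs from another. Consider two distributions $P$ and $Q$ defined on the same space $\mathcal{X}$. The KL divergence is defined as ($Q$ relative to $P$)
$$D_{KL}(Q \,\|\, P) = \sum_{x \in \mathcal{X}} Q(x) \log \frac{Q(x)}{P(x)}$$
Notes:
- $D_{KL}$ is not symmetric, so it is not a "distance" in the traditional sense
- $D_{KL}(Q \,\|\, P) \ge 0$, and equality holds if and only if $Q = P$
- In the continuous version, with densities $q$ and $p$, the KL divergence is defined as
$$D_{KL}(Q \,\|\, P) = \int q(x) \log \frac{q(x)}{p(x)} \, dx$$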
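The two properties in the notes (asymmetry and non-negativity) are easy to check numerically. This is a minimal sketch for discrete distributions given as lists of probabilities; the convention $D_{KL}(Q \,\|\, P) = \sum_x Q(x) \log (Q(x)/P(x))$ and the example distributions are assumptions of the sketch.

```python
import math


def kl_divergence(q, p):
    """D_KL(Q || P) = sum_x Q(x) * log(Q(x) / P(x)) for discrete distributions.

    Terms with Q(x) = 0 contribute zero; P(x) must be positive wherever Q(x) is.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Two example distributions on a two-point space (assumption).
q = [0.5, 0.5]
p = [0.9, 0.1]

kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)
kl_qq = kl_divergence(q, q)
print(kl_qp, kl_pq, kl_qq)
```

Swapping the arguments gives a different value, and the divergence of a distribution from itself is zero.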
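Beyond the reconstruction loss, VAE includes an additional loss (KL divergence) that captures how far the embedded distribution deviates from the standard normal distribution. We want the embeddings to sit around zero in the lower-dimensional space with some symmetry, so that it is easy to sample points. With the encoder mapping each observation to a normal distribution $N(\mu, \operatorname{diag}(\sigma^2))$ in a $k$-dimensional latent space, the KL divergence used is
$$D_{KL}\big(N(\mu, \operatorname{diag}(\sigma^2)) \,\|\, N(0, I)\big) = \frac{1}{2} \sum_{j=1}^{k} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)$$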
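For a diagonal Gaussian encoder, the KL divergence to the standard normal has a well-known closed form, $\frac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1)$. The sketch below evaluates it, assuming the encoder outputs a mean vector and a log-variance vector (a common parameterization, though not one stated in the notes); the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form.

    mu and log_var are equally long lists (assumed encoder outputs).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


# Already standard normal: no penalty.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Mean shifted away from zero: penalty grows with the squared shift.
kl_shift = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])

# Variance different from one: also penalized.
kl_wide = kl_to_standard_normal([0.0], [math.log(4.0)])
print(kl_zero, kl_shift, kl_wide)
```

The penalty is zero only when the embedded distribution is exactly the standard normal, which is what pulls the embeddings toward the origin.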
Summary of VAE
VAE adds two elements:
- Randomness in the encoder: each original point is mapped to a normal distribution in latent space, with mean and variance estimated by the neural network. This helps avoid discontinuity
- KL divergence: an additional penalty that measures the deviation from the standard normal distribution. This helps avoid large shifts away from zero and encourages some symmetry
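The first element, mapping a point to a distribution rather than to a single latent point, is usually implemented by drawing the latent vector as mean plus scaled standard-normal noise. The sketch below assumes this sampling scheme and assumes the encoder returns a mean and a log-variance per latent dimension; neither detail is spelled out in the notes, and the function name is hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps, eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


# Hypothetical encoder output for one observation: 2-d latent space.
mu = [1.0, -2.0]
log_var = [0.0, math.log(0.25)]  # standard deviations 1.0 and 0.5

rng = random.Random(0)
draws = [sample_latent(mu, log_var, rng) for _ in range(5000)]

# The draws scatter around mu, so nearby latent points decode the same input.
mean_0 = sum(z[0] for z in draws) / len(draws)
mean_1 = sum(z[1] for z in draws) / len(draws)
print(mean_0, mean_1)
```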
VAE representation has good interpretations:
- The vector in the latent space can be used to capture high-level features
- The transition is smooth when you move from one point in latent space to another along a straight line