TOC
1. Math Basics1.1 Math objects & Simple Operations1.2 Random variable & Probability1.3 Gradient-based optimization2. Machine Learning Basics2.1 Learning algorithms2.2 Capacity, overfitting and underfitting2.3 Hyperparameters2.4 Validation sets2.5 Maximum likelihood estimation (MLE)2.6 Stochastic gradient descent (SGD)
1. Math Basics
1.1 Math objects & Simple Operations
1.1.1 Math Objects
- Scalar: a single number, often denoted by a lower case letter without boldface, e.g.
- Vector: an array of numbers, often denoted by a lowercase letter with boldface, e.g.
- Matrix: a 2D array of numbers, often denoted by an uppercase letter with boldface, e.g.
- Tensor: a n-D array of numbers, often denoted like
1.1.2 Simple operations
- Matrix transpose: :
- Matrix product: if and , then with shape and
- Elementwise product (Hadamard product): where the 3 matrices are of the same shape and
1.2 Random variable & Probability
A random variable is a variable that can take on different values randomly (definition in this course, which is not identical to definition in probability measure theories).
1.2.1 Probability distributions
- PMF (Probability Mass Function): a function that maps from a discrete random variable to the probability of each possible outcome
- The prob that is denoted as or
- PDF (Probability Density Function): a function that maps from a continuous random variable to the probability density at each point in the space of possible outcomes
- The prob that lies in the interval is given by
1.2.2 Expectation
the average value that takes on when is drawn from ,
- for discrete variables
- For continuous variables
Expectation is linear: if do not dependent on , then
1.2.3 Typical Prob distributions
- Bernoulli distribution: a discrete distribution of a single trial that has only two possible outcomes, typically labeled 0 and 1
- Multinoulli or categorical distribution: over a single discrete variable with different states where is finite
- , where (one hot vector having length of k) and
- Normal distribution: a continuous distribution
- ,
- The central limit theorem shows that the sum of many independent variables is approximately normally distributed.
1.2.4 Marginal probability
- Let denote the prob distribution of discrete random variables , then
- Let denote the PDF of continuous random variables , then
1.2.5 Conditional probability
The conditional probability is the probability of some event, given that some other event has happened.
- The conditional prob that given is
- Chain Rule:
1.3 Gradient-based optimization
The funciton we want to minimize or maximize is called objective function. When we are minimizing it, we may also call it the cost function, loss function, or error function.
The derivative of a function , deonted by or , gives the slope, or gradient, of at the point .
1.3.1 Gradient descent
where is the learning rate. For a function , the gradient decent method becomes
where and
1.3.2 Derivative of two-step composition
- Independent variables:
- Each is a function of
- Each is a function of
for any and
2. Machine Learning Basics
2.1 Learning algorithms
βA computer program is said to learn from experience w.r.t. some class of tasks and performance measure , if its performance at tasks in , as measured by , improves with experience .β β Tom Mitchell, 1997
- Machine learning (ML) tasks are usually described in terms of how the ML system should process an example
- An example is a collection of features that have been quantitatively measured from some object or event
Classic Tasks T:
- Classification: Suppose there are categories. Find a function
- Regression: Find a function , and m is often 1. Regression results might be converted to classification results
- Synthesis and sampling:
- Denoising
- Transcription
- Machine translation
Performance measure, P
A perfomance measure is required to quantitatively evaluate the performance of a ML system. Usually this measure is specific to the task being carried out by the system.
- Classification and transcription: accuracy or error rate
- Regression and denoising: distance between the ground-truth and prediction
- Synthesis, manchine translation: difficult and sometimes need human evaluation
Experience, E
ML algorithms can be broadly categorized as unsupervised and supervised by what kind of experience they are allowed to have during the learning process.
- Unsupervised learning: learn
- Supervised learning: learn
2.2 Capacity, overfitting and underfitting
Regularization
We often build a set of preferences into the learning algorithm, which is embodied by a regularized . For polynomial regression, the total cost function becomes
where . Note that , where the regularization term gets its name βweight decayβ.
2.3 Hyperparameters
Many ML algorithms have two sets of parameters:
- Hyperparameters: control the algorithmβs behavior and are note adapted by the algorithm itself. They often determines the capacity of the model
- Learnable parameters: can be learned from data
2.4 Validation sets
We need another set on which the model is not trained on to choose the hyperparameters.
2.5 Maximum likelihood estimation (MLE)
Problem definition
- Given a set of examples drawn independently from the true but unkown data-generating distribution
- Find a prob distribution to approximate
- Task: find optimal
Assumption (logic): The observed data samples are generated from with the maximum probability over all possible .
The MLE for is defined as:
Log-likelihood is usually used
where is the empirical distribution. The equation above is equivalent to
Conditional log-likelihood
Estimate a conditional probability in order to predict given . For example, for classification, is a (discrete) random variable representing label of an input .
If represents all inputs and represents all observed targets, then the conditional maximum likelihood estimator is
If the examples are assumed to be i.i.d., then this can be decomposed into
2.6 Stochastic gradient descent (SGD)
When minimizing the cost function over the entire training set is computationlly expensive (sometimes even impossible), we can decompose the training set into minibatches and optimize the cost function defined over individual minibatches
ranges from 1 to a few hundreds.
At every iteration, update as follows:
Advantages of SGD:
- Avoid large memory requirement when dealing with large training data
- Stochasticity is beneficial for escaping from βtrapsβ
Loading Comments...