
MLII-2

SVM

Maximum margin classifier
When two classes can be perfectly linearly separated, we can simply draw a linear boundary (a hyperplane) between them with the largest possible margin.
The separating hyperplane depends only on the points lying right on the margin (the support vectors).
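For reference, the maximal margin problem can be written as the optimization below (a standard formulation; the symbols $\beta_0$, $\beta$, $M$ are my notation, not from the original notes):

$$
\begin{aligned}
\max_{\beta_0,\, \beta,\, M} \quad & M \\
\text{s.t.} \quad & \lVert \beta \rVert_2 = 1, \\
& y_i \big(\beta_0 + \beta^\top x_i\big) \ge M, \qquad i = 1, \dots, n,
\end{aligned}
$$

so the margin $M$ is maximized while every training point sits on its correct side of the boundary, at distance at least $M$ from it.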
Support vector classifier
A soft margin is introduced to allow some violations. While a strict margin gives the boundary points too much influence, a soft margin lets us tune the variance of the boundary.
A slack variable $\epsilon_i$ encodes each point's location:
  • $\epsilon_i = 0$: outside the margin (on the correct side)
  • $0 < \epsilon_i \le 1$: inside the margin
  • $\epsilon_i > 1$: across the boundary (misclassified)
A tuning parameter $C$ determines the “softness” of the margin. Large $C$ makes it easier for points to cross the margin (and even the boundary). $C$ also trades off bias and variance:
  • Large $C$ (like plots 1 and 2, allowing more data points within the margins) gives high bias and low variance
  • Small $C$ (like plots 3 and 4, allowing fewer data points within the margins) gives low bias and high variance
(Figure: support vector classifier fits, plots 1–4, with $C$ decreasing from plot 1 to plot 4.)
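As a quick illustration of margin softness (this snippet is my own, not from the notes), we can fit a linear support vector classifier at several regularization settings. One caveat: scikit-learn's `C` multiplies the slack penalty, so it is roughly the inverse of the budget-style $C$ above; a small scikit-learn `C` corresponds to a soft margin with many violations.

```python
# Compare margin softness on toy 2-D data. Note: sklearn's C penalizes slack,
# so SMALL sklearn C <-> soft margin (many violations), LARGE C <-> hard margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for c in [100.0, 1.0, 0.01]:          # hard-ish margin -> very soft margin
    clf = SVC(kernel="linear", C=c).fit(X, y)
    print(f"sklearn C={c:>6}: support vectors = {clf.n_support_.sum()}")
```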
Kernel Methods
The support vector classifier gives linear boundaries. If we want a non-linear boundary, we can use kernel methods; the combination is the support vector machine (SVM).
The idea behind obtaining a non-linear boundary is to map the data points into a higher-dimensional space where the classes can be separated by a linear boundary.
It can be shown that the SVM depends on the data only through the inner products $\langle x_i, x_j \rangle$ between pairs of observations $x_i$ and $x_j$. Suppose we knew a nice, high-dimensional transformation $\phi$, and we wanted to fit a linear SVM in that space to get better separation. We would need the inner products $\langle \phi(x_i), \phi(x_j) \rangle$ in that higher-dimensional space. To achieve this, we can define a kernel function $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ that computes these inner products directly, without explicitly forming the high-dimensional vectors.
  • Linear kernel: $K(x_i, x_j) = \langle x_i, x_j \rangle$
  • Polynomial kernel: $K(x_i, x_j) = \big(1 + \langle x_i, x_j \rangle\big)^d$
  • Radial kernel (RBF): $K(x_i, x_j) = \exp\!\big(-\gamma \lVert x_i - x_j \rVert^2\big)$
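As a small sketch (my own example, not from the notes), the same scikit-learn classifier can be fit with each of these kernels on a toy dataset where the classes form concentric circles, so only the non-linear kernels separate them well:

```python
# Fit SVMs with the three kernels above on a non-linearly separable dataset.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for kernel, kwargs in [("linear", {}), ("poly", {"degree": 2}), ("rbf", {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, **kwargs).fit(X, y)
    print(f"{kernel:>6} kernel: training accuracy = {clf.score(X, y):.2f}")
```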
Example:
Suppose we observe points $x = (x_1, x_2) \in \mathbb{R}^2$, but we want to work in the space
$$\phi(x) = \big(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\big)$$
so that we can have quadratic boundaries. (The constants are just chosen to make the math nice.)
Suppose we have two points $a = (a_1, a_2)$ and $b = (b_1, b_2)$ in $\mathbb{R}^2$.
If we first transform them into the high-dimensional space, the inner product is
$$\langle \phi(a), \phi(b) \rangle = 1 + 2 a_1 b_1 + 2 a_2 b_2 + a_1^2 b_1^2 + 2 a_1 a_2 b_1 b_2 + a_2^2 b_2^2 = \big(1 + a_1 b_1 + a_2 b_2\big)^2 .$$
If we directly use the kernel function (in this case, the polynomial kernel with $d = 2$), we get exactly the same value, $K(a, b) = \big(1 + \langle a, b \rangle\big)^2$, without ever computing $\phi$.
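A quick numeric check of this identity (my own sketch; the feature map `phi` matches the quadratic expansion above):

```python
# Verify that the degree-2 polynomial kernel equals the inner product in the
# explicit quadratic feature space.
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-D point x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])

lhs = phi(a) @ phi(b)              # inner product in the transformed space
rhs = (1 + a @ b) ** 2             # polynomial kernel with d = 2
print(lhs, rhs)                    # both print the same number (0.25)
```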
N-Class Classification
Note that a single SVM is a two-class classifier. To classify $n$ classes, we can build several SVMs and combine them. Specifically, we have two approaches:
  • One vs. Rest (OvR): for each class, build an SVM classifier whose labels are “this class” vs. “not this class” ($n$ classifiers in total) ⇒ fewer parameters, high bias, low variance
  • One vs. One (OvO): for each combination of two different classes, build an SVM classifier ($\binom{n}{2}$ classifiers in total) ⇒ more parameters, low bias, high variance
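As a sketch (my own example, not from the notes), scikit-learn exposes both strategies as generic wrappers around a binary classifier such as `SVC`:

```python
# Compare OvR and OvO on a toy 4-class problem: OvR trains one SVM per class,
# OvO trains one SVM per pair of classes.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovr.estimators_))   # 4 classifiers (one per class)
print(len(ovo.estimators_))   # 6 classifiers (one per pair of classes)
```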

Anomaly Detection

Suppose we are observing a sequence of observations $x_1, x_2, \dots$ that are supposed to be i.i.d. from some distribution $P$. Anomaly detection is the problem of recognizing observations that look “unusual”.
If we know $P$, we can just look for points that are rare under $P$. For example, if $P$ is Gaussian, we can look for points outside the $\mu \pm 3\sigma$ region.
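A minimal sketch (my own example) of this known-$P$ case, flagging points more than three standard deviations from the mean of a Gaussian sample:

```python
# Anomaly detection when P is (assumed) Gaussian: flag points outside mu +/- 3*sigma.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 1000), [8.0, -6.5]])  # two planted anomalies

mu, sigma = x.mean(), x.std()
anomalies = x[np.abs(x - mu) > 3 * sigma]
print(anomalies)    # should recover roughly the planted points
```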
If we do not know $P$, one option is the one-class SVM.
In the usual SVM, we seek to separate two classes of points with a hyperplane, which can be represented as the optimization below (written in the budget form that matches the $C$ parameter above):
$$\max_{\beta_0,\, \beta,\, \epsilon,\, M} M \quad \text{s.t.} \quad \lVert \beta \rVert = 1, \quad y_i \big(\beta_0 + \beta^\top x_i\big) \ge M (1 - \epsilon_i), \quad \epsilon_i \ge 0, \quad \sum_i \epsilon_i \le C .$$
In the one-class optimization, we instead seek to separate the points from the origin:
  • Project the points $x_i$ to a higher-dimensional space via $\phi$
  • Separate all the data points from the origin in the feature space using a hyperplane
  • Unlike the traditional SVM where we use a soft margin for smoothness, we use a parameter $\nu \in (0, 1]$ that (approximately) fixes the fraction of outliers in the data
  • Maximize the distance between the hyperplane and the origin
  • The points lying below the hyperplane (closer to the origin) are outliers
The resulting optimization is
$$\min_{w,\, \xi,\, \rho} \ \frac{1}{2} \lVert w \rVert^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho \quad \text{s.t.} \quad \langle w, \phi(x_i) \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0,$$
where $\rho$ is the offset (the threshold that decides which points are inside or outside the learned decision boundary).
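As a sketch (my own example, not from the notes), scikit-learn's `OneClassSVM` implements this formulation, with `nu` controlling the fraction of points treated as outliers:

```python
# One-class SVM: learn a boundary around the bulk of the data and flag the rest.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (200, 2))
X_with_outliers = np.vstack([X, [[6.0, 6.0], [-5.0, 7.0]]])  # planted anomalies

oc = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(X_with_outliers)
labels = oc.predict(X_with_outliers)    # +1 = inlier, -1 = outlier
print(np.where(labels == -1)[0])        # indices flagged as anomalies
```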
