This is a machine learning [[Category::Method | model]].
== Summary ==

Support Vector Machines (SVMs) are supervised models that take positive and negative instances of a class in some feature space as training data and learn a linear decision boundary between the two classes. Unlike probability-based methods such as logistic regression, which choose a hyperplane that maximizes the likelihood of the training data, SVMs are margin-based: they learn the hyperplane that maximizes the distance between the decision boundary and the closest positive and negative instances. For cases where the training data is not perfectly separable, they make use of slack variables that allow some instances to fall within or on the wrong side of the margin at a penalty.
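
The slack-variable (soft-margin) formulation referred to here is not spelled out on this page. For reference, writing the decision boundary as <math>\mathbf{w}\cdot\mathbf{x} - b = 0</math> and the training instances as pairs <math>(\mathbf{x}_i, y_i)</math> with <math>y_i \in \{-1,1\}</math>, the standard soft-margin problem is

:<math>\min_{\mathbf{w}, b, \xi} \; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \mbox{subject to} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \,</math>

where each slack variable <math>\xi_i</math> measures how far instance ''i'' falls on the wrong side of its margin, and the constant <math>C</math> controls the trade-off between a wide margin and margin violations.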
== Formal Description ==
We are given some training data <math>\mathcal{D}</math>, a set of ''n'' points of the form

:<math>\mathcal{D} = \left\{ (\mathbf{x}_i, y_i)\mid\mathbf{x}_i \in \mathbb{R}^p,\, y_i \in \{-1,1\}\right\}_{i=1}^n</math>

where ''y''<sub>''i''</sub> is either 1 or −1, indicating the class to which the point <math>\mathbf{x}_i</math> belongs. Each <math>\mathbf{x}_i</math> is a ''p''-dimensional real vector. We want to find the maximum-margin hyperplane that divides the points having <math>y_i=1</math> from those having <math>y_i=-1</math>.

Any such hyperplane can be written as the set of points <math>\mathbf{x}</math> satisfying <math>\mathbf{w}\cdot\mathbf{x} - b = 0</math>, where <math>\mathbf{w}</math> is the normal vector to the hyperplane and <math>\tfrac{b}{\|\mathbf{w}\|}</math> determines the offset of the hyperplane from the origin along <math>\mathbf{w}</math>. We want to choose <math>\mathbf{w}</math> and <math>b</math> to maximize the margin, i.e. the distance between two parallel hyperplanes that are as far apart as possible while still separating the data. This yields the following optimization problem:

Minimize (in <math>\mathbf{w}, b</math>)

:<math>\|\mathbf{w}\|</math>

subject to (for any <math>i = 1, \dots, n</math>)

:<math>y_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1. \, </math>

We can substitute <math>\|\mathbf{w}\|</math> with <math>\tfrac{1}{2}\|\mathbf{w}\|^2</math> without changing the minimizer; the problem then becomes a quadratic programming problem (a convex quadratic objective with linear inequality constraints). Some of this math is adapted from Wikipedia; refer to the corresponding Wikipedia article for more details.
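
To make the quadratic programming formulation concrete, here is a minimal sketch that hands the hard-margin problem above to the CVXPY convex-optimization library; CVXPY and the toy data set are illustrative assumptions, not part of the original page.

<pre>
# A minimal sketch of the hard-margin primal problem, assuming CVXPY is
# available.  The tiny, linearly separable data set is made up for illustration.
import cvxpy as cp
import numpy as np

# x_i in R^2, y_i in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

n, p = X.shape
w = cp.Variable(p)
b = cp.Variable()

# Minimize (1/2)||w||^2 subject to y_i (w . x_i - b) >= 1 for all i.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w - b) >= 1])
problem.solve()

print("w =", w.value, "b =", b.value)
# The distance between the two supporting hyperplanes is 2 / ||w||.
print("margin width =", 2.0 / np.linalg.norm(w.value))
</pre>

The soft-margin version from the Summary can be solved the same way by adding slack variables <math>\xi_i \ge 0</math>, the penalty term <math>C\sum_i \xi_i</math> to the objective, and <math>1 - \xi_i</math> on the right-hand side of each constraint.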
== Training ==
There exist several specialized algorithms for quickly solving the quadratic programming problem that arises from SVMs, most of which rely on heuristics for breaking the problem down into smaller, more manageable chunks. A common method is the Sequential Minimal Optimization (SMO) algorithm, which breaks the problem into two-dimensional sub-problems that can be solved analytically, avoiding the cubic cost of handing the full problem to a generic quadratic programming solver.
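
In practice SMO is rarely implemented from scratch; as an illustration (scikit-learn is an assumption, not something the original page mentions), <code>sklearn.svm.SVC</code> wraps libsvm, whose optimizer is an SMO-style decomposition method:

<pre>
# A minimal sketch, assuming scikit-learn is available.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates the hard-margin problem from the previous section.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# Note: scikit-learn's decision function is w.x + intercept_, so the b used
# on this page (w.x - b) corresponds to -intercept_.
print("w =", clf.coef_[0], "b =", -clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
</pre>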
== Advantages ==
* They are relatively robust to overfitting, since the final model depends only on the support vectors rather than on all the instances of the training set.
* The kernel trick allows custom kernels to be used efficiently for a given task (see the sketch after this list).
* They can easily handle very high-dimensional data.
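
As an illustration of the kernel-trick bullet above (scikit-learn and the particular kernel are assumptions, not part of the original page), <code>SVC</code> accepts a custom kernel as any callable that returns the Gram matrix between two sets of instances:

<pre>
# A minimal sketch of plugging a custom kernel into an SVM, assuming
# scikit-learn is available.  The kernel and data are purely illustrative.
import numpy as np
from sklearn.svm import SVC

def rbf_like_kernel(A, B, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2), computed for every pair (a, b).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# XOR-style data that no linear decision boundary separates.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel=rbf_like_kernel, C=10.0)
clf.fit(X, y)
print(clf.predict(X))   # the nonlinear kernel lets the SVM fit this data
</pre>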
== Applications ==
They are applicable to a variety of binary as well as multiclass classification problems, such as text classification; multiclass problems are typically handled by combining several binary SVMs, for example in a one-vs-rest scheme. They have also been applied to regression problems (support vector regression).
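
A minimal text-classification sketch along these lines, assuming scikit-learn is available (the tiny corpus, labels, and test sentence are made up for illustration):

<pre>
# Linear SVM text classification over sparse, high-dimensional TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["the team won the match", "a thrilling overtime victory",
        "parliament passed the budget", "the senate debated the bill"]
labels = ["sports", "sports", "politics", "politics"]

# LinearSVC handles multiclass problems with a one-vs-rest scheme;
# sklearn.svm.SVR provides the regression variant mentioned above.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["the vote on the new budget"]))
</pre>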
== Relevant Papers ==
{{#ask: [[UsesMethod::Support Vector Machines]]
| ?AddressesProblem
| ?UsesDataset
}}
