Support Vector Machines
Two-category data that can be cleanly divided by a single line with no misclassifications is called linearly separable. The data below are linearly separable. Given the lines shown in the image, the middle line best separates the data.
The other two lines are more likely to be overfit to the data; they may “believe the data too much.” A more specific way of articulating this is to say that the middle line is consistent with the data while committing to it the least.
Hyperplane Derivation
In the example shown below, one can imagine the two grey lines as boundary lines beyond which misclassifications would occur. The line in the middle is the best separator because it leaves the maximum space between these hypothetical boundary lines.
The general equation of a line is $y = mx + b$.
However, we are interested in a formulation that generalizes to hyperplanes in higher dimensions. In particular, the generalized equation is $y = w^T x + b$,
where:
- $y \in \{-1, +1\}$ are the classification labels that indicate which class each data point is in,
- $w$ are the parameters of the plane, and
- $b$ is the parameter of the plane that translates the plane away from the origin.
The actual equation of the separating hyperplane is $w^T x + b = 0$, because this is where the output is zero.
The top and bottom grey lines in the following figure are given by $w^T x + b = 1$ and $w^T x + b = -1$, respectively.
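As a quick, purely hypothetical worked example: suppose $w = (1, 1)$ and $b = -3$ in two dimensions. The separating hyperplane is then the line $x_1 + x_2 = 3$, and the two grey boundary lines are $x_1 + x_2 = 4$ and $x_1 + x_2 = 2$.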
Margin Derivation
The objective, for reasons previously outlined, is to have a support vector machine with the maximum possible distance between the grey hyperplanes. To find an equation for this distance, we subtract the two equations listed above from one another, where $x_1$ is a point on the top grey line and $x_2$ is a point on the bottom grey line, resulting in: $w^T(x_1 - x_2) = 2$
Next, divide both sides by the length of the vector $w$: $\frac{w^T}{\|w\|}(x_1 - x_2) = \frac{2}{\|w\|}$. The left-hand side is the separation between the two grey hyperplanes measured along the unit normal $\frac{w}{\|w\|}$, so the margin is $\frac{2}{\|w\|}$.
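Continuing the hypothetical numbers from above, $\|w\| = \sqrt{1^2 + 1^2} = \sqrt{2}$, so the margin is $\frac{2}{\sqrt{2}} = \sqrt{2} \approx 1.41$, which is exactly the perpendicular distance between the lines $x_1 + x_2 = 4$ and $x_1 + x_2 = 2$.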
Again, the whole notion of finding the optimum decision boundary is the same as finding a line that maximizes the margin, while classifying everything correctly.
Maximizing the Margin
“Classifying everything correctly” is expressed mathematically as: $y_i(w^T x_i + b) \ge 1 \;\; \forall i$
Maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\|w\|$. Further, minimizing $\frac{1}{2}\|w\|^2$ has the same solution and is mathematically more convenient, so the problem becomes: minimize $\frac{1}{2}\|w\|^2$
such that $y_i(w^T x_i + b) \ge 1 \;\; \forall i$.
This is a quadratic programming problem, and it has an equivalent dual formulation: maximize
$$W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$
such that $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.
The solution to the foregoing equation has the following properties.
- Once $\alpha$ has been solved for, $w$ can likewise be recovered: $w = \sum_i \alpha_i y_i x_i$. Once $w$ has been solved, it is easy to recover $b$.
- The $\alpha_i$ are mostly 0. A logical result is that only a few of the vectors $x_i$ in the sum above actually matter, and the “machine” only requires a few “support vectors.” The support vectors are the vectors for which $\alpha_i \ne 0$ (illustrated in the code sketch after this list).
- The support vectors are the vectors near the decision boundary, which actually constrain the line. This makes support vector machines conceptually similar to k-nearest neighbors, except that some work is done up front to compute which points matter.
- The $x_i^T x_j$ part of the function is a dot product of those two vectors, which is essentially a measure of their similarity.
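The following minimal sketch illustrates these properties. The dataset is made up, and scikit-learn is used only as an illustrative solver; the notes do not prescribe a particular library. In scikit-learn's `SVC`, `support_vectors_` holds the $x_i$ with $\alpha_i \ne 0$, and `dual_coef_` stores $\alpha_i y_i$, so $w = \sum_i \alpha_i y_i x_i$ falls out of a single matrix product.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D data (purely illustrative).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates the hard-margin problem derived above.
svc = SVC(kernel="linear", C=1e6).fit(X, y)

# Only a handful of alphas are nonzero; those points are the support vectors.
print("support vectors:\n", svc.support_vectors_)

# dual_coef_ stores alpha_i * y_i for the support vectors, so
# w = sum_i alpha_i * y_i * x_i can be recovered as a matrix product.
w = svc.dual_coef_ @ svc.support_vectors_
print("recovered w:", w)
print("sklearn's w and b:", svc.coef_, svc.intercept_)
```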
Linearly Married
An example of a set of linearly married points is shown below. The idea is that no linear separator can be drawn to correctly classify the points.
In order to use SVMs with data like that shown above, it must be transformed. The original 2-dimensional points, $q = (q_1, q_2)$, are mapped into a higher-dimensional space, for example via $\Phi(q) = (q_1^2, q_2^2, \sqrt{2}\, q_1 q_2)$.
Example
As an example, consider the following two data points: $X = (x_1, x_2)$ and $Y = (y_1, y_2)$.
The squared dot product of the two points is given by $(X^T Y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$,
which, itself, is the dot product of the transformed points $\Phi(X) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$ and $\Phi(Y) = (y_1^2, y_2^2, \sqrt{2}\, y_1 y_2)$.
This means we have transformed from thinking about the linear relationship between the data points to thinking about something that looks a lot more like a circle. The notion of similarity, as captured by $(X^T Y)^2$, is now tied to distance from the origin: for a single point, $X^T X = x_1^2 + x_2^2$ is precisely the equation of a circle.
In this specific example, all the minuses move “up,” out of the screen, a lot. The pluses, being closer to the origin, move up less. As a result, the data can be cleanly separated by a hyperplane.
Additionally, if we transform the $x_i^T x_j$ term in the dual formulation into $(x_i^T x_j)^2$, then it is as if we had projected all of the data into this third dimension, but in reality we never used the function $\Phi$ or computed the higher-dimensional coordinates explicitly.
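The equality above is easy to check numerically. The following sketch (the sample points and the helper `phi` are made up for illustration) computes the squared dot product in the original 2-D space and the ordinary dot product in the transformed 3-D space and confirms they match.

```python
import numpy as np

def phi(p):
    """Hypothetical 2-D -> 3-D transformation from the example above."""
    p1, p2 = p
    return np.array([p1**2, p2**2, np.sqrt(2) * p1 * p2])

X = np.array([1.0, 2.0])   # made-up point
Y = np.array([3.0, -1.0])  # made-up point

kernel_value = (X @ Y) ** 2        # computed entirely in the original 2-D space
explicit_value = phi(X) @ phi(Y)   # computed in the transformed 3-D space

print(kernel_value, explicit_value)  # both print 1.0 for these points
```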
This (kind of) transformation into a higher dimensional space in order to separate the data is called a “Kernel Trick.”
Kernel Trick
The actual “Kernel” is the function $K(x_i, x_j)$ that stands in for the dot product $x_i^T x_j$ in the dual formulation; in the example above, $K(x_i, x_j) = (x_i^T x_j)^2$.
The Kernel takes the parameters $x_i$ and $x_j$, which are two data points, and returns a number.
- The Kernel function is our representation of similarity between those two points.
- It is also a mechanism by which we inject domain knowledge into the support vector machine learning algorithm.
There are several types of kernels, including:
- Polynomial kernel: $K(x_i, x_j) = (x_i^T x_j + c)^p$
- Radial basis kernel: $K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}$
- Sigmoid kernel: $K(x_i, x_j) = \tanh(\alpha\, x_i^T x_j + \theta)$
Since the Kernels are just arbitrary functions that return a number, the inputs $x_i$ and $x_j$ do not even have to be numeric vectors; they can be strings, images, or other structured objects, so long as the Kernel captures a meaningful notion of similarity between them.
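As a brief sketch of how these kernel choices tend to appear in practice, the snippet below fits scikit-learn's `SVC` with each of the kernels listed above on a made-up “linearly married” dataset (one class inside a ring, the other outside). The library and the synthetic data are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up "linearly married" data: one class inside a ring, one outside.
rng = np.random.default_rng(0)
radius = np.concatenate([rng.uniform(0, 1, 50), rng.uniform(2, 3, 50)])
angle = rng.uniform(0, 2 * np.pi, 100)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
y = np.concatenate([-np.ones(50), np.ones(50)])

# The same data, fit with each of the kernels listed above.
for name, model in [
    ("polynomial", SVC(kernel="poly", degree=2)),
    ("radial basis", SVC(kernel="rbf", gamma="scale")),
    ("sigmoid", SVC(kernel="sigmoid")),
]:
    print(name, "training accuracy:", model.fit(X, y).score(X, y))
```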
Mercer Condition
The Mercer Condition is a technical condition, but in essence it means that the Kernel function must act like a well-behaved distance function. This is the only real criterion a Kernel function, generally, must meet.
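One standard way to make this precise, not spelled out in these notes: for any finite set of points, the Gram matrix $K_{ij} = K(x_i, x_j)$ must be symmetric and positive semi-definite. The sketch below (the kernel and points are made up for illustration) checks this for the squared-dot-product kernel from earlier by confirming that the Gram matrix has no negative eigenvalues.

```python
import numpy as np

def gram_matrix(kernel, points):
    """Build the matrix of pairwise kernel values K_ij = K(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in points] for a in points])

# Hypothetical kernel and made-up points, purely for illustration.
squared_dot = lambda a, b: (a @ b) ** 2
points = [np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([0.5, 0.5])]

K = gram_matrix(squared_dot, points)
eigenvalues = np.linalg.eigvalsh(K)  # symmetric matrix -> real eigenvalues
print("smallest eigenvalue:", eigenvalues.min())  # >= 0 (up to rounding) for a valid kernel
```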
For Fall 2019, CS 7641 is instructed by Dr. Charles Isbell. The course content was originally created by Dr. Charles Isbell and Dr. Michael Littman.