Support Vector Machines — It’s not-so-complicated!

KDAG IIT KGP
17 min read · Aug 18, 2021


Have you ever wondered, while working through the ‘house price’ prediction task (price denoted by y) with Linear Regression, whether the price is really a linear combination of features such as ‘size of house’, ‘number of bedrooms’, ‘number of neighbours’ and so on? It is absolutely possible that the price is a non-linear function of the features, isn’t it? The same concern naturally extends to Logistic Regression as well.

Keeping this in mind, let us first understand how to incorporate these non-linear features using a trick called the ‘Kernel Trick’ and then proceed with the beautiful algorithm of ‘Support Vector Machines’.

Kernel Methods

Feature maps

Let us go back to the ‘House Price Prediction’ task where the price of the house y is to be determined by the size of the house x. In this scenario, we assume that the dependency between y and x is non-linear. Furthermore, as an example, we will fit a cubic function of x to estimate y.

It turns out that we can view the cubic function as a linear function over a different set of feature variables. Consider the feature map

φ(x) = [1, x, x², x³]ᵀ

Now, considering 𝞱 to be a four-dimensional vector with entries 𝞱0, 𝞱1, 𝞱2, 𝞱3, the cubic function can be expressed as

𝞱0 + 𝞱1·x + 𝞱2·x² + 𝞱3·x³ = 𝞱ᵀφ(x)

As the original input variable x is mapped to a vector by some function φ in order to incorporate non-linearity, we call this mapped vector φ(x) the feature map.

Cool! We are done with obtaining the non-linear features. The remaining step to estimate the price of the house is simply to fit a Linear Regression model using Gradient Descent on the derived feature map.
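
To make this concrete, here is a minimal sketch in Python of the whole pipeline: build the cubic feature map and fit 𝞱 by batch gradient descent. The toy data, learning rate and iteration count are made-up choices for illustration.

import numpy as np

def feature_map(x):
    # map a scalar input x to the cubic feature vector [1, x, x^2, x^3]
    return np.array([1.0, x, x**2, x**3])

def fit_cubic(xs, ys, lr=1e-3, n_iters=5000):
    # batch gradient descent on the mean squared error over the feature map
    Phi = np.array([feature_map(x) for x in xs])      # shape (m, 4)
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        preds = Phi @ theta
        grad = Phi.T @ (preds - ys) / len(ys)
        theta -= lr * grad
    return theta

# hypothetical house sizes (in 1000 sq. ft.) and prices (arbitrary units)
sizes = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
prices = np.array([30.0, 55.0, 90.0, 140.0, 210.0])
theta = fit_cubic(sizes, prices)
print('fitted theta:', theta)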

Kernel Trick

In the above example, we considered just a single input attribute, i.e., the ‘size of the house’. In many cases, though, the target variable is a function of many input attributes. Mapping all of those attributes into a non-linear space would make the dimension of the feature map very high, thereby increasing the computational complexity of the gradient descent algorithm. In order to make this efficient, let us look at a trick known as the Kernel Trick.

Well, there is a theorem called the Representer Theorem which states that the parameter vector 𝞱 can be expressed as a linear combination of the feature vectors of the training examples. Using this result,

𝞱 = β₁·φ(x(1)) + β₂·φ(x(2)) + … + βₘ·φ(x(m)) = Σᵢ βᵢ·φ(x(i))

where the sum is over all the training examples.

Now, as we apply the gradient descent update rule, it can easily be seen that 𝜽 still remains a linear combination of the feature maps after the update. Mathematically, the batch update (with learning rate α)

𝜽 := 𝜽 + α Σᵢ (y(i) − 𝜽ᵀφ(x(i)))·φ(x(i))

keeps 𝜽 of the form Σᵢ βᵢ·φ(x(i)); only the coefficients βᵢ change.

So, a clever strategy is to represent the high-dimensional parameter vector only implicitly through this linear combination, and at every iteration update just the coefficients, thereby reducing the computational cost. The update rule for the coefficients is quite simple:

βᵢ := βᵢ + α (y(i) − 𝜽ᵀφ(x(i)))

Observe that in the above equation we have not actually computed the new parameter vector; the 𝜽 on the right-hand side is still the linear combination with the old coefficients. Replacing that parameter vector in terms of the old coefficients, we get

βᵢ := βᵢ + α (y(i) − Σⱼ βⱼ·φ(x(j))ᵀφ(x(i)))

so the update never needs anything beyond the inner products φ(x(j))ᵀφ(x(i)).

So, that’s it!! We have reached a point where the algorithm looks efficient. Coming to the end of this part, the inner product between two feature maps is nothing but the ‘Kernel’ of the feature maps, denoted

K(x, z) = φ(x)ᵀφ(z) = ⟨φ(x), φ(z)⟩
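
As a small illustration, here is a sketch of the kernelized update in Python for the cubic feature map from earlier; the full parameter vector is never formed, only the coefficients βᵢ and the kernel values. The learning rate and the helper names are my own illustrative choices.

import numpy as np

def cubic_kernel(x, z):
    # K(x, z) = <phi(x), phi(z)> for phi(x) = [1, x, x^2, x^3],
    # computed without ever building phi explicitly
    return 1.0 + x*z + (x*z)**2 + (x*z)**3

def kernelized_gradient_descent(xs, ys, lr=1e-3, n_iters=5000):
    # theta is represented implicitly as sum_i beta_i * phi(x_i)
    m = len(xs)
    K = np.array([[cubic_kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)])
    beta = np.zeros(m)
    for _ in range(n_iters):
        preds = K @ beta                 # preds[i] = theta^T phi(x_i)
        beta += lr * (ys - preds)        # beta_i += lr * (y_i - theta^T phi(x_i))
    return beta

def predict(x_new, xs, beta):
    # theta^T phi(x_new) = sum_i beta_i * K(x_i, x_new)
    return sum(b * cubic_kernel(xi, x_new) for b, xi in zip(beta, xs))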

Now, with the knowledge of Kernels, let us get an intuition about ‘Support Vector Machines’.

Support Vector Machines

Let us start with a story. Suppose two rival groups of friends are battling it out in a contest. The contest goes like this: the groups are scattered in a field, with friends from the same group standing close to each other. The organizer wants to place a rope between the two groups, and the group that reaches the rope first will be declared the winner. So, how should the organizer place the rope so that it is not unfair to either group?

Well, this scenario can be thought of as the core concept behind Support Vector Machines. Thinking for a while, it is quite intuitive that the rope should be placed as far as possible from the nearest member of each group, or say in between the two members from different groups that are closest to each other. Keeping this in mind, let us mathematically understand how we get the line of rope.

Notation

Earlier, for Logistic Regression, we were using the class labels 0 and 1, but for convenience we will now use a different notation for the classes, namely −1 and +1. We will also be considering a linear classifier, which can be written as

h(x) = g(wᵀx + b)

Here, g(z) = 1 if z is greater than or equal to zero, and g(z) = −1 otherwise. Note that the notation (w, b) lets us handle the intercept term b separately from the other parameters.

Functional and Geometric Margins

In order to arrive at some conditions for support vector machines, we first need to understand the two types of margins i.e., functional margin and geometric margin.

Formally, the functional margin of (w, b) with respect to a particular training example (x(i), y(i)) is defined as

γ̂(i) = y(i)·(wᵀx(i) + b)

The functional margin of (w, b) with respect to a training set of m examples is then defined as the minimum of these individual functional margins,

γ̂ = min over i = 1, …, m of γ̂(i)

Now, consider the picture below,

The decision boundary corresponding to (w, b) is shown in the picture along with the vector w. At this point, convince yourself that w is orthogonal to the line separating the two classes.

Note: The line or plane separating the two classes is often called the hyperplane and for the rest of the article, we will be using hyperplane instead of line or plane.

So, the geometric margin of a particular training example is defined as the distance of that point from the separating hyperplane. Moreover, the geometric margin for the training set is defined as the minimum value of the geometric margins associated with all the training examples.

Now, it’s just high school maths to find the distance of a point from a hyperplane given the normal vector to that hyperplane (in this case, w).

So, the geometric margin of a particular training example turns out to be

γ(i) = y(i)·((w/||w||)ᵀx(i) + b/||w||)

Note that if ||w|| = 1, then the functional margin equals the geometric margin, thus this gives us a way of relating these two different notions of margin. Also, the geometric margin is invariant to rescaling of the parameters; i.e., if we replace w with 2w and b with 2b, then the geometric margin does not change. This will in fact come in handy later. Specifically, because of this invariance to the scaling of the parameters, when trying to fit w and b to training data, we can impose an arbitrary scaling constraint on w without changing anything important; for instance, we can demand that ||w|| = 1, or |w1| = 5, or |w1 + b| + |w2| = 2, and any of these can be satisfied simply by rescaling w and b.

Therefore, the geometric margin for the whole training set is defined as

γ = min over i = 1, …, m of γ(i)
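
A short numerical sketch of both margins (the points, labels and hyperplane below are hypothetical), which also demonstrates the scaling invariance mentioned above:

import numpy as np

def functional_margins(X, y, w, b):
    # y_i * (w^T x_i + b) for every training example
    return y * (X @ w + b)

def geometric_margin(X, y, w, b):
    # smallest signed distance of any training point to the hyperplane w^T x + b = 0
    return np.min(functional_margins(X, y, w, b)) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

print(geometric_margin(X, y, w, b))          # about 1.414
print(geometric_margin(X, y, 2 * w, 2 * b))  # same value: rescaling (w, b) changes nothing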

The Optimal Margin Classifier

It now becomes intuitive that a good classifier would like to maximize the geometric margin for the training set since this would reflect pretty confident predictions. Specifically, this will separate the positive and negative training examples with a ‘gap’.

As a starting case, let us assume that our data is linearly separable, so that some hyperplane can separate it perfectly. Keeping this in mind, we can put forward the following optimization problem:

maximize (over γ, w, b):  γ
subject to:  y(i)·(wᵀx(i) + b) ≥ γ for every i,  and  ||w|| = 1

i.e., we want to maximize the geometric margin subject to the condition that each training example has a functional margin that is at least equal to the geometric margin of the set, along with the constraint ||w|| = 1. Thus, solving the above optimization problem, we can get the parameters w, b that would maximize the geometric margin.

The constraint ||w|| = 1 makes this optimization problem difficult to solve directly. So, in order to get rid of it, we pose the problem in a different form:

maximize (over γ̂, w, b):  γ̂ / ||w||
subject to:  y(i)·(wᵀx(i) + b) ≥ γ̂ for every i

Whoo! Now the objective function becomes nasty. Don’t worry, we also have a way to make it look simple and beautiful. Recall that we can impose an arbitrary scaling constraint on w and b without changing anything. This is the key idea we’ll use now: we introduce the scaling constraint that the functional margin of (w, b) with respect to the training set must be 1, i.e., γ̂ = 1.

Note that maximizing γ̂/||w|| = 1/||w|| is the same as minimizing ||w||². So the final form of our optimization problem for finding w, b is:

minimize (over w, b):  (1/2)·||w||²
subject to:  y(i)·(wᵀx(i) + b) ≥ 1 for every i

JARGON ALERT!!!

Now, we will go on a bumpy ride and deal with some tough and tricky mathematical stuff related to Support Vector Machines.

Lagrange Duality

For a moment, let us put aside SVMs and concentrate on solving optimization problems with constraints.

Consider a problem of the following form:

minimize (over w):  f(w)
subject to:  hᵢ(w) = 0,  i = 1, …, l

You might recall solving such constraint optimization problems using Lagrange multipliers. If you don’t, have a cup of coffee and read ahead!!

We define the Lagrangian to be

L(w, β) = f(w) + Σᵢ βᵢ·hᵢ(w)

Here, the βᵢ’s are the Lagrange multipliers, and we can find the optimum by setting the partial derivatives of the Lagrangian with respect to w and β equal to zero.

Generalizing the optimization problem to include inequality constraints along with the equalities, the so-called primal optimization problem can be posed as

minimize (over w):  f(w)
subject to:  gᵢ(w) ≤ 0,  i = 1, …, k
             hᵢ(w) = 0,  i = 1, …, l

And the general form of the Lagrangian can be written as

L(w, 𝛼, β) = f(w) + Σᵢ 𝛼ᵢ·gᵢ(w) + Σᵢ βᵢ·hᵢ(w)

Here, the ⍺ᵢ’s and βᵢ’s are the Lagrange multipliers. Now, consider the quantity

𝜽P(w) = max over 𝛼, β with 𝛼ᵢ ≥ 0 of L(w, 𝛼, β)

Here, the P subscript stands for ‘primal’. Let some w be given. If w violates any of the primal constraints (i.e., if gᵢ(w) > 0 or hᵢ(w) ≠ 0 for some i), then you should be able to verify that

𝜽P(w) = +∞

Conversely, if the primal constraints are indeed satisfied by w, then

𝜽P(w) = f(w)

Thus, 𝜽P takes the same value as the objective of our problem for all values of w that satisfy the primal constraints, and is positive infinity if the constraints are violated. Hence, if we consider the minimization problem

min over w of 𝜽P(w) = min over w of (max over 𝛼, β with 𝛼ᵢ ≥ 0 of L(w, 𝛼, β))

we see that it is the same problem (and has the same solutions) as our original primal problem. We denote the optimal value of this minimization problem by p*, the value of the primal problem.

Now, let’s look at a slightly different quantity. We define

𝜽D(𝛼, β) = min over w of L(w, 𝛼, β)

Here, the D subscript stands for ‘dual’. Note that whereas in the definition of 𝜽P we were optimizing (maximizing) with respect to 𝛼, β, here we are minimizing with respect to w. We can now pose the dual optimization problem (with optimal value d*):

max over 𝛼, β with 𝛼ᵢ ≥ 0 of 𝜽D(𝛼, β) = max over 𝛼, β with 𝛼ᵢ ≥ 0 of (min over w of L(w, 𝛼, β))

This is exactly the same as our primal problem shown above, except that the order of the ‘max’ and the ‘min’ is now exchanged.

How are the primal and the dual problems related? It can easily be shown that

d* = max over 𝛼, β: 𝛼ᵢ ≥ 0 of (min over w of L(w, 𝛼, β)) ≤ min over w of (max over 𝛼, β: 𝛼ᵢ ≥ 0 of L(w, 𝛼, β)) = p*

At this point, you need to convince yourself that the “max-min” of a function is always less than or equal to its “min-max”. There are, however, conditions under which equality holds.
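
A quick way to see the inequality: for any fixed w̃ and any fixed (𝛼, β) with 𝛼ᵢ ≥ 0,

min over w of L(w, 𝛼, β) ≤ L(w̃, 𝛼, β) ≤ max over 𝛼, β with 𝛼ᵢ ≥ 0 of L(w̃, 𝛼, β),

i.e., 𝜽D(𝛼, β) ≤ 𝜽P(w̃). Since this holds for every (𝛼, β) and every w̃, taking the maximum of the left-hand side over (𝛼, β) and the minimum of the right-hand side over w̃ preserves the inequality, which is exactly d* ≤ p*.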

If f and the gᵢ are convex, the hᵢ are affine, and there exist w*, 𝛼*, β* satisfying the following conditions, known as the Karush-Kuhn-Tucker (KKT) conditions, then w* is a solution of the primal problem, (𝛼*, β*) is a solution of the dual problem, and the two optimal values coincide: p* = d* = L(w*, 𝛼*, β*).

∂L/∂wᵢ (w*, 𝛼*, β*) = 0  for every component wᵢ of w
∂L/∂βᵢ (w*, 𝛼*, β*) = 0,  i = 1, …, l
𝛼ᵢ*·gᵢ(w*) = 0,  i = 1, …, k
gᵢ(w*) ≤ 0,  i = 1, …, k
𝛼ᵢ* ≥ 0,  i = 1, …, k

Drawing attention to the third equation, also called the KKT dual complementarity condition: it implies that if 𝛼ᵢ > 0, then gᵢ(w) = 0. Keep this in mind; it will be used later to show that only a small number of training points end up being support vectors. Don’t worry about the terminology for now, and keep reading.

Optimal Margin Classifiers Revisited

Now, let us revisit the optimal margin classifier for the SVM. Previously, the optimization problem was posed as follows:

minimize (over w, b):  (1/2)·||w||²
subject to:  y(i)·(wᵀx(i) + b) ≥ 1 for every i

The constraints can be written as

gᵢ(w) = −y(i)·(wᵀx(i) + b) + 1 ≤ 0

We have one such constraint for each training example. Note that, from the KKT dual complementarity condition, we will have 𝛼ᵢ > 0 only for the training examples whose functional margin is exactly equal to one (i.e., the ones corresponding to constraints that hold with equality, gᵢ(w) = 0). Consider the figure below, in which a maximum margin separating hyperplane is shown by the solid line.

The points with the smallest margins are exactly the ones closest to the decision boundary; here, the three points (one negative and two positive examples) lie on the dashed lines parallel to the decision boundary. Thus, only three of the 𝛼i’s namely, the ones corresponding to these three training examples will be non-zero at the optimal solution to our optimization problem. These three points are called the support vectors in this problem. The fact that the number of support vectors can be much smaller than the size of the training set will be useful later.

Let’s move on. Looking ahead, as we develop the dual form of the problem, one key idea to watch out for is that we’ll try to write our algorithm in terms of only the inner product, which we discussed in the Kernels section. The fact that we can express our algorithm in terms of these inner products will be key when we apply the kernel trick.

When we construct the Lagrangian for our optimization problem, we have:

L(w, b, 𝛼) = (1/2)·||w||² − Σᵢ 𝛼ᵢ·[ y(i)·(wᵀx(i) + b) − 1 ]

Note that the problem has only inequality constraints and no equality constraints.

Let’s now find the dual form of the problem. To do so, we first need to minimize the Lagrangian with respect to w and b (for fixed 𝛼) to get 𝜽D, which we do by setting the derivatives of the Lagrangian with respect to w and b to zero. We have:

∇w L(w, b, 𝛼) = w − Σᵢ 𝛼ᵢ·y(i)·x(i) = 0

This implies that

w = Σᵢ 𝛼ᵢ·y(i)·x(i)

As for the derivative with respect to b, we obtain

∂L/∂b = Σᵢ 𝛼ᵢ·y(i) = 0

If we take this expression for w, plug it back into the Lagrangian, and simplify, we get

L(w, b, 𝛼) = Σᵢ 𝛼ᵢ − (1/2)·Σᵢ Σⱼ y(i)·y(j)·𝛼ᵢ·𝛼ⱼ·⟨x(i), x(j)⟩ − b·Σᵢ 𝛼ᵢ·y(i)

But we have already found that the last term must be zero, so we obtain

W(𝛼) = Σᵢ 𝛼ᵢ − (1/2)·Σᵢ Σⱼ y(i)·y(j)·𝛼ᵢ·𝛼ⱼ·⟨x(i), x(j)⟩

TIRED OF THIS MIND-BLOWING STUFF! HUH!

DON’T WORRY, TAKE A CHILL PILL AND COME BACK. THERE IS STILL SOME WAY TO GO 😊 MACHINE LEARNING IS REALLY A BUMPY RIDE, SAME AS LIFE 😊.

So, after your return, just recall that we got to the equation above by minimizing the Lagrangian with respect to w and b. Putting this together with the constraint that the multipliers are non-negative (which we always had) and the constraint Σᵢ 𝛼ᵢ·y(i) = 0 obtained above, we get the following dual optimization problem:

maximize (over 𝛼):  W(𝛼) = Σᵢ 𝛼ᵢ − (1/2)·Σᵢ Σⱼ y(i)·y(j)·𝛼ᵢ·𝛼ⱼ·⟨x(i), x(j)⟩
subject to:  𝛼ᵢ ≥ 0,  i = 1, …, m,  and  Σᵢ 𝛼ᵢ·y(i) = 0

You should also be able to verify that the conditions required for p* = d* and the KKT conditions to hold are indeed satisfied in our optimization problem. Hence, we can solve the dual in lieu of solving the primal problem. Specifically, in the dual problem above, we have a maximization problem in which the parameters are the Lagrange multipliers. We’ll talk later about the specific algorithm that we’re going to use to solve the dual problem, but if we are indeed able to solve it (i.e., find the multipliers that maximize W subject to the constraints), then we can use the expression w = Σᵢ 𝛼ᵢ·y(i)·x(i) to find the optimal w as a function of the multipliers. Having found w*, by considering the primal problem it is also straightforward to find the optimal value of the intercept term b as

b* = −( max over i with y(i) = −1 of w*ᵀx(i) + min over i with y(i) = +1 of w*ᵀx(i) ) / 2

Check for yourself that this is correct 😊

Before moving on, let’s also take a more careful look at the equation expressing the optimal w in terms of the multipliers. Suppose we’ve fit our model’s parameters to a training set, and now wish to make a prediction at a new input point x. We would then calculate wᵀx + b, and predict y = 1 if and only if this quantity is greater than zero. Using the equation for w, this quantity can also be written as

wᵀx + b = ( Σᵢ 𝛼ᵢ·y(i)·x(i) )ᵀ x + b = Σᵢ 𝛼ᵢ·y(i)·⟨x(i), x⟩ + b

Hence, if we’ve found the multipliers, in order to make a prediction, we have to calculate a quantity that depends only on the inner product between x and the points in the training set. Moreover, we saw earlier that the multiplier values will all be zero except for the support vectors. Thus, many of the terms in the sum above will be zero, and we really need to find only the inner products between x and the support vectors (of which there is often only a small number) in order to make our prediction. By examining the dual form of the optimization problem, we gained significant insight into the structure of the problem and were also able to write the entire algorithm in terms of only inner products between input feature vectors.
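
In code, a prediction therefore needs nothing beyond the support vectors, their labels and multipliers, the intercept, and a way to compute inner products (or, as we will see next, kernel values). A minimal sketch, where the kernel choice and all names are illustrative, and finding the multipliers themselves is the training step discussed above:

import numpy as np

def linear_kernel(x, z):
    # plain inner product <x, z>; any valid kernel could be used here instead
    return np.dot(x, z)

def svm_predict(x_new, support_X, support_y, support_alpha, b, kernel=linear_kernel):
    # sign of sum_i alpha_i * y_i * K(x_i, x_new) + b, summed over support vectors only
    score = sum(a * yi * kernel(xi, x_new)
                for a, yi, xi in zip(support_alpha, support_y, support_X))
    return 1 if score + b >= 0 else -1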

In the next section, we will exploit this property to apply the kernels to our classification problem. The resulting algorithm, support vector machines, will be able to efficiently learn in very high dimensional spaces.

Well, in order to find the multipliers, just use an iterative optimization algorithm (the Sequential Minimal Optimization (SMO) algorithm is a popular choice for this dual), and you are done with the support vector machine as a linear classifier.

Kernels and Non-Linear SVM

Are you wondering how such a classifier can handle data that is not linearly separable? Let us discuss non-linear SVMs in this section.

We have already discussed the use of kernels in the first section. Now, let us look at the common types of kernels that can be used to map the input attributes to some high-dimensional feature space, in which the classification problem becomes linear.

For intuition, picture data points that are not linearly separable in 2D; once the input attributes are mapped to 3D, they become linearly separable.

Types of Kernels

1. Linear Kernel:

It is the most basic type of kernel: simply the ordinary inner product of the two input vectors, K(x, z) = xᵀz. It tends to work well when there are lots of features, since such data is often close to linearly separable, and linear kernels are faster to train than other kernels.

2. Polynomial Kernel:

It is a generalization of the linear kernel, K(x, z) = (xᵀz + c)^d, where d is the degree of the polynomial. In practice it is used less often than the linear and RBF kernels, as it can be less efficient and, without careful tuning, less accurate.

3. Gaussian Radial Basis Function (RBF):

It is one of the most widely used kernel functions in SVMs, defined as K(x, z) = exp(−||x − z||² / (2σ²)). It is usually chosen for non-linear data and is a sensible default when there is no prior knowledge about the data. All three kernels are sketched in code below.
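
For reference, here is a minimal sketch of the three kernels as plain NumPy functions; the constants c, degree and gamma are illustrative defaults that one would normally tune.

import numpy as np

def linear_kernel(x, z):
    # K(x, z) = x^T z
    return np.dot(x, z)

def polynomial_kernel(x, z, c=1.0, degree=3):
    # K(x, z) = (x^T z + c)^degree
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2), with gamma = 1 / (2 * sigma^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))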

Choosing the best kernel

At this point you are probably asking the obvious question: how do you decide which kernel function will work well for your dataset?

It totally depends on what problem you’re actually solving. If your data is linearly separable, without a second thought, go for a linear kernel as a linear kernel takes less training time when compared to other kernel functions.

Some use cases are mentioned below:

  1. The linear kernel is mostly preferred for text classification problems, as it performs well on large, high-dimensional datasets.
  2. The RBF (Gaussian) kernel implicitly maps the data into a very high-dimensional space and then looks for a linear separation there, which makes it a good default for non-linear problems.
  3. Polynomial kernels tend to give good results when the training data has been normalized.

Kernels and SVM

Now, it is quite simple. We have learned about the types of kernels, and we already know how kernels work: they implicitly map the input attributes to some high-dimensional feature space. Previously, we were training our SVM directly on the input attributes. But when the data is not linearly separable, we need to map it into a higher-dimensional space where it does become linearly separable. So we simply pick a kernel with a suitable feature mapping and replace every inner product ⟨x(i), x(j)⟩ in the dual problem (and in the prediction rule) with the kernel value K(x(i), x(j)).
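
To see how much the kernel choice matters in practice, here is a small sketch using scikit-learn's SVC on a toy dataset of concentric circles, which is not linearly separable in 2D; the dataset, hyperparameters and split are arbitrary choices for demonstration.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# concentric circles: a classic example of data that no line can separate in 2D
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_clf = SVC(kernel='linear').fit(X_train, y_train)
rbf_clf = SVC(kernel='rbf', gamma='scale').fit(X_train, y_train)

print('linear kernel accuracy:', linear_clf.score(X_test, y_test))  # poor on this data
print('RBF kernel accuracy:', rbf_clf.score(X_test, y_test))        # close to perfect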

Yay!! That’s it with the concepts about Support Vector Machines and Kernels.

Now let us code a linear SVM algorithm from scratch and train a classifier.

Time to Code!
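
Below is a minimal sketch of one way to train a linear SVM from scratch in Python: sub-gradient descent on the soft-margin (hinge-loss) objective, which for separable data and a large penalty C recovers essentially the maximum-margin classifier derived above. The toy data, hyperparameters and class/function names are illustrative choices.

import numpy as np

class LinearSVM:
    # Linear SVM trained by sub-gradient descent on the hinge-loss objective:
    #   0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w^T x_i + b))

    def __init__(self, C=10.0, lr=0.001, n_iters=2000):
        self.C, self.lr, self.n_iters = C, lr, n_iters

    def fit(self, X, y):
        m, n = X.shape
        self.w, self.b = np.zeros(n), 0.0
        for _ in range(self.n_iters):
            margins = y * (X @ self.w + self.b)
            violating = margins < 1     # points inside the margin or misclassified
            grad_w = self.w - self.C * (y[violating][:, None] * X[violating]).sum(axis=0)
            grad_b = -self.C * y[violating].sum()
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b
        return self

    def predict(self, X):
        return np.where(X @ self.w + self.b >= 0, 1, -1)

# toy, linearly separable 2D data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, size=(50, 2)),
               rng.normal([-2, -2], 0.5, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

clf = LinearSVM().fit(X, y)
print('w =', clf.w, ', b =', clf.b)
print('training accuracy =', (clf.predict(X) == y).mean())

If you run this, the learned w should point roughly along the direction separating the two clusters, and the training accuracy should be 1.0 on this toy data.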


Hope you had great fun reading and implementing the algorithm from scratch.

So, revise, code, and watch out for our next article!

Follow us on:

  1. Facebook
  2. Instagram
  3. Linkedin
