How to become a Machine Learning Superhero: 1. Support Vector Machines


Denisa Buzan


Key Takeaways

In a world where technology has pressed the 'full speed' button on scientific research involving state-of-the-art Machine Learning models, we think it is time to pay our respects to the veterans among ML algorithms, which have paved the way for progress.

Derived from the statistical learning theory of Vapnik et al. in 1992, the Support Vector Machines algorithm has enjoyed great popularity due to its data-driven, model-free nature. Its accuracy is comparable to that of neural networks on classification tasks, especially when we talk about small datasets.

When it comes to real-life applications, SVMs have been used for tasks such as face detection, bioinformatics, and handwriting recognition. In the healthcare field, the research ranges from anomaly detection on raw heartbeat audio, to diagnosing neurological disorders such as epilepsy and sleep disorders, to detecting stress levels from EEG signals.

Now that you are aware of the multitude of fields where SVMs can be used, we invite you to take a look at the mathematics behind them, because we should always understand an algorithm before blindly applying it. 

What is the mathematics 101 of support vector machines?

Before exploring the intricate depths of SVM’s mathematics, let’s take a look at some easy notions:

1. Lower bound

Let us recall the meaning of 'lower bound' by considering the following example: take a subset A of the real numbers, say A = {3, 4, 5}.

Examples of elements that are lower bounds for the subset A: 1, 2, or even 3, the first element of A.

2. Infimum

The element '3' can be named the infimum, or the greatest lower bound, because it is larger than any other lower bound.

3. Linear Discriminant Function

Having the following graphic, let us define the concepts of linear function, normal vector, and the norm of a vector:

  • g(x), a linear function with the following mathematical formula: g(x) = w^T x + b
  • w, the normal vector of the hyperplane w^T x + b = 0
  • ||w||, the norm of the vector w

By definition, the norm describes the length of a vector and can be calculated using the following formula:

||v|| = sqrt(v1^2 + v2^2 + ... + vn^2)

Let v ∈ ℝ^2; then ||v|| = sqrt(v1^2 + v2^2).

In geometry, this formula is similar to the Pythagorean theorem, as illustrated:
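The norm formula can be sketched in a few lines of Python; the helper function `norm` below is ours, written for illustration:

```python
import math

def norm(v):
    """Euclidean norm: sqrt(v1^2 + v2^2 + ... + vn^2)."""
    return math.sqrt(sum(c * c for c in v))

# The classic 3-4-5 right triangle, matching the Pythagorean theorem:
print(norm([3.0, 4.0]))  # → 5.0
```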

What is the best way to classify your data?

Now that we are all warmed up with the mathematical 101, let us consider the following classification problem, where each data point carries one of two labels: 'Earth' or 'Saturn'.

How many linear classification functions can be drawn on the following plot, in order to classify the points into two different categories?

At first glance, it looks like we could draw an infinite number of classification functions:

They all seem to do their job just fine, but still…how do we decide which one is the best out of the whole lot? 

Start working on your first superpower: Large Margin Classifier (SVM linear)

In order to find the most suitable classification function, the Support Vector Machine algorithm is going to be applied. Its superpower is given by a key concept: 'maximal margin'.

  • margin: the separation between the hyperplane and the closest data point.
  • maximal margin: the particular linear classification function for which the margin is maximized.

Let’s take a look at the mathematical theory of Support Vector Machines:

Having a set of data points {(xi, yi)}, i = 1, 2, ..., n, with labels yi ∈ {-1, +1}, we look for a separating hyperplane w^T x + b = 0.

By scaling w and b, we obtain w^T x + b = +1 on the margin of the positive class and w^T x + b = -1 on the margin of the negative class.


Disclaimer: Class 'Saturn' is encoded as +1

    Class 'Earth' is encoded as -1

Given the image above, we can formulate the following:

  1. The closest points to the decision boundary are known as 'support vectors'. They are noted as (x+, y+), (x-, y-)
  2. The margin width has to be maximized in order to have a classifier as permissive as possible. The mathematical formula for the margin width is M = 2 / ||w||.


In other words, the whole idea behind SVMs is to maximize M = 2 / ||w||, which is equivalent to minimizing ||w||. As described above, the norm involves a square root, and mathematically speaking, minimizing the square root is equivalent to minimizing the quantity inside it, so in practice we minimize (1/2)||w||^2.

This minimization is subject to a set of conditions: every point has to end up on the correct side of its class margin. They are the ones described in [4] and can be rewritten as follows:

yi (w^T xi + b) ≥ 1, for i = 1, ..., n
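These relations can be sanity-checked with scikit-learn on a toy dataset. The four points below are hypothetical stand-ins for the 'Earth'/'Saturn' plot, and C is set very large so the soft margin behaves like a hard one:

```python
import numpy as np
from sklearn.svm import SVC

# Two 'Earth' points (-1) at x1 = 0 and two 'Saturn' points (+1) at x1 = 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# After the scaling of w and b, support vectors sit on w.x + b = ±1.
margins = X[clf.support_] @ w + b
assert np.allclose(np.abs(margins), 1.0, atol=1e-3)

# The margin width M = 2 / ||w||; the classes are 2 apart, so M ≈ 2.
print(2 / np.linalg.norm(w))
```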


Should I minimize or should I maximize?

When studying SVMs, one of the biggest problems is to understand why minimization and maximization are included at the same time. The solution to this problem is given by the concept of 'duality'.

But before we get into more details, we will introduce the Lagrangian formulation.

What about Lagrangian formulation? 

When talking about optimization in the Support Vector Machines context, we should understand why Lagrange multipliers are the best available solution.

Lagrange multipliers

According to Wikipedia, we can give Lagrange multipliers the following definition:

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints

An important aspect highlighted in the definition above is the fact that Lagrange multipliers require equality constraints.

Now, every new notion is better explained through an example. 

Let us take the following optimization problem: minimize f(x, y) subject to the constraint g(x, y): x + y = 8/3.

In this case, f(x, y) is the function that needs to be optimized according to the g(x, y) constraint. The plots of the function in 3D and in 2D are the following:

It is important to mention that:

  • For each plot, the darker the color is, the smaller the values are.
  • The black arrows on the second plot (the 2D one) are the gradient vectors and they are perpendicular to the contour line.

The feasible set is defined by the line y=8/3-x. If it is plotted we obtain the following:

Now, according to the Lagrangian formulation in the example [6], the function f(x, y) is minimized, with g(x, y) as constraint, when the gradient of f(x, y) points in the same direction as the gradient of g(x, y).
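Since the original plots are not reproduced here, a minimal sketch of the example follows, assuming f(x, y) = x^2 + y^2 as the objective (this choice is ours, for illustration; the constraint line y = 8/3 - x is the one from the text):

```python
# Stationarity condition: grad f = λ * grad g
#   (2x, 2y) = λ * (1, 1)  =>  x = y = λ/2
# Substituting into the constraint x + y = 8/3 gives x = y = 4/3, λ = 8/3.
def f(x, y):
    return x**2 + y**2

x_star = y_star = 4 / 3

# Numerical sanity check: f at the stationary point is no larger than f at
# nearby feasible points (points still on the line y = 8/3 - x).
for dx in (-0.5, -0.1, 0.1, 0.5):
    x = x_star + dx
    y = 8 / 3 - x  # stay on the constraint line
    assert f(x_star, y_star) <= f(x, y)

print(f(x_star, y_star))  # minimum of f on the constraint line
```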

What does 'duality' refer to?

Now back to duality. In mathematical optimization theory, duality means that a minimization problem can also be viewed as a maximization one. The maximum of the dual problem becomes a lower bound for the minimization problem, and this lower bound is always less than or equal to the minimum of the minimization problem.


Let x be the design of a widget. The cost of producing it can be represented as a function f(x). In order to produce the widget, the company must respect the 'design constraints', noted as h(x) ≤ 0.

In order to maximize the profit, the company decides to outsource to another company. The outsourced company cannot charge more than f(x). So given a fixed 𝜆, it is important to find the design that minimizes f(x) +𝜆h(x). By varying 𝜆, the outsourced company tries to maximize its profits. 

Read more here about the mathematical description of the problem.

What about some smart words?

It’s the Primal Form of the Optimization Problem

This minimization can be solved with the help of the following Lagrangian function:

L(w, b, a) = (1/2)||w||^2 - Σi ai [yi (w^T xi + b) - 1], with ai ≥ 0

In order to solve it, we will minimize the function L on w and b, setting the corresponding partial derivatives equal to 0:

∂L/∂w = 0 ⇒ w = Σi ai yi xi
∂L/∂b = 0 ⇒ Σi ai yi = 0

If you want to find out more about the math behind Lagrange multipliers, we encourage you to check this article.

What is the Dual Form of the Optimization Problem?

By replacing w = Σi ai yi xi in the formula of the primal form, we obtain the following dual problem:

maximize Σi ai - (1/2) Σi Σj ai aj yi yj xi^T xj, subject to ai ≥ 0 and Σi ai yi = 0

From here, we can deduce that the optimization problem involves computing the dot products between all the points of the training set. Note that xi^T xj is actually a dot product.

Most of the ai will turn out to be equal to 0 (except those corresponding to support vectors).
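This sparsity can be observed with scikit-learn, which stores only the nonzero multipliers (as yi * ai, in the `dual_coef_` attribute). The blob dataset below is an assumption of ours, chosen for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of 100 points in total.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Every training point absent from dual_coef_ has a_i = 0;
# only the support vectors carry nonzero multipliers.
n_nonzero = clf.dual_coef_.shape[1]
print(n_nonzero, "of", len(X), "multipliers are nonzero")
```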

Let’s get to some action: Demo in Python

6 simple steps you can follow:

1. Import libraries and dataset


2. Select inputs and output

3. Apply a PCA algorithm

4. Split in train and test set

5. Train model

6. Evaluate model
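The six steps above can be sketched as follows. The article's original dataset and code are not shown here, so the classic Iris dataset stands in:

```python
# 1. Import libraries and dataset
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

data = load_iris()

# 2. Select inputs and output
X, y = data.data, data.target

# 3. Apply a PCA algorithm (reduce to 2 components, convenient for plotting)
X_2d = PCA(n_components=2).fit_transform(X)

# 4. Split in train and test set
X_train, X_test, y_train, y_test = train_test_split(
    X_2d, y, test_size=0.2, random_state=42)

# 5. Train model
model = SVC(kernel="linear").fit(X_train, y_train)

# 6. Evaluate model
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {acc:.2f}")
```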

Visualize the dataset before and after the classification:

Here is the before:

And here is the better-looking after:

Until next time, we’ll let you read more yourself. If you found this article useful for your journey of becoming a Machine Learning Superhero, feel free to share it with others. We hope you are a little more curious now. Curiosity can get you further than you imagine.

Follow Denisa, Victor, and the team at Linnify for more stories and articles about our favorite topics that make the world go round.


machine learning;vector;kernel;demo;python;code;data classification



Denisa Buzan

Denisa Buzan is a past Lead Data Scientist at Linnify.

She is truly passionate about Data Science, AI/ML software development, and bioengineering. Denisa also holds an MA in Applied Computational Intelligence.
