{"id":21217,"date":"2021-07-07T06:01:55","date_gmt":"2021-07-07T05:01:55","guid":{"rendered":"https:\/\/www.inovex.de\/blog\/?p=21217"},"modified":"2024-01-18T10:29:57","modified_gmt":"2024-01-18T09:29:57","slug":"support-vector-machines-guide","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/","title":{"rendered":"Extensive Guide to Support Vector Machines"},"content":{"rendered":"<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\">\n<div class=\"cell border-box-sizing text_cell rendered\">\n<p class=\"prompt input_prompt\">In this post we will talk about a fundamental model in machine learning \u2013 support vector machines (short: SVMs). We will take a detailed look at the theory behind SVMs and implement and train an SVM in plain Python. In particular, we will consider the two approaches for formulating SVMs: the primal and the dual approach. In the end, we will further consider the usage of kernel functions to apply SMVs to non-linear problems.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><!--more--><\/p>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#What-are-support-vector-machines\" >What are support vector machines?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Primal-approach-%E2%80%93-Hard-margin-SVM\" >Primal approach \u2013 Hard-margin SVM<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Goal-1\" >Goal 1<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Goal-2\" >Goal 2<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Combined-goal\" >Combined goal<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Optional-Deriving-the-margin-equation\" >(Optional) Deriving the margin equation<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Primal-approach-%E2%80%93-Soft-margin-SVM\" >Primal approach \u2013 Soft-margin SVM<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Solving-the-primal-optimization-problem\" >Solving the primal optimization problem<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Hinge-loss-function\" >Hinge loss function<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Updated-objective-function\" >Updated objective function<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Optional-Three-parts-of-the-objective-function\" >[Optional] Three parts of the objective function<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Sub-gradient-descent\" >Sub-gradient descent<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Optional-Difference-subgradient-descent-and-gradient-descent\" >[Optional] Difference subgradient descent and gradient descent<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#What-is-a-subgradient%C2%B6\" >What is a subgradient?\u00b6<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Subgradient-method\" >Subgradient method<\/a><\/li><\/ul><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Implementation-primal-approach\" >Implementation primal approach<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-17\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Toy-dataset\" >Toy dataset<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-18\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#SVM-class-definition\" >SVM class definition<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-19\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Training-and-testing-an-SVM\" >Training and testing an SVM<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-20\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#5Visualizing-the-decision-boundary\" >5Visualizing the decision boundary<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-21\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Dual-approach\" >Dual approach<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-22\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Recap-Lagrange-multipliers\" >Recap Lagrange multipliers<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-23\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Recap-Lagrangian\" >Recap Lagrangian<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-24\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Dual-optimization-problem\" >Dual optimization problem<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-25\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Primal-vs-dual-approach\" >Primal vs. dual approach<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-26\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Kernels-non-linear-SVM\" >Kernels \/ non-linear SVM<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-27\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#What-is-a-kernel\" >What is a kernel?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-28\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#What-are-kernels-good-for\" >What are kernels good for?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-29\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Example\" >Example<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-30\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Can-we-also-use-kernels-in-the-primal-SVM\" >Can we also use kernels in the primal SVM?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-31\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#How-do-I-choose-the-right-kernel-for-my-problem\" >How do I choose the right kernel for my problem?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-32\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#Sources-and-further-reading\" >Sources and further reading<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<h2 id=\"1.-What-are-support-vector-machines?-\"><span class=\"ez-toc-section\" id=\"What-are-support-vector-machines\"><\/span>What are support vector machines?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Support vector machines (SVMs) are supervised machine learning models. They are the most prominent member of the class of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Kernel_method\"><em>kernel methods<\/em><\/a>. SVMs can be used both for classification and regression. The original SVM proposed in 1963 is a simple binary linear classifier. What does this mean?<\/p>\n<p>Assume we are given a dataset \\(D = \\big \\{ \\mathbf{x}_n, y_n \\big \\}_{n=1}^N\\), where \\(\\pmb{x}_n \\in \\mathbb{R}^D\\) and labels \\(y_n \\in \\{-1, +1 \\}\\). A linear (hard-margin) SVM separates the two classes using a (\\(D-1\\) dimensional) hyperplane.<\/p>\n<p>Special to SVMs is that they use not any hyperplane but the one that maximizes the distance between itself and the two sets of datapoints. Such a hyperplane is called <em>maximum-margin<\/em> hyperplane:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21224\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/separating_hyperplanes.png\" alt=\"Two graphs, one showing possible separating hyperplanes, the other showing the maximum margin hyperplane\" width=\"815\" height=\"367\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/separating_hyperplanes.png 1554w, https:\/\/www.inovex.de\/wp-content\/uploads\/separating_hyperplanes-300x135.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/separating_hyperplanes-1024x461.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/separating_hyperplanes-768x346.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/separating_hyperplanes-1536x692.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/separating_hyperplanes-400x180.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/separating_hyperplanes-360x162.png 360w\" sizes=\"auto, (max-width: 815px) 100vw, 815px\" \/><\/p>\n<p>In case you have never heard the term margin: the margin describes the distance between the hyperplane and the closest examples in the dataset.<\/p>\n<p>Two types of SVMs exist: primals SVMs and dual SVMs. Although most research in the past looked into dual SVMs both can be used to perform non-linear classification. Therefore, we will look at both approaches and compare them in the end.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"2.-Primal-approach---Hard-margin-SVM--\"><span class=\"ez-toc-section\" id=\"Primal-approach-%E2%80%93-Hard-margin-SVM\"><\/span>Primal approach \u2013 Hard-margin SVM<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>When training an SVM our goal is to find the hyperplane that maximizes the margin between the two sets of points. This hyperplane is fully defined by the points closest to the margin, which are also called <em>support vectors<\/em>.<\/p>\n<p>The equation of a hyperplane is given by \\(\\langle \\pmb{w}, \\pmb{x} \\rangle + b = 0\\) , where \\(\\langle \\cdot, \\cdot \\rangle\\) denotes the inner product and \\(\\mathbf {w}\\) is the normal vector to the hyperplane. If an example \\(\\pmb{x}_i\\) lies on the right side of the hyperplane (that is, it has a positive label) we have \\(\\langle \\pmb{w}, \\pmb{x}_i \\rangle + b \\gt 0\\). If instead \\(\\pmb{x}_i\\) lies on the left side (= negative label) we have \\(\\langle \\pmb{w}, \\pmb{x}_i \\rangle + b \\lt 0\\).<\/p>\n<p>The support vectors lie exactly on the margin and the optimal separating hyperplane should have the same distance from all support vectors. In this sense the maximum margin hyperplane lies between two separating hyperplanes that are determined by the support vectors:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21218\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/delimiting_hyperplanes.png\" alt=\"Graph with one vector und two support vectors\" width=\"487\" height=\"428\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/delimiting_hyperplanes.png 892w, https:\/\/www.inovex.de\/wp-content\/uploads\/delimiting_hyperplanes-300x264.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/delimiting_hyperplanes-768x675.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/delimiting_hyperplanes-400x352.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/delimiting_hyperplanes-360x316.png 360w\" sizes=\"auto, (max-width: 487px) 100vw, 487px\" \/><\/p>\n<h3 id=\"Goal-1:\"><span class=\"ez-toc-section\" id=\"Goal-1\"><\/span>Goal 1<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>When deriving a formal equation for the maximum margin hyperplane we assume that the two delimiting hyperplanes are given by:<br \/>\n$$\\langle \\pmb{w}, \\pmb{x}_{+} \\rangle + b = +1$$<br \/>\n$$\\langle \\pmb{w}, \\pmb{x}_{-} \\rangle + b = -1$$<\/p>\n<p>In other words: we want our datapoints to lie at least a distance of 1 away from the decision hyperplane into both directions. To be more precise: for our positive examples (those with label \\(y_n = +1\\)) we want the following to hold: \\(\\langle \\pmb{w}, \\pmb{x}_n \\rangle + b \\ge +1\\).<\/p>\n<p>For our negative examples (those with label \\(y_n = -1\\)) we want the opposite: \\(\\langle \\pmb{w}, \\pmb{x}_n \\rangle + b \\le -1\\). This can be combined into a single equation: \\(y_n(\\langle \\pmb{w}, \\pmb{x}_n \\rangle + b) \\ge 1\\). This is our first goal: <strong>we want a decision boundary that classifies our training examples correctly.<\/strong><\/p>\n<h3 id=\"Goal-2:\"><span class=\"ez-toc-section\" id=\"Goal-2\"><\/span>Goal 2<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Our second goal is to maximize the margin of this decision boundary. The margin is given by \\(\\frac{1}{\\pmb{w}}\\). If you would like to understand where this value is coming from take a look at the section &#8222;<em>(Optional) Deriving the margin equation<\/em>&#8220; below.<\/p>\n<p>Our goal to maximize the margin can be expressed as follows:<br \/>\n$$ \\max_{\\pmb{w}, b} \\frac{1}{\\Vert \\pmb{w} \\Vert}$$<\/p>\n<p>Instead of maximizing \\(\\frac{1}{\\Vert \\pmb{w} \\Vert}\\) we can instead minimize \\(\\frac{1}{2} \\Vert \\pmb{w} \\Vert^2\\). This simplifies the computation of the gradient.<\/p>\n<h3 id=\"Combined-goal\"><span class=\"ez-toc-section\" id=\"Combined-goal\"><\/span>Combined goal<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Combining goal one and two yields the following objective function:<br \/>\n$$<br \/>\n\\min_{\\pmb{w}, b} \\frac{1}{2} \\Vert \\pmb{w} \\Vert^2 \\\\<br \/>\n\\text{subject to: } y_n(\\langle \\pmb{w}, \\pmb{x}_n \\rangle + b) \\ge 1 \\text{ for all } n = 1, &#8230;, N<br \/>\n$$<\/p>\n<p>In words: we want to find the values for \\(\\pmb{w}\\) and \\(b\\) that maximize the margin while classifying all training examples correctly. This approach is called the <em>hard-margin support vector machine<\/em>. &#8222;Hard&#8220; because it does not allow for violations of the margin requirement (= no points are allowed to be within the margin).<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"(Optional)-Deriving-the-margin-equation-\"><span class=\"ez-toc-section\" id=\"Optional-Deriving-the-margin-equation\"><\/span>(Optional) Deriving the margin equation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We can derive the width of the margin in several ways (see sections 12.2.1-12.2.2 of the <a href=\"https:\/\/mml-book.com\">Mathematics for Machine Learning book<\/a>). Personally, I found the explanation of <a href=\"https:\/\/www.youtube.com\/watch?v=_PwhiWxHK8o\">this MIT lecture on SVMs<\/a> easiest to understand.<\/p>\n<p>The derivation of the margin is based on the assumptions that we have already noted above:<br \/>\n$$\\langle \\pmb{w}, \\pmb{x}_{+} \\rangle + b = +1$$<br \/>\n$$\\langle \\pmb{w}, \\pmb{x}_{-} \\rangle + b = -1$$<\/p>\n<p>Including the label of each example we can rewrite this as<br \/>\n$$y_i (\\langle \\pmb{w}, \\pmb{x}_{i} \\rangle + b) -1 = 0$$<\/p>\n<p>Let&#8217;s say we have a positive example \\(\\pmb{x}_{+}\\) that lies on the right delimiting hyperplane and a negative example \\(\\pmb{x}_{-}\\) that lies on the left delimiting hyperplane. The distance between these two vectors is given by (\\(\\pmb{x}_{+} &#8211; \\pmb{x}_{-})\\). We want to compute the orthogonal projection of the vector onto the line that is perpendicular to the decision hyperplane. This would give us the width between the two delimiting hyperplanes. We can compute this by multiplying the vector (\\(\\pmb{x}_{+} &#8211; \\pmb{x}_{-})\\) with a vector that is perpendicular to the hyperplane. We know that the vector \\(\\pmb{w}\\) is perpendicular to the decision hyperplane. So we can compute the margin by multiplying \\((\\pmb{x}_{+} &#8211; \\pmb{x}_{-})\\) with the vector \\(\\pmb{w}\\) where the latter is divided by the scale \\(||\\pmb{w}||\\) to make it a unit vector.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21222\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/maximum_margin_derivation.png\" alt=\"Graph with formulas for the distance of the two showed vectors\" width=\"462\" height=\"414\" \/><\/p>\n<p>$$<br \/>\n\\begin{align}<br \/>\n\\text{width} &amp;= (\\pmb{x}_{+} &#8211; \\pmb{x}_{-}) \\cdot \\frac{\\pmb{w}}{||\\pmb{w}||} \\\\<br \/>\n&amp;= \\frac{\\pmb{x}_{+} \\cdot \\pmb{w}}{||\\pmb{w}||} &#8211; \\frac{\\pmb{x}_{-} \\cdot \\pmb{w}}{||\\pmb{w}||}<br \/>\n\\end{align}<br \/>\n$$<\/p>\n<p>For the positive example \\(\\pmb{x}_{+}\\) we have \\(y_+ = +1\\) and therefore \\((\\langle \\pmb{w}, \\pmb{x}_{+} \\rangle = 1 &#8211; b\\). For the negative example \\(\\pmb{x}_{-}\\) we have \\(y_- = -1\\) and therefore \\(&#8211; (\\langle \\pmb{w}, \\pmb{x}_{-} \\rangle) = 1 + b\\):<\/p>\n<p>$$<br \/>\n\\begin{align}<br \/>\n\\text{width} &amp;= (\\pmb{x}_{+} &#8211; \\pmb{x}_{-}) \\cdot \\frac{\\pmb{w}}{||\\pmb{w}||} \\\\<br \/>\n&amp;= \\frac{\\pmb{x}_{+} \\cdot \\pmb{w}}{||\\pmb{w}||} &#8211; \\frac{\\pmb{x}_{-} \\cdot \\pmb{w}}{||\\pmb{w}||} \\\\<br \/>\n&amp;= \\frac{(1 &#8211; b) + (1 + b)}{||\\pmb{w}||}\\\\<br \/>\n&amp;= \\frac{2}{||\\pmb{w}||}\\\\<br \/>\n\\end{align}<br \/>\n$$<\/p>\n<p>We conclude that the width between the two delimiting hyperplanes equals \\(\\frac{2}{\\pmb{w}}\\). And therefore, that the distance between the decision hyperplane and each delimiting hyperplane is \\(\\frac{1}{\\pmb{w}}\\).<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"3.-Primal-approach---Soft-margin-SVM-\"><span class=\"ez-toc-section\" id=\"Primal-approach-%E2%80%93-Soft-margin-SVM\"><\/span>Primal approach \u2013 Soft-margin SVM<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In most real-world situations the available data is not linearly separable. Even if it is, we might prefer a solution which separates the data well while ignoring some noisy examples and outliers. This motivated an extension of the original hard-margin SVM called <em>soft-margin SVM<\/em>.<\/p>\n<p>A soft-margin SVM allows for violations of the margin requirement (= classification errors). In other words: not all training examples need to be perfectly classified. They might fall within the margin or even lie on the wrong side of the decision hyperplane. However, such violations are not for free. We pay a cost for each violation, where the value of the cost depends on how far the example is from meeting the margin requirement.<\/p>\n<p>To implement this, we introduce so called <em>slack variables<\/em> \\(\\xi_n\\). Each training example \\((\\pmb{x}_n, y_n)\\) is assigned a slack variable \\(\\xi_n \\ge 0\\). The slack variable allows this example to be within the margin or even on the wrong side of the decision hyperplane:<\/p>\n<ul>\n<li>If \\(\\xi_n = 0\\) the training example \\((\\pmb{x}_n, y_n)\\) lies exactly on the margin<\/li>\n<li>\\(0 \\lt \\xi_n \\lt 1\\) the training example lies within the margin but on the correct side of the decision hyperplane<\/li>\n<li>\\(\\xi_n \\ge 1\\) the training example lies on the wrong side of the decision hyperplane<\/li>\n<\/ul>\n<p>We extend our objective function to include the slack variables as follows:<br \/>\n$$ \\min_{\\pmb{w}, b, \\pmb{\\xi}} \\frac{1}{2} \\Vert \\pmb{w} \\Vert^2 + C \\sum_{n=1}^N \\xi_n $$<\/p>\n<p>$$ \\text{subject to:} $$<\/p>\n<p>$$ y_n(\\langle \\pmb{w}, \\pmb{x}_n \\rangle + b) \\ge 1 &#8211; \\xi_n $$$$ \\xi_i \\ge 0 \\text{ for all } n = 1, &#8230;, N $$<\/p>\n<p>The parameter \\(C\\) is a regularization term that controls the trade-off between maximizing the margin and minimizing the training error (which in turn means classifying all training examples correctly). If the value of \\(C\\) is small, we care more about maximizing the margin than classifying all points correctly. If the value of \\(C\\) is large, we care more about classifying all points correctly than maximizing the margin.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"4.-Solving-the-primal-optimization-problem-\"><span class=\"ez-toc-section\" id=\"Solving-the-primal-optimization-problem\"><\/span>Solving the primal optimization problem <a id=\"solving-primal-svm\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Theoretically, the primal SVM can be solved in multiple ways. The most well known way is to use the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Hinge_loss\">hinge loss function<\/a> together with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Subgradient_method\">subgradient descent<\/a>.<\/p>\n<h3 id=\"4.1-Hinge-loss-function-\"><span class=\"ez-toc-section\" id=\"Hinge-loss-function\"><\/span>Hinge loss function<a id=\"hinge-loss\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The hinge loss function, given the true target \\(y \\in \\{-1, +1\\}\\) and the prediction \\(f(\\pmb{x}) = \\langle\\pmb{w}, \\pmb{x}\\rangle+b\\), is computed as follows:<\/p>\n<p>$$\\ell(t)=\\max \\{0,1-t\\} \\quad \\text{where} \\quad t=y \\cdot f(\\pmb{x})= y \\cdot \\big(\\langle\\pmb{w}, \\pmb{x}\\rangle+b\\big)$$<\/p>\n<p>Let&#8217;s understand the output of this loss function with a few examples:<\/p>\n<ul>\n<li>If a training example has label \\(y = -1\\) and the prediction is on the correct side of the marghin (that is, \\(f(\\pmb{x}) \\le -1\\)), the value of \\(t\\) is larger or equal to \\(+1\\). Therefore, the hinge loss will be zero (\\(\\ell(t) = 0\\))<\/li>\n<li>The same holds if a training example has label \\(y = 1\\) and the prediction is on the correct side of the margin (that is, \\(f(\\pmb{x}) \\ge 1\\))<\/li>\n<li>If a training example (\\(y = 1\\)) is on the correct side of the decision hyperplane but lies within the margin (that is, \\(0 \\lt f(\\pmb{x}) \\lt 1\\)) the hinge loss will output a positive value.<\/li>\n<li>If a training example (\\(y = 1\\)) is on the wrong side of the decision hyperplane (that is, \\(f(\\pmb{x}) \\lt 0\\)), the hinge loss returns an even larger value. This value increases linearly with the distance from the decision hyperplane<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21221 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/hinge_loss.png\" alt=\"Hinge loss Function\" width=\"640\" height=\"480\" \/><\/p>\n<h3 id=\"4.2-Updated-objective-function-\"><span class=\"ez-toc-section\" id=\"Updated-objective-function\"><\/span>Updated objective function <a id=\"updated-objective-function\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Using the hinge loss, we can reformulate the optimization problem of the primal soft-margin SVM. Given a dataset \\(D = \\big \\{ \\mathbf{x}_n, y_n \\big \\}_{n=1}^N\\) we would like to minimize the total loss which is now given by:<\/p>\n<p>$$<br \/>\n\\min _{\\pmb{w}, b} \\frac{1}{2}\\|\\pmb{w}\\|^{2} + C \\sum_{n=1}^{N} \\max \\left\\{0,1-y_{n}\\left(\\left\\langle\\pmb{w}, \\pmb{x}_{n}\\right\\rangle+b\\right)\\right\\}<br \/>\n$$<\/p>\n<p>If you would like to understand why this is equivalent to our previous formulation of the soft-margin SVM, please take a look at chapter 12.2.5 of the <a href=\"https:\/\/mml-book.com\">Mathematics for Machine Learning book<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"[Optional]-Three-parts-of-the-objective-function-\"><span class=\"ez-toc-section\" id=\"Optional-Three-parts-of-the-objective-function\"><\/span>[Optional] Three parts of the objective function <a id=\"two-parts\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Our objective function can be divided into three distinct parts:<\/p>\n<p>Part 1: \\(\\frac{1}{2}\\|\\pmb{w}\\|^{2}\\)<\/p>\n<p>This part is also called the <em>regularization term<\/em>. It expresses a preference for solutions that separate the datapoints well, thereby maximizing the margin. In theory, we could replace this term by a different regularization term that expresses a different preference.<\/p>\n<p>Part 2: \\(\\sum_{n=1}^{N} \\max \\left\\{0,1-y_{n}\\left(\\left\\langle\\pmb{w}, \\pmb{x}_{n}\\right\\rangle+b\\right)\\right\\}\\)<\/p>\n<p>This part is also called the <em>empirical loss<\/em>. In our case it is the hinge loss which penalizes solutions that make mistakes when classifying the training examples. In theory, this term could be replaced with another loss function that expresses a different preference.<\/p>\n<p>Part 3: The hyperparameter \\(C\\) that controls the tradeoff between a large margin and a small hinge loss.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"4.3-Sub-gradient-descent-\"><span class=\"ez-toc-section\" id=\"Sub-gradient-descent\"><\/span>Sub-gradient descent <a id=\"subgradient-descent\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The hinge loss function is not differentiable (namely at the point \\(t=1\\)). Therefore, we cannot compute the gradient right away. However, we can use a method called <a href=\"https:\/\/en.wikipedia.org\/wiki\/Subgradient_method\">subgradient descent<\/a> to solve our optimization problem. To simplify the derivation we will adapt two things:<\/p>\n<ol>\n<li>We assume that the bias \\(b\\) is contained in our weight vector as the first entry \\(w_0\\), that is \\(\\pmb{w} = [b, w_1, &#8230;, w_D]\\)<\/li>\n<li>We divide the hinge loss by the number of samples<\/li>\n<\/ol>\n<p>Our cost function is then given by:<br \/>\n$$<br \/>\nJ(\\pmb{w}) = \\frac{1}{2}\\|\\pmb{w}\\|^{2} + C \\frac{1}{N} \\sum_{n=1}^{N} \\max \\left\\{0,1-y_{n}\\left(\\left\\langle\\pmb{w}, \\pmb{x}_{n}\\right\\rangle\\right)\\right\\}<br \/>\n$$<\/p>\n<p>We will reformulate this to simplify computing the gradient:<br \/>\n$$<br \/>\nJ(\\pmb{w}) = \\frac{1}{N} \\sum_{n=1}^{N} \\Big[ \\frac{1}{2}\\|\\pmb{w}\\|^{2} + C \\max \\left\\{0,1-y_{n}\\left(\\left\\langle\\pmb{w}, \\pmb{x}_{n}\\right\\rangle\\right)\\right\\}\\Big]<br \/>\n$$<\/p>\n<p>The gradient is given by:<br \/>\n$$<br \/>\n\\nabla_{w} J(\\pmb{w}) = \\frac{1}{N} \\sum_{n=1}^N \\left\\{\\begin{array}{ll}<br \/>\n\\mathbf{w} &amp; \\text{if} \\max \\left(0,1-y_{n} \\left(\\langle \\pmb{w}, \\pmb{x}_{n} \\rangle \\right)\\right)=0 \\\\<br \/>\n\\mathbf{w}-C y_{n} \\pmb{x}_{n} &amp; \\text { otherwise }<br \/>\n\\end{array}\\right.<br \/>\n$$<\/p>\n<p>With this formula we can apply stochastic gradient descent to solve the optimization problem.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"[Optional]-Difference-subgradient-descent-and-gradient-descent-\"><span class=\"ez-toc-section\" id=\"Optional-Difference-subgradient-descent-and-gradient-descent\"><\/span>[Optional] Difference subgradient descent and gradient descent <a id=\"subgradient-descent=vs-gradient-descent\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The subgradient method allows us to minimize a non-differentiable convex function. Although looking similar to gradient descent the method has several important differences.<\/p>\n<h4 id=\"What-is-a-subgradient?\"><span class=\"ez-toc-section\" id=\"What-is-a-subgradient%C2%B6\"><\/span>What is a subgradient?<a class=\"anchor-link\" href=\"#What-is-a-subgradient?\">\u00b6<\/a><span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>A subgradient can be described as a generalization of gradients to non-differentiable functions. Informally, a sub-tangent at a point is any line that lies below the function at the point. The subgradient is the slope of this line. Formally, the subgradient of a convex function \\(f\\) at \\(w_0\\) is defined as all vectors \\(g\\) such that for any other point \\(w\\)<\/p>\n<p>$$ f(w) &#8211; f(w_0) \\ge g \\cdot (w &#8211; w_0) $$<\/p>\n<p>If \\(f\\) is differentiable at \\(w_0\\), the subgradient contains only one vector which equals the gradient \\(\\nabla f(w_0)\\). If, however, \\(f\\) is not differentiable, there may be several values for \\(g\\) that satisfy this inequality. This is illustrated in the figure below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21220\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/gradient_vs_subgradient.png\" alt=\"Two graphs showing gradient vs subgradient\" width=\"812\" height=\"357\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/gradient_vs_subgradient.png 1254w, https:\/\/www.inovex.de\/wp-content\/uploads\/gradient_vs_subgradient-300x132.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/gradient_vs_subgradient-1024x450.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/gradient_vs_subgradient-768x337.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/gradient_vs_subgradient-400x176.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/gradient_vs_subgradient-360x158.png 360w\" sizes=\"auto, (max-width: 812px) 100vw, 812px\" \/><\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h4 id=\"Subgradient-method\"><span class=\"ez-toc-section\" id=\"Subgradient-method\"><\/span>Subgradient method<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>To minimize the objective function \\(f\\), the subgradient method uses the following update formula for iteration \\(k+1\\):<\/p>\n<p>$$ w^{(k+1)} = w^{(k)} &#8211; \\alpha_k g^{(k)}$$<\/p>\n<p>Where \\(g^{(k)}\\) is <em>any<\/em> subgradient of \\(f\\) at \\(w^{(k)}\\) and \\(\\alpha_k\\) is the \\(k\\)-th step size. Thus, at each iteration, we make a step into the direction of the negative subgradient. When \\(f\\) is differentiable, \\(g^{(k)}\\) equals the gradient \\(\\nabla f(x^{(k)})\\) and the method reduces to the standard gradient descent method.<\/p>\n<p>More details on the subgradient method can be found <a href=\"https:\/\/web.stanford.edu\/class\/ee364b\/lectures\/subgrad_method_notes.pdf\">here<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"5.-Implementation-primal-approach-\"><span class=\"ez-toc-section\" id=\"Implementation-primal-approach\"><\/span>Implementation primal approach <a id=\"primal-implementation\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3 id=\"5.1-Toy-dataset-\"><span class=\"ez-toc-section\" id=\"Toy-dataset\"><\/span>Toy dataset <a id=\"toy-dataset\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>To implement what we have learned about primal SVMs, we first have to generate a dataset. In the cell below we create a simple dataset with two features and labels +1 and -1. We further split the dataset into a test and train set.<\/p>\n<pre class=\"lang:python decode:true \">from sklearn.datasets import make_blobs\r\nfrom sklearn.model_selection import train_test_split\r\nimport matplotlib.pyplot as plt\r\n\r\nfeatures, targets = make_blobs(n_samples=600, centers=2, n_features=2, random_state=42)\r\n\r\n# The function outputs targets 0 and 1 so we need to convert targets 0 to -1\r\ntransformed_targets = [-1 if t == 0 else +1 for t in targets]\r\n        \r\n# Plot the dataset\r\nplt.figure(figsize=(8, 6))\r\nplt.scatter(features[:, 0], features[:, 1], c = transformed_targets)\r\nplt.title(\"Toy dataset\")\r\nplt.ylabel(\"Feature 2\")\r\nplt.xlabel(\"Feature 1\")\r\nplt.show()\r\n\r\n# Split data into training and test set\r\nfeatures_train, features_test, labels_train, labels_test = train_test_split(features, \r\n                                                                              transformed_targets, \r\n                                                                              test_size = 0.3)<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21227 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/dataset.png\" alt=\"Toy data set with Feature 2 on the y-axis and feature 1 on the x-axis\" width=\"576\" height=\"432\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/dataset.png 576w, https:\/\/www.inovex.de\/wp-content\/uploads\/dataset-300x225.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/dataset-400x300.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/dataset-360x270.png 360w\" sizes=\"auto, (max-width: 576px) 100vw, 576px\" \/><\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"5.2-SVM-class-definition-\"><span class=\"ez-toc-section\" id=\"SVM-class-definition\"><\/span>SVM class definition <a id=\"primal-svm-class\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Next, we would like to implement an SVM class. We will use the knowledge we already acquired:<\/p>\n<ol>\n<li>Our objective function using the hinge loss function is given by:<br \/>\n$$<br \/>\nJ(\\pmb{w}) = \\frac{1}{2}\\|\\pmb{w}\\|^{2} + C \\frac{1}{N} \\sum_{n=1}^{N} \\max \\left\\{0,1-y_{n}\\left(\\left\\langle\\pmb{w}, \\pmb{x}_{n}\\right\\rangle\\right)\\right\\}<br \/>\n$$<\/li>\n<li>We can minimize this function by computing the gradient:<br \/>\n$$<br \/>\n\\nabla_{w} J(\\pmb{w}) = \\frac{1}{N} \\sum_{n=1}^N \\left\\{\\begin{array}{ll}<br \/>\n\\mathbf{w} &amp; \\text{if} \\max \\left(0,1-y_{n} \\left(\\langle \\pmb{w}, \\pmb{x}_{n} \\rangle \\right)\\right)=0 \\\\<br \/>\n\\mathbf{w}-C y_{n} \\pmb{x}_{n} &amp; \\text { otherwise }<br \/>\n\\end{array}\\right.<br \/>\n$$<\/li>\n<li>Given the gradient we use stochastic gradient descent to train our model<\/li>\n<li>After training our model we can make predictions using the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sign_function\">sign function<\/a><\/li>\n<\/ol>\n<p>As mentioned previously, we will assume that the bias \\(b\\) is contained in our weight vector as the first entry \\(w_0\\), that is \\(\\pmb{w} = [b, w_1, &#8230;, w_D] = \\pmb{w} = [w_0, w_1, &#8230;, w_D]\\).<\/p>\n<pre class=\"lang:python decode:true \">import numpy as np\r\nfrom sklearn.utils import shuffle\r\n\r\nclass LinearSVM:\r\n    \r\n    def __init__(self, regularization_param):\r\n        \"\"\"\r\n        Initialize the model by setting the regularization parameter \r\n        and a boolean variable for our trained weights.\r\n        \"\"\"\r\n        self.regularization_param = regularization_param\r\n        self.trained_weights = None\r\n    \r\n    def add_bias_term(self, features):\r\n        \"\"\"\r\n        Add intercept 1 to each training example for bias b\r\n        \"\"\"\r\n        n_samples = features.shape[0]\r\n        ones = np.ones((n_samples, 1))\r\n        return np.concatenate((ones, features), axis=1)\r\n    \r\n    def compute_cost(self, weights, features, labels) -&gt; float:\r\n        \"\"\"\r\n        Compute the value of the cost function\r\n        \"\"\"\r\n        n_samples = features.shape[0]\r\n        \r\n        # Compute hinge loss \r\n        predictions = np.dot(features, weights).flatten()\r\n        distances = 1 - labels * predictions\r\n        hinge_losses = np.maximum(0, distances)\r\n\r\n        # Compute sum of the individual hinge losses\r\n        sum_hinge_loss = np.sum(hinge_losses) \/ n_samples\r\n\r\n        # Compute entire cost\r\n        cost = (1 \/ 2) * np.dot(weights.T, weights) + self.regularization_param * sum_hinge_loss\r\n        \r\n        return float(cost)\r\n    \r\n    def compute_gradient(self, weights, features, labels) -&gt; np.ndarray:\r\n        \"\"\"\r\n        Compute the gradient, needed for training\r\n        \"\"\"\r\n        predictions = np.dot(features, weights)\r\n        distances = 1 - labels * predictions\r\n        n_samples, n_feat = features.shape\r\n        sub_gradients = np.zeros((1, n_feat))\r\n\r\n        for idx, dist in enumerate(distances):\r\n            if max(0, dist) == 0:\r\n                sub_gradients += weights.T\r\n            else:\r\n                sub_grad = weights.T - (self.regularization_param * features[idx] * labels[idx])\r\n                sub_gradients += sub_grad\r\n                            \r\n        # Sum up and divide by the number of samples\r\n        avg_gradient = sum(sub_gradients) \/ len(labels)\r\n\r\n        return avg_gradient\r\n    \r\n    def train(self, train_features, train_labels, n_epochs, learning_rate=0.01, batch_size=1):\r\n        \"\"\"\r\n        Train the model with stochastic gradient descent using the\r\n        specified number of epochs, learning rate and batch size.\r\n        \"\"\"\r\n        # Add bias term to features\r\n        train_features = self.add_bias_term(train_features)\r\n        \r\n        # Initalize weight vector\r\n        n_samples, n_feat = train_features.shape\r\n        weights = np.zeros(n_feat)[:, np.newaxis]\r\n        \r\n        # Train the model for a certain number of epochs\r\n        for epoch in range(n_epochs):\r\n            features, labels = shuffle(train_features, train_labels)\r\n            features, labels = train_features, train_labels\r\n            start, end = 0, batch_size\r\n            while end &lt;= len(labels): # Training loop over the dataset\r\n                batch = features[start:end]\r\n                batch_labels = labels[start:end]\r\n                \r\n                grad = self.compute_gradient(weights, batch, batch_labels)\r\n                update = (learning_rate * grad)[:, np.newaxis]\r\n                weights = weights - update\r\n                start, end = end, end + batch_size\r\n                \r\n            current_cost = self.compute_cost(weights, features, labels)\r\n            print(f\"Epoch {epoch + 1}, cost: {current_cost}\")\r\n                \r\n        # Set the trained weights to allow making predictions\r\n        self.trained_weights = weights\r\n\r\n    def predict(self, test_features) -&gt; np.ndarray:\r\n        \"\"\"\r\n        Predict labels for new test features.\r\n        Raises ValueError if model has not been trained yet.\r\n        \"\"\"\r\n        test_features = self.add_bias_term(test_features)\r\n        if self.trained_weights is None:\r\n            raise ValueError(\"You haven't trained the SVM yet!\")\r\n            \r\n        predicted_labels = []\r\n        n_samples = test_features.shape[0]\r\n        for idx in range(n_samples):\r\n            prediction = np.sign(np.dot(self.trained_weights.T, test_features[idx]))\r\n            predicted_labels.append(prediction)\r\n            \r\n        return np.array(predicted_labels)<\/pre>\n<pre class=\"lang:python decode:true \"># Compute some values to make sure the cost is computed correctly\r\n# I calculated the values for this example by hand first\r\nsvm = LinearSVM(regularization_param=1)\r\nweights = np.array([1, 2])[:, np.newaxis]\r\nfeatures = np.array([[0.5], [2.5]])\r\nnew_features = svm.add_bias_term(features)\r\n\r\nlabels = np.array([-1, +1])\r\nassert svm.compute_cost(weights, new_features, labels) == 4.\r\ngradient = svm.compute_gradient(weights, new_features, labels[:, np.newaxis])<\/pre>\n<p>&nbsp;<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\"><\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"5.3-Training-and-testing-an-SVM-\"><span class=\"ez-toc-section\" id=\"Training-and-testing-an-SVM\"><\/span>Training and testing an SVM <a id=\"train-test-svm\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>After defining our SVM class, we can train a model and test it on unseen examples.<\/p>\n<pre class=\"lang:python decode:true\">regularization_param = 100\r\nlr = 0.000001\r\nsvm = LinearSVM(regularization_param)\r\ntrained_weights = svm.train(features_train, labels_train, n_epochs=10, learning_rate=lr)<\/pre>\n<div class=\"output_subarea output_text output_stream output_stdout\">\n<pre class=\"\">Epoch 1, cost: 28.271046694444802\r\nEpoch 2, cost: 8.08309520023087\r\nEpoch 3, cost: 3.7258493516236966\r\nEpoch 4, cost: 2.3694576692644267\r\nEpoch 5, cost: 1.8114365548737537\r\nEpoch 6, cost: 1.4913665755694478\r\nEpoch 7, cost: 1.267043029568673\r\nEpoch 8, cost: 1.099125323933344\r\nEpoch 9, cost: 0.9904881031101147\r\nEpoch 10, cost: 0.9025115077980943\r\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<pre class=\"lang:python decode:true\"># Predict lables for unknown test samples\r\nfrom sklearn.metrics import accuracy_score, recall_score, precision_score\r\n\r\npredicted_labels = svm.predict(features_test)\r\npredicted_labels = predicted_labels.flatten()\r\n\r\nprint(\"Accuracy on test dataset: {}\".format(accuracy_score(labels_test, predicted_labels)))\r\nprint(\"Recall on test dataset: {}\".format(recall_score(labels_test, predicted_labels)))\r\nprint(\"Precision on test dataset: {}\".format(recall_score(labels_test, predicted_labels)))    \r\n\r\n&gt;&gt;&gt; Accuracy on test dataset: 1.0\r\n&gt;&gt;&gt; Recall on test dataset: 1.0 \r\n&gt;&gt;&gt; Precision on test dataset: 1.0<\/pre>\n<p>&nbsp;<\/p>\n<h3 id=\"5.4-Visualizing-the-decision-boundary-\"><span class=\"ez-toc-section\" id=\"5Visualizing-the-decision-boundary\"><\/span>5Visualizing the decision boundary<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Given our trained model, we can visualize the decision boundary, as done below.<\/p>\n<pre class=\"lang:python decode:true \">import numpy as np\r\n\r\n# Create dataset for visualization\r\nsize=40000\r\nfeat_1 = np.random.uniform(low=-7, high=8, size=size)\r\nfeat_2 = np.random.uniform(low=-2, high=14, size=size)\r\nfeatures_vis = np.column_stack((feat_1, feat_2))\r\n\r\nlabels_vis = svm.predict(features_vis)\r\n\r\n# Plot the decision boundary\r\nplt.figure(figsize=(7, 5))\r\nplt.scatter(features_vis[:, 0], features_vis[:, 1], c = labels_vis)\r\n\r\n# Plot original dataset\r\npositive_samples = [idx for idx in range(len(transformed_data_targets)) if transformed_data_targets[idx] == +1]\r\nnegative_samples = [idx for idx in range(len(transformed_data_targets)) if transformed_data_targets[idx] == -1]\r\nplt.scatter(data_features[positive_samples, 0],\r\n            data_features[positive_samples, 1],\r\n            c=\"yellow\", label=\"Original positive samples\")\r\nplt.scatter(data_features[negative_samples, 0],\r\n            data_features[negative_samples, 1],\r\n            c=\"purple\", label=\"Original negative samples\")\r\n\r\nplt.title(\"SVM decision boundary\")\r\nplt.ylabel(\"Feature 2\")\r\nplt.xlabel(\"Feature 1\")\r\nplt.legend(loc=2)\r\nplt.show()<\/pre>\n<p>&nbsp;<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing code_cell rendered\">\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\u00a0<img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21385 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/svm_decision_boundary-2.png\" alt=\"SVM decision boundary shown on a graph. Original negative samples on the left and Original positive samples one the right. \" width=\"457\" height=\"328\" \/><\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"6.-Dual-approach-\"><span class=\"ez-toc-section\" id=\"Dual-approach\"><\/span>Dual approach <a id=\"dual-approach\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In the previous sections we took a detailed look at the primal SVM. To solve a primal SVM, we need to find the best values for the weights and bias. Recall that our input examples \\(\\pmb{x} \\in \\mathbb{R}^D\\) have \\(D\\) features. Consequently, our weights \\(\\pmb{w}\\) have \\(D\\) features, too. This can become problematic if the number of features \\(D\\) is large.<\/p>\n<p>That is where the second way of formalizing SVMs (called <em>dual approach<\/em>) comes in handy. The optimization problem of the dual approach is independent of the number of features. Instead, the number of parameters increases with the number of examples in the training set.<\/p>\n<p>The dual approach uses the method of Lagrange multipliers. Lagrange multipliers allow us to find the minimum or maximum of a function if there are one or more constraints on the input values we are allowed to used.<\/p>\n<p>If you never heard of Lagrange multipliers I can recommend the <a href=\"https:\/\/www.khanacademy.org\/math\/multivariable-calculus\/applications-of-multivariable-derivatives\/constrained-optimization\/a\/lagrange-multipliers-single-constraint\">blog posts<\/a> and <a href=\"https:\/\/www.youtube.com\/watch?v=yuqB-d5MjZA&amp;list=PLCg2-CTYVrQvNGLbd-FN70UxWZSeKP4wV&amp;index=1\">video tutorials<\/a> on the topic from Khan Academy.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"6.1-Recap-Lagrange-multipliers-\"><span class=\"ez-toc-section\" id=\"Recap-Lagrange-multipliers\"><\/span>Recap Lagrange multipliers <a id=\"recap-lagrange-multipliers\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Lagrange multipliers allow us to solve constrained optimization problems. Let&#8217;s say we want to maximize the function \\(f(x, y) = 2x + y\\) under the constraint that our values of \\(x\\) and \\(y\\) satisfy the following equation: \\(g(x, y) := x^2 + y^2 = 1\\). This constraint equation describes a circle of radius 1.<\/p>\n<p>The key insight behind the solution to this problem is that we need to find those values for \\(x\\) and \\(y\\) where the gradients of \\(f\\) and \\(g\\) are aligned. This can be expressed using a Lagrange multiplier (typically \\(\\lambda\\)):<\/p>\n<p>We want to find those values \\(x_m, y_m\\) where \\(\\nabla f(x_m, y_m) = \\lambda \\nabla g(x_m, y_m)\\).<\/p>\n<p>In our example the gradient vectors look as follows:<br \/>\n$$ \\nabla f(x, y)=\\left[\\begin{array}{c}<br \/>\n\\frac{\\partial}{\\partial x}(2 x+y) \\\\<br \/>\n\\frac{\\partial}{\\partial y}(2 x+y)<br \/>\n\\end{array}\\right]=\\left[\\begin{array}{c}<br \/>\n2 \\\\ 1<br \/>\n\\end{array}\\right]$$<\/p>\n<p>$$ \\nabla g(x, y)=\\left[\\begin{array}{c}<br \/>\n\\frac{\\partial}{\\partial x}\\left(x^{2}+y^{2}-1\\right) \\\\<br \/>\n\\frac{\\partial}{\\partial y}\\left(x^{2}+y^{2}-1\\right)<br \/>\n\\end{array}\\right]=\\left[\\begin{array}{c}<br \/>\n2 x \\\\ 2 y<br \/>\n\\end{array}\\right]$$<\/p>\n<p>Therefore, the tangency condition results in<br \/>\n$$ \\left[\\begin{array}{l}<br \/>\n2 \\\\ 1<br \/>\n\\end{array}\\right]=\\lambda \\left[\\begin{array}{l}<br \/>\n2 x_{m} \\\\ 2 y_{m}<br \/>\n\\end{array}\\right] $$<\/p>\n<p>We can rewrite the vector form into individual equations that can be solved by hand:<\/p>\n<ul>\n<li>\\(2 = \\lambda 2 x_m \\)<\/li>\n<li>\\(1 = \\lambda 2 y_m \\)<\/li>\n<li>\\(x_m^2 + y_m^2 = 1 \\)<\/li>\n<\/ul>\n<p>Solving the equations yields<br \/>\n$$ \\begin{aligned}<br \/>\n\\left(x_{0}, y_{0}\\right) &amp;=\\left(\\frac{1}{\\lambda_{0}}, \\frac{1}{2 \\lambda_{0}}\\right) \\\\<br \/>\n&amp;=\\left(\\frac{2}{\\sqrt{5}}, \\frac{1}{\\sqrt{5}}\\right) \\quad \\text { or } \\quad\\left(\\frac{-2}{\\sqrt{5}}, \\frac{-1}{\\sqrt{5}}\\right)<br \/>\n\\end{aligned} $$<\/p>\n<p>where the first point denotes a maximum (what we wanted to find) and the second a minimum. This solves our constrained optimization problem. For more details and a full solution look at <a href=\"https:\/\/www.khanacademy.org\/math\/multivariable-calculus\/applications-of-multivariable-derivatives\/constrained-optimization\/a\/lagrange-multipliers-single-constraint\">this Khan academy post<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"6.2-Recap-Lagrangian--\"><span class=\"ez-toc-section\" id=\"Recap-Lagrangian\"><\/span>Recap Lagrangian <a id=\"recap-lagrangian\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The Lagrangian is a way to repackage the individual conditions of our constrained optimization problem into a single equation. In the example above, we wanted to optimize some function \\(f(x, y)\\) under the constraint that the inputs \\(x\\) and \\(y\\) satisfy the equation \\(g(x, y) = x^2 + y^2 = c\\). In our case the constant \\(c\\) was given by 1. We know that the solution is given by those points where the gradients of \\(f\\) and \\(g\\) align. The Lagrangian function puts all of this into a single equation:<\/p>\n<p>$$ \\mathcal{L}(x, y, \\lambda)=f(x, y)-\\lambda(g(x, y)-c) $$<\/p>\n<p>When computing the partial derivatives of \\(\\mathcal{L}\\) with respect to \\(x, y\\) and \\(\\lambda\\) and setting them to zero, we will find that they correspond exactly to the three constraints we looked at earlier. This can be summarized by simply setting the gradient of \\(\\mathcal{L}\\) equal to the zero vector: \\(\\nabla \\mathcal{L} = \\pmb{0}\\). The compact Lagrangian form is often used when solving constrained optimization problem with computers because it summarizes the elaborate problem with multiple constraints into a single equation. For more details and a full solution look at <a href=\"https:\/\/www.khanacademy.org\/math\/multivariable-calculus\/applications-of-multivariable-derivatives\/constrained-optimization\/a\/lagrange-multipliers-single-constraint\">this Khan academy post<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"6.3-Dual-optimization-problem---\"><span class=\"ez-toc-section\" id=\"Dual-optimization-problem\"><\/span>Dual optimization problem <a id=\"dual-optimization-problem\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>For the primal soft-margin SVM we considered the following optimization problem:<br \/>\n$$ \\min_{\\pmb{w}, b, \\pmb{\\xi}} \\frac{1}{2} \\Vert \\pmb{w} \\Vert^2 + C \\sum_{n=1}^N \\xi_n $$<\/p>\n<p>$$ \\text{subject to:} $$$$ y_n(\\langle \\pmb{w}, \\pmb{x}_n \\rangle + b) \\ge 1 &#8211; \\xi_n $$$$ \\xi_i \\gt 0 \\text{ for all } n = 1, &#8230;, N $$<\/p>\n<p>To derive the corresponding Lagrangian we will introduce two Lagrange multipliers: \\(\\alpha_n\\) for the first constraint (that all examples are classified correctly) and \\(\\lambda_n\\) for the second constraint (non-negativity of the slack variables). The Lagrangian is then given by:<\/p>\n<p>$$<br \/>\n\\begin{aligned}<br \/>\n\\mathfrak{L}(\\boldsymbol{w}, b, \\xi, \\alpha, \\gamma)=&amp; \\frac{1}{2}\\|\\boldsymbol{w}\\|^{2}+C \\sum_{n=1}^{N} \\xi_{n} \\\\<br \/>\n&amp; \\underbrace{-\\sum_{n=1}^{N} \\alpha_{n}\\left(y_{n}\\left(\\left\\langle\\boldsymbol{w}, \\boldsymbol{x}_{n}\\right\\rangle+b\\right)-1+\\xi_{n}\\right)}_{\\text{first constraint}} \\underbrace{-\\sum_{n=1}^{N} \\gamma_{n} \\xi_{n}}_{\\text{second constraint}}<br \/>\n\\end{aligned}<br \/>\n$$<\/p>\n<p>Next, we have to compute the partial derivatives of the Lagrangian with respect to the variables \\(\\pmb{w}, b\\) and \\(\\xi\\): \\(\\frac{\\partial \\mathfrak{L}}{\\partial \\pmb{w}}, \\frac{\\partial \\mathfrak{L}}{\\partial b}, \\frac{\\partial \\mathfrak{L}}{\\partial \\xi}\\). When setting the first partial derivative to zero, we obtain an important interim result:<br \/>\n$$ \\pmb{w} = \\sum_{n=1}^N \\alpha_n y_n \\pmb{x}_n $$<\/p>\n<p>This equation states the the optimal solution for the weight vector is given by a linear combination of our training examples. After setting the other partial derivatives to zero, using the result and simplifying the equations, we end up with the following optimization problem (for details see section 12.3.1 of the <a href=\"https:\/\/mml-book.com\">Mathematics for Machine Learning book<\/a>):<\/p>\n<p>$$\\min _{\\boldsymbol{\\alpha}} \\frac{1}{2} \\sum_{i=1}^{N} \\sum_{j=1}^{N} y_{i} y_{j} \\alpha_{i} \\alpha_{j}\\left\\langle\\pmb{x}_{i}, \\pmb{x}_{j}\\right\\rangle-\\sum_{i=1}^{N} \\alpha_{i}$$$$ \\text{subject to:} $$$$\\sum_{i=1}^{N} y_{i} \\alpha_{i}=0$$$$0 \\le \\alpha_{i} \\le C \\text{ for all } i=1, \\ldots, N$$<\/p>\n<p>This constrained quadratic optimization problem can be solved very efficiently, for example with quadratic programming techniques. One popular library for solving dual SVMs is <a href=\"https:\/\/github.com\/cjlin1\/libsvm\">libsvm<\/a> which makes use of a decomposition method to solve the problem (see <a href=\"https:\/\/www.csie.ntu.edu.tw\/~cjlin\/papers\/libsvm.pdf\">this paper<\/a> for more details). However, several other approaches exist.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"7.-Primal-vs.-dual-approach-\"><span class=\"ez-toc-section\" id=\"Primal-vs-dual-approach\"><\/span>Primal vs. dual approach <a id=\"primal-vs-dual\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Most SVM research in the last decade has been about the dual formulation. Why this is the case, is not clear. Both approaches have advantages and disadvantages. In the paper <a href=\"https:\/\/ieeexplore.ieee.org\/abstract\/document\/6796736\">&#8222;Training a Support Vector Machine in the Primal&#8220;<\/a> Chapelle et al. mention the following hypothesis:<\/p>\n<blockquote><p>We believe that it is because SVMs were first introduced in their hard margin formulation (Boser et al., 1992), for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account (Chapelle et al., 2000). We do not pretend that primal optimization is better in general; our main motivation wasto point out that primal and dual are two sides of the same coin and that there is no reason to look always at the same side.<\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h2 id=\"8.-Kernels-\/-non-linear-SVM-\"><span class=\"ez-toc-section\" id=\"Kernels-non-linear-SVM\"><\/span>Kernels \/ non-linear SVM <a id=\"kernel-svms\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3 id=\"8.1-What-is-a-kernel?-\"><span class=\"ez-toc-section\" id=\"What-is-a-kernel\"><\/span>What is a kernel? <a id=\"what-is-a-kernel\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>If you take another look at the optimization equation of dual SVMs you will notice that it computes the inner product \\(\\left\\langle\\pmb{x}_{i}, \\pmb{x}_{j}\\right\\rangle\\) between all datapoints \\(\\pmb{x}_{i}, \\pmb{x}_{j}\\). A kernel is a way to compute this inner product implicitely in some (potentially very high dimensional) feature space. To be more precise: assume we have some mapping function \\(\\varphi\\) which maps an \\(n\\) dimensional input vector to an \\(m\\) dimensional output vector: \\(\\varphi \\, : \\, \\mathbb R^n \\to \\mathbb R^m\\). Given this mapping function, we can compute the dot product of two vectors \\(\\mathbf x\\) and \\(\\mathbf y\\) in this space as follows: \\(\\varphi(\\mathbf x)^T \\varphi(\\mathbf y)\\).<\/p>\n<p>A kernel is a function \\(k\\) that gives the same result as this dot product: \\(k(\\mathbf x, \\mathbf y) = \\varphi(\\mathbf x)^T \\varphi(\\mathbf y)\\). In other words: the kernel function is equivalent to the dot product of the mapping function.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"8.2-What-are-kernels-good-for?-\"><span class=\"ez-toc-section\" id=\"What-are-kernels-good-for\"><\/span>What are kernels good for? <a id=\"what-are-kernels-good-for\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Until now (apart from the soft-margin SVM) our SVMs, both primal and dual, are only able to classify data that is <a href=\"https:\/\/en.wikipedia.org\/wiki\/Linear_separability\">linearly separable<\/a>. However, most datasets in practice will not be of this form. We need a way to classify data that is <strong>not<\/strong> linearly separable. This is where the so called <strong>kernel trick<\/strong> comes into play.<\/p>\n<p>Because the objective function of the dual SVM contains inner products only between datapoints \\(\\pmb{x}_i, \\pmb{x}_j\\), we can easily replace this inner product (that is, \\(\\left\\langle\\pmb{x}_{i}, \\pmb{x}_{j}\\right\\rangle\\) ) with some mapping function \\(\\varphi(\\pmb{x}_i)^T \\varphi(\\pmb{x}_j)\\). This mapping function can be non-linear, allowing us to compute an SVM that is non-linear with respect to the input examples. The mapping function takes our input data (which is not linearly separable) and transforms it into some higher-dimensional space where it becomes linearly separable. This is illustrated in the figure below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21219\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/feature_mapping_illustration.png\" alt=\"The mapping function takes input data and transforms it into some higher-dimensional feature space.\" width=\"776\" height=\"339\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/feature_mapping_illustration.png 1261w, https:\/\/www.inovex.de\/wp-content\/uploads\/feature_mapping_illustration-300x131.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/feature_mapping_illustration-1024x447.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/feature_mapping_illustration-768x336.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/feature_mapping_illustration-400x175.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/feature_mapping_illustration-360x157.png 360w\" sizes=\"auto, (max-width: 776px) 100vw, 776px\" \/><\/p>\n<p>In theory, we could use any mapping function we like. In practice, however, computing inner products is expensive. Therefore, we use mapping functions that have a corresponding kernel function. This will allow us to map the datapoints into a higher dimensional space without ever explicitely computing the (expensive) inner products.<\/p>\n<p>Let&#8217;s take a look at an example.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"8.3-Example-\"><span class=\"ez-toc-section\" id=\"Example\"><\/span>Example<a id=\"kernel-example\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Note: this example was taken from <a href=\"https:\/\/stats.stackexchange.com\/questions\/152897\/how-to-intuitively-explain-what-a-kernel-is\/355046#355046https:\/\/stats.stackexchange.com\/questions\/152897\/how-to-intuitively-explain-what-a-kernel-is\/355046#355046\">this StackExchange post<\/a>.<\/p>\n<p>We can create a simple polynomial kernel as follows: \\(k(\\pmb{x}, \\pmb{y}) = (1 + \\mathbf x^T \\mathbf y)^2\\) with \\(\\mathbf x, \\mathbf y \\in \\mathbb R^2\\). The kernel does not seem to correspond to any mapping function \\(\\varphi\\), it is just a function that returns a real number. Our input vectors \\(\\pmb{x}, \\pmb{y}\\) are 2-dimensional: \\(\\mathbf x = (x_1, x_2)\\) and \\(\\mathbf y = (y_1, y_2)\\). With this knowledge, we can expand the kernel computation:<\/p>\n<p>$$\\begin{align}<br \/>\nk(\\mathbf x, \\mathbf y) &amp; = (1 + \\mathbf x^T \\mathbf y)^2 \\\\<br \/>\n&amp;= (1 + x_1 \\, y_1 + x_2 \\, y_2)^2 \\\\<br \/>\n&amp; = 1 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2<br \/>\n\\end{align}$$<\/p>\n<p>Note that this is nothing else but a dot product between two vectors \\((1, x_1^2, x_2^2, \\sqrt{2} x_1, \\sqrt{2} x_2, \\sqrt{2} x_1 x_2)\\) and \\((1, y_1^2, y_2^2, \\sqrt{2} y_1, \\sqrt{2} y_2, \\sqrt{2} y_1 y_2)\\). This can be expressed with the following mapping function:<br \/>\n$$\\varphi(\\mathbf x) = \\varphi(x_1, x_2) = (1, x_1^2, x_2^2, \\sqrt{2} x_1, \\sqrt{2} x_2, \\sqrt{2} x_1 x_2)$$<\/p>\n<p>So the kernel \\(k(\\mathbf x, \\mathbf y) = (1 + \\mathbf x^T \\mathbf y)^2 = \\varphi(\\mathbf x)^T \\varphi(\\mathbf y)\\) computes a dot product in 6-dimensional space without explicitly visiting this space. The generalization from an inner product to a kernel function is known as the <strong>kernel trick<\/strong>.<\/p>\n<p>Several popular kernel functions exist. Popular ones are, for example, the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Polynomial_kernel\">polyomial kernel<\/a> or <a href=\"https:\/\/en.wikipedia.org\/wiki\/Radial_basis_function_kernel\">RBF kernel<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"8.4-Can-we-also-use-kernels-in-the-primal-SVM?-\"><span class=\"ez-toc-section\" id=\"Can-we-also-use-kernels-in-the-primal-SVM\"><\/span>Can we also use kernels in the primal SVM?<a id=\"kernel-in-primal-svm\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Yes, the kernel trick can be applied to primal SVMs, too. It is not as straightforward as with dual SVMs but still possible. Consider <a href=\"https:\/\/ieeexplore.ieee.org\/document\/6796736\">this paper by Chapelle et al.<\/a> as an example.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"8.5-How-do-I-choose-the-right-kernel-for-my-problem?-\"><span class=\"ez-toc-section\" id=\"How-do-I-choose-the-right-kernel-for-my-problem\"><\/span>How do I choose the right kernel for my problem?<a id=\"choosing-the-right-kernel\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The problem of choosing the right kernel has been answered in <a href=\"https:\/\/stats.stackexchange.com\/questions\/18030\/how-to-select-kernel-for-svm\">this StackExchange post<\/a>.<\/p>\n<p>Summary:<\/p>\n<ul>\n<li>Without expert knowledge, the Radial Basis Function kernel makes a good default kernel (in case you need a non-linear model)<\/li>\n<li>The choice of the kernel and parameters can be automated by optimising a cross-validation based model selection<\/li>\n<li>Choosing the kernel and parameters automatically is tricky, as it is very easy to overfit the model selection criterion<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"cell border-box-sizing text_cell rendered\">\n<div class=\"prompt input_prompt\"><\/div>\n<div class=\"inner_cell\">\n<div class=\"text_cell_render border-box-sizing rendered_html\">\n<h3 id=\"9.-Sources-and-further-reading-\"><span class=\"ez-toc-section\" id=\"Sources-and-further-reading\"><\/span>Sources and further reading<a id=\"sources\" class=\"anchor\"><\/a><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The basis for this notebook is chapter 12 of the book <a href=\"https:\/\/mml-book.github.io\/\">Mathematics for Machine Learning<\/a>. I can highly recommend to read through the entire chapter to get a deeper understanding of support vector machines.<\/p>\n<p>Another source I liked very much is <a href=\"https:\/\/www.youtube.com\/watch?v=_PwhiWxHK8o\">this MIT lecture on SVMs<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>In this post we will talk about a fundamental model in machine learning \u2013 support vector machines (short: SVMs). We will take a detailed look at the theory behind SVMs and implement and train an SVM in plain Python. In particular, we will consider the two approaches for formulating SVMs: the primal and the dual [&hellip;]<\/p>\n","protected":false},"author":194,"featured_media":30727,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[511,151,140],"service":[76],"coauthors":[{"id":194,"display_name":"Anna-Lena Popkes","user_nicename":"apopkes"}],"class_list":["post-21217","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-artificial-intelligence-2","tag-deep-learning","tag-machine-learning","service-artificial-intelligence"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Extensive Guide to Support Vector Machines - inovex GmbH<\/title>\n<meta name=\"description\" content=\"This article takes a detailed look at the theory behind Support Vector Machines and shows how to train an SVM in plain Python.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Extensive Guide to Support Vector Machines - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"This article takes a detailed look at the theory behind Support Vector Machines and shows how to train an SVM in plain Python.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2021-07-07T05:01:55+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-01-18T09:29:57+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/hero-support-vector-machines.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1366\" \/>\n\t<meta property=\"og:image:height\" content=\"768\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Anna-Lena Popkes\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/hero-support-vector-machines-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Anna-Lena Popkes\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"31\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Anna-Lena Popkes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/\"},\"author\":{\"name\":\"Anna-Lena Popkes\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/3df260d7f75d6a1efd09edc3741c0fdf\"},\"headline\":\"Extensive Guide to Support Vector Machines\",\"datePublished\":\"2021-07-07T05:01:55+00:00\",\"dateModified\":\"2024-01-18T09:29:57+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/\"},\"wordCount\":4910,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/hero-support-vector-machines.png\",\"keywords\":[\"Artificial Intelligence\",\"Deep Learning\",\"Machine Learning\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/\",\"name\":\"Extensive Guide to Support Vector Machines - inovex GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/hero-support-vector-machines.png\",\"datePublished\":\"2021-07-07T05:01:55+00:00\",\"dateModified\":\"2024-01-18T09:29:57+00:00\",\"description\":\"This article takes a detailed look at the theory behind Support Vector Machines and shows how to train an SVM in plain Python.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/hero-support-vector-machines.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/hero-support-vector-machines.png\",\"width\":1366,\"height\":768,\"caption\":\"a sylized support vector machine\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/support-vector-machines-guide\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Extensive Guide to Support Vector Machines\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/3df260d7f75d6a1efd09edc3741c0fdf\",\"name\":\"Anna-Lena Popkes\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-lena-96x96.jpgb761218ea0c3098c8e9b2c42d8b06a83\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-lena-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-lena-96x96.jpg\",\"caption\":\"Anna-Lena Popkes\"},\"description\":\"Hi, I'm Anna-Lena, a senior machine learning engineer here at inovex. I am an enthusiastic learner always looking for new challenges to expand my knowledge and skills. I am deeply fascinated by machine learning and its applications to questions that affect and benefit society. This translates into a passion for teaching, where I can share my insights and contribute to the collective growth of the machine learning community.\",\"sameAs\":[\"https:\\\/\\\/alpopkes.com\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/in\\\/anna-lena-popkes\"],\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/apopkes\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Extensive Guide to Support Vector Machines - inovex GmbH","description":"This article takes a detailed look at the theory behind Support Vector Machines and shows how to train an SVM in plain Python.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/","og_locale":"de_DE","og_type":"article","og_title":"Extensive Guide to Support Vector Machines - inovex GmbH","og_description":"This article takes a detailed look at the theory behind Support Vector Machines and shows how to train an SVM in plain Python.","og_url":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2021-07-07T05:01:55+00:00","article_modified_time":"2024-01-18T09:29:57+00:00","og_image":[{"width":1366,"height":768,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/hero-support-vector-machines.png","type":"image\/png"}],"author":"Anna-Lena Popkes","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/hero-support-vector-machines-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Anna-Lena Popkes","Gesch\u00e4tzte Lesezeit":"31\u00a0Minuten","Written by":"Anna-Lena Popkes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/"},"author":{"name":"Anna-Lena Popkes","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/3df260d7f75d6a1efd09edc3741c0fdf"},"headline":"Extensive Guide to Support Vector Machines","datePublished":"2021-07-07T05:01:55+00:00","dateModified":"2024-01-18T09:29:57+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/"},"wordCount":4910,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/hero-support-vector-machines.png","keywords":["Artificial Intelligence","Deep Learning","Machine Learning"],"articleSection":["Analytics","English Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/","url":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/","name":"Extensive Guide to Support Vector Machines - inovex GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/hero-support-vector-machines.png","datePublished":"2021-07-07T05:01:55+00:00","dateModified":"2024-01-18T09:29:57+00:00","description":"This article takes a detailed look at the theory behind Support Vector Machines and shows how to train an SVM in plain Python.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/hero-support-vector-machines.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/hero-support-vector-machines.png","width":1366,"height":768,"caption":"a sylized support vector machine"},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/support-vector-machines-guide\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Extensive Guide to Support Vector Machines"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/3df260d7f75d6a1efd09edc3741c0fdf","name":"Anna-Lena Popkes","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-lena-96x96.jpgb761218ea0c3098c8e9b2c42d8b06a83","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-lena-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-lena-96x96.jpg","caption":"Anna-Lena Popkes"},"description":"Hi, I'm Anna-Lena, a senior machine learning engineer here at inovex. I am an enthusiastic learner always looking for new challenges to expand my knowledge and skills. I am deeply fascinated by machine learning and its applications to questions that affect and benefit society. This translates into a passion for teaching, where I can share my insights and contribute to the collective growth of the machine learning community.","sameAs":["https:\/\/alpopkes.com\/","https:\/\/www.linkedin.com\/in\/anna-lena-popkes"],"url":"https:\/\/www.inovex.de\/de\/blog\/author\/apopkes\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/21217","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/194"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=21217"}],"version-history":[{"count":7,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/21217\/revisions"}],"predecessor-version":[{"id":51057,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/21217\/revisions\/51057"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/30727"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=21217"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=21217"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=21217"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=21217"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}