At its core, a Support Vector Machine (SVM) aims to find the optimal hyperplane that separates data points belonging to different classes with the maximum margin. The term “support vectors” refers to the data points that lie closest to the decision boundary; they alone determine the position and orientation of the hyperplane. SVMs are particularly effective in high-dimensional spaces, which makes them suitable for a wide variety of applications. We will look at the formulation and code implementation of this model.
Formulation
Let us begin our discussion of support vector machines by considering the two-class classification problem using linear models of the form $$y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b$$ where $\phi(\mathbf{x})$ denotes a transformation of the feature space, and the bias parameter $b$ is written explicitly. The training data set comprises $N$ input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$, with target values $t_1, \ldots, t_N$ where $t_n \in \{-1, 1\}$, and new data points $\mathbf{x}_{\text{test}}$ are classified according to the sign of $y(\mathbf{x}_{\text{test}})$.
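As a small illustration of this decision rule, the sketch below evaluates $y(\mathbf{x})$ and classifies points by its sign. It assumes the identity feature map $\phi(\mathbf{x}) = \mathbf{x}$ and uses purely hypothetical values for $\mathbf{w}$ and $b$.

import numpy as np

def decision_function(X, w, b):
    """Evaluate y(x) = w^T x + b for each row of X (identity feature map)."""
    return X @ w + b

def classify(X, w, b):
    """Assign class +1 or -1 according to the sign of y(x)."""
    return np.sign(decision_function(X, w, b))

# Hypothetical parameters, for illustration only
w = np.array([1.0, -2.0])
b = 0.5
X_new = np.array([[0.0, 1.0], [3.0, 0.0]])
print(classify(X_new, w, b))  # [-1.  1.]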
We assume that the dataset is linearly separable in the feature space, so that there exists at least one choice of the parameters $\mathbf{w}$ and $b$ such that $y(\mathbf{x}_n) > 0$ for points having $t_n = +1$ and $y(\mathbf{x}_n) < 0$ for points having $t_n = -1$; equivalently, $t_n y(\mathbf{x}_n) > 0$ for all training data points.
Recall that the perpendicular distance of a point $\mathbf{x}_n$ from the hyperplane defined by $y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b = 0$ is given by $|y(\mathbf{x}_n)| / \lVert \mathbf{w} \rVert$. Since we are only interested in solutions for which $t_n y(\mathbf{x}_n) > 0$ for all data points, this distance can be written as $$\cfrac{t_n y(\mathbf{x}_n)}{\lVert \mathbf{w} \rVert} = \cfrac{t_n(\mathbf{w}^T \phi(\mathbf{x}_n) + b)}{\lVert \mathbf{w} \rVert}$$ In SVM, we find the parameters $\mathbf{w}$ and $b$ by maximising the margin, which is the distance from the separating hyperplane to the closest training point. The maximum margin solution is therefore found by solving $$\underset{\mathbf{w}, b}{\text{arg max}} \left\{ \cfrac{1}{\lVert \mathbf{w} \rVert} \min_n \left( t_n(\mathbf{w}^T \phi(\mathbf{x}_n) + b) \right) \right\}$$ Since this problem is difficult to solve directly, we will convert it into an equivalent problem that is easier to solve.
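As a quick numerical sketch of this quantity (again assuming the identity feature map and hypothetical parameters), the margin is simply the smallest signed distance over the training set; maximising it over $\mathbf{w}$ and $b$ is exactly the problem stated above.

import numpy as np

def margin(X, t, w, b):
    """Smallest signed distance t_n * y(x_n) / ||w|| over the training points."""
    distances = t * (X @ w + b) / np.linalg.norm(w)
    return np.min(distances)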
Notice that if we rescale the parameters $\mathbf{w}$ to $k\mathbf{w}$ and $b$ to $kb$, the distance $\cfrac{t_n(\mathbf{w}^T \phi(\mathbf{x}_n) + b)}{\lVert \mathbf{w} \rVert}$ does not change. We can use this freedom to set $$t_n(\mathbf{w}^T \phi(\mathbf{x}_n) + b) = 1$$ for the points closest to the decision surface. With this choice, every training point satisfies $$t_n(\mathbf{w}^T \phi(\mathbf{x}_n) + b) \ge 1$$ and maximising $1 / \lVert \mathbf{w} \rVert$ is equivalent to minimising $\lVert \mathbf{w} \rVert^2$, so the problem becomes $$\underset{\mathbf{w}, b}{\text{arg min}} \; \cfrac{1}{2} \lVert \mathbf{w} \rVert^2$$ subject to $t_n(\mathbf{w}^T \phi(\mathbf{x}_n) + b) \ge 1$ for all the training points.
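As a brief sketch (identity feature map and hypothetical parameters once more), the canonical constraints of this equivalent problem can be checked directly for a candidate $\mathbf{w}$ and $b$:

import numpy as np

def satisfies_canonical_constraints(X, t, w, b):
    """Check that t_n * (w^T x_n + b) >= 1 holds for every training point."""
    return bool(np.all(t * (X @ w + b) >= 1.0))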
Solving the constrained optimization problem
We can solve the stated constrained optimization problem by forming the Lagrangian $$L(\mathbf{w}, b, \mathbf{\lambda}) = \cfrac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{n=1}^N \lambda_n \left( t_n(\mathbf{w}^T \phi(\mathbf{x}_n) + b) - 1 \right)$$ Setting the gradient of the Lagrangian with respect to $\mathbf{w}$ and $b$ to zero gives $$\mathbf{w} = \sum_{n=1}^{N} \lambda_n t_n \phi(\mathbf{x}_n)$$ and $$\sum_{n=1}^{N} \lambda_n t_n = 0$$
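To see why substituting these conditions back into the Lagrangian works, note that the term $b \sum_n \lambda_n t_n$ vanishes by the second condition, and since $\lVert \mathbf{w} \rVert^2 = \sum_{n} \sum_{m} \lambda_n \lambda_m t_n t_m \phi(\mathbf{x}_n)^T \phi(\mathbf{x}_m)$, the quadratic and constraint terms combine as $$\cfrac{1}{2} \lVert \mathbf{w} \rVert^2 - \sum_{n=1}^N \lambda_n t_n \mathbf{w}^T \phi(\mathbf{x}_n) = -\cfrac{1}{2} \sum_{n} \sum_{m} \lambda_n \lambda_m t_n t_m \phi(\mathbf{x}_n)^T \phi(\mathbf{x}_m)$$ leaving only this quadratic term plus $\sum_{n} \lambda_n$.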
Carrying out this substitution, the Lagrangian reduces to the Lagrange dual $$g(\mathbf{\lambda}) = \sum_{n=1}^N \lambda_n\,-\,\cfrac{1}{2} \sum_{n} \sum_{m} \lambda_n \lambda_m t_n t_m \mathbf{k}(\mathbf{x}_n, \mathbf{x}_m)$$ where $\mathbf{k}(\mathbf{x}_n, \mathbf{x}_m) = \phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m)$ is the kernel function. The constraints of the dual problem are $\lambda_n \ge 0$ for $n = 1, 2, \ldots, N$ and $\sum_{n=1}^{N} \lambda_n t_n = 0$.
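For concreteness, the Gram matrix and the dual objective can be evaluated directly; the sketch below assumes a linear kernel $\mathbf{k}(\mathbf{x}_n, \mathbf{x}_m) = \mathbf{x}_n^T \mathbf{x}_m$ and a given vector of multipliers.

import numpy as np

def dual_objective(lambdas, X, t):
    """Evaluate g(lambda) = sum_n lambda_n - 0.5 * sum_{n,m} lambda_n lambda_m t_n t_m k(x_n, x_m)."""
    K = X @ X.T              # linear kernel (Gram) matrix
    Q = np.outer(t, t) * K   # element-wise t_n * t_m * k(x_n, x_m)
    return np.sum(lambdas) - 0.5 * lambdas @ Q @ lambdas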
We can solve the original problem by maximising this Lagrange dual subject to its constraints, using any off-the-shelf solver, to obtain the optimal multipliers $\mathbf{\lambda}^*$. The solution $\mathbf{w}^*$ is then recovered as $\mathbf{w}^* = \sum_{n} \lambda^*_n t_n \phi(\mathbf{x}_n)$, and $b^*$ can be found using the formula $$b^* = -\cfrac{1}{2} \left( \max_{t_n = -1} \left( \mathbf{w}^{*T} \phi(\mathbf{x}_n) \right) + \min_{t_n = 1} \left( \mathbf{w}^{*T} \phi(\mathbf{x}_n) \right) \right)$$
Python Implementation
The full code implementation of a Support Vector Machine on Gaussian mixtures can be found at https://github.com/SonPhatTranDeveloper/support-vector-machine
For the implementation of SVM, the convex optimization library CVXOPT is used to solve the quadratic programming problem for $\mathbf{\lambda}^*$ (denoted a in the code), with a linear kernel $\mathbf{k}(\mathbf{x}_n, \mathbf{x}_m) = \mathbf{x}_n^T \mathbf{x}_m$.
import numpy as np
import cvxopt
from cvxopt import matrix


def support_vector_machine(X_train, t_train):
    """
    Find the parameters w and b of a hard-margin, linear-kernel Support Vector Machine
    :param X_train: training inputs, numpy array of shape (N, D)
    :param t_train: training targets in {-1.0, +1.0}, numpy array of shape (N,)
    :return: tuple (w, b) defining the decision function y(x) = w^T x + b
    """
    # Get the number of data points
    N = X_train.shape[0]
    # Set up the dual problem as a quadratic program in CVXOPT's standard form:
    # minimise (1/2) a^T P a + q^T a  subject to  G a <= h  and  A a = b
    P = matrix((X_train @ X_train.T) * np.outer(t_train, t_train))
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))                # encodes lambda_n >= 0
    h = matrix(np.zeros(N))
    A = matrix(t_train.reshape(1, -1))    # encodes sum_n lambda_n * t_n = 0
    b_eq = matrix(np.zeros(1))
    # Find the Lagrange multipliers a (the lambdas of the dual problem)
    sol = cvxopt.solvers.qp(P, q, G, h, A, b_eq)
    a = np.array(sol['x']).reshape(-1)
    # Find w = sum_n lambda_n * t_n * x_n
    w = X_train.T @ (a * t_train)
    # Find b from the closest point of each class
    X_1 = X_train[t_train.astype(int) == -1]
    X_2 = X_train[t_train.astype(int) == 1]
    b = (-0.5) * (np.max(X_1 @ w) + np.min(X_2 @ w))
    return w, b
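As a usage sketch, the function above can be exercised on a small two-Gaussian toy data set (illustrative only, not taken from the linked repository):

# Generate a small, linearly separable two-class toy data set (illustrative only)
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
X_train = np.vstack([X_pos, X_neg])
t_train = np.hstack([np.ones(50), -np.ones(50)])

# Fit the SVM and classify the training points by the sign of w^T x + b
w, b = support_vector_machine(X_train, t_train)
predictions = np.sign(X_train @ w + b)
print("Training accuracy:", np.mean(predictions == t_train))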