Apart from classification, Support Vector Machine models can also be applied to regression problems; this variant is often called Support Vector Regression (SVR). The fundamental idea behind SVR is to find a hyperplane in a high-dimensional feature space that best captures the relationship between the input variables and their corresponding continuous output values.
The hyperplane is constructed so that as many data points as possible lie within a specified error threshold of it, known as the epsilon-insensitive tube; errors smaller than epsilon are simply ignored. This makes SVR robust against outliers and lets it focus on fitting the majority of the data within the defined margin. We will look at the formulation and the code implementation of SVR.
Formulation
In simple regularized linear regression, we solve an optimization problem to minimise $$\cfrac{1}{2} \sum_n \{ t_n\,-\,y_n \}^{2} + \cfrac{1}{2} \lVert \mathbf{w} \rVert^2$$ where $y_n = y(\mathbf{x}_n)$ is the model's prediction for input $\mathbf{x}_n$. To obtain sparse solutions, the quadratic error function is replaced by an $\epsilon$-insensitive error function, a notable example of which is $$E_{\epsilon}(y(\mathbf{x})\,-\,t) = \begin{cases} 0, & \text{if } |y(\mathbf{x})\,-\,t| < \epsilon \\ |y(\mathbf{x})\,-\,t|\,-\,\epsilon, & \text{otherwise} \end{cases}$$ We can re-express the optimization problem by introducing slack variables. For each data point $\mathbf{x}_n$, we introduce two slack variables $\xi_n \ge 0$ and $\hat{\xi}_n \ge 0$ for points which have $t_n > y(\mathbf{x}_n) + \epsilon$ and $t_n < y(\mathbf{x}_n)\,-\,\epsilon$ respectively.
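To make the behaviour of this error function concrete, here is a minimal NumPy sketch of the $\epsilon$-insensitive loss (the helper name epsilon_insensitive_loss is just for illustration and is not part of the repository code):

import numpy as np

def epsilon_insensitive_loss(y_pred, t, epsilon):
    """Epsilon-insensitive error: zero inside the tube, linear outside it."""
    residual = np.abs(y_pred - t)
    return np.maximum(0.0, residual - epsilon)

# Errors smaller than epsilon contribute nothing to the loss
print(epsilon_insensitive_loss(np.array([1.0, 2.0, 3.0]),
                               np.array([1.2, 2.9, 1.0]),
                               epsilon=0.5))
# [0.  0.4 1.5]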
Introducing the slack variables allows data points to lie outside the tube when the slack variables are nonzero, and the corresponding conditions are $$t_n \le y(\mathbf{x}_n) + \epsilon + \xi_n$$ and $$t_n \ge y(\mathbf{x}_n)\,-\,\epsilon\,-\,\hat{\xi}_n$$ The error function to be minimised can then be written as $$C \sum_n(\xi_n + \hat{\xi}_n) + \cfrac{1}{2} \lVert \mathbf{w} \rVert^2$$ The constraints are $\xi_n \ge 0$ and $\hat{\xi}_n \ge 0$, as well as the two constraints on $t_n$ stated above.
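Putting these pieces together, with $y(\mathbf{x}) = \mathbf{w}^{\mathsf{T}} \phi(\mathbf{x}) + b$, the primal problem can be stated compactly as the constrained minimisation $$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \hat{\boldsymbol{\xi}}} \; C \sum_n(\xi_n + \hat{\xi}_n) + \cfrac{1}{2} \lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad t_n \le y(\mathbf{x}_n) + \epsilon + \xi_n, \quad t_n \ge y(\mathbf{x}_n)\,-\,\epsilon\,-\,\hat{\xi}_n, \quad \xi_n \ge 0, \quad \hat{\xi}_n \ge 0$$ for all $n$.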
Now, we can introduce the dual variables $a_n \ge 0$, $\hat{a}_n \ge 0$, $\mu_n \ge 0$, $\hat{\mu}_n \ge 0$ and the Lagrangian $$L = C \sum_n(\xi_n + \hat{\xi}_n) + \cfrac{1}{2} \lVert \mathbf{w} \rVert^2 \,-\, \sum_n(\mu_n \xi_n + \hat{\mu}_n \hat{\xi}_n)\,-\,\sum_{n} a_n(\epsilon + \xi_n + y_n\,-\,t_n)\,-\,\sum_{n} \hat{a}_n(\epsilon + \hat{\xi}_n \,-\, y_n + t_n)$$ Next, we find the Lagrange dual by minimising the Lagrangian w.r.t. the primal variables, which gives $$\cfrac{\partial L}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_n (a_n\,-\,\hat{a}_n) \phi(\mathbf{x}_n)$$ $$\cfrac{\partial L}{\partial b} = 0 \Rightarrow \sum_n (a_n\,-\,\hat{a}_n) = 0$$ $$\cfrac{\partial L}{\partial \xi_n} = 0 \Rightarrow a_n + \mu_n = C$$ $$\cfrac{\partial L}{\partial \hat{\xi}_n} = 0 \Rightarrow \hat{a}_n + \hat{\mu}_n = C$$ Substituting these results back into the Lagrangian, we can see that the Lagrange dual of the problem is $$\tilde{L}(\mathbf{a}, \hat{\mathbf{a}}) = -\cfrac{1}{2} \sum_n \sum_m (a_n\,-\,\hat{a}_n) (a_m\,-\,\hat{a}_m) k(\mathbf{x}_n, \mathbf{x}_m)\,-\,\epsilon \sum_n (a_n + \hat{a}_n) + \sum_n (a_n\,-\,\hat{a}_n)t_n$$ where $k(\mathbf{x}_n, \mathbf{x}_m) = \phi(\mathbf{x}_n)^{\mathsf{T}} \phi(\mathbf{x}_m)$ is the kernel function. We now maximise this objective function to obtain the solution of the original problem, subject to $a_n \ge 0$ and $\hat{a}_n \ge 0$. Note that since $\mu_n \ge 0$, $\hat{\mu}_n \ge 0$, $a_n + \mu_n = C$ and $\hat{a}_n + \hat{\mu}_n = C$, we also have the constraints $a_n \le C$ and $\hat{a}_n \le C$. As a result, the box constraints of this dual are $$0 \le a_n \le C$$ $$0 \le \hat{a}_n \le C$$ together with the condition $\sum_n (a_n\,-\,\hat{a}_n) = 0$ obtained from the derivative with respect to $b$.
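To connect this with the implementation below, stack the dual variables into a single vector $\mathbf{z} = (\mathbf{a}, \hat{\mathbf{a}})$ and negate the objective, so that maximising the dual becomes a quadratic program in the standard form accepted by most QP solvers, $$\min_{\mathbf{z}} \; \cfrac{1}{2} \mathbf{z}^{\mathsf{T}} P \mathbf{z} + \mathbf{q}^{\mathsf{T}} \mathbf{z} \quad \text{subject to} \quad G \mathbf{z} \le \mathbf{h}$$ with $$P = \begin{bmatrix} K & -K \\ -K & K \end{bmatrix}, \quad \mathbf{q} = \epsilon \mathbf{1} - \begin{bmatrix} \mathbf{t} \\ -\mathbf{t} \end{bmatrix}, \quad G = \begin{bmatrix} I \\ -I \end{bmatrix}, \quad \mathbf{h} = \begin{bmatrix} C\mathbf{1} \\ \mathbf{0} \end{bmatrix}$$ where $K$ is the Gram matrix with entries $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$. The equality condition $\sum_n (a_n\,-\,\hat{a}_n) = 0$ could be passed to the solver as an additional linear equality constraint; the implementation below does not impose it explicitly and instead recovers $b$ from the support points afterwards.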
Once we have solved for $\mathbf{a}$ and $\hat{\mathbf{a}}$, using a quadratic-programming solver for example, we obtain $\mathbf{w}$ from $\mathbf{w} = \sum_n (a_n\,-\,\hat{a}_n) \phi(\mathbf{x}_n)$. For $b$, we can select a point $\mathbf{x}_n$ with $0 < a_n < C$; by the complementary slackness (KKT) conditions such a point has $\xi_n = 0$ and satisfies $\epsilon + y(\mathbf{x}_n)\,-\,t_n = 0$, so we can solve for $b$ using $$b = t_n\,-\,\epsilon\,-\,\sum_m (a_m\,-\,\hat{a}_m) k(\mathbf{x}_m, \mathbf{x}_n)$$ New predictions can then be generated by $$y(\mathbf{x}) = \sum_{n} (a_n\,-\,\hat{a}_n) k(\mathbf{x}_n, \mathbf{x}) + b$$
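In practice it is numerically more robust to average this estimate over the whole set $\mathcal{M}$ of points satisfying $0 < a_n < C$, which is what the implementation below does: $$b = \cfrac{1}{|\mathcal{M}|} \sum_{n \in \mathcal{M}} \left( t_n\,-\,\epsilon\,-\,\sum_m (a_m\,-\,\hat{a}_m) k(\mathbf{x}_m, \mathbf{x}_n) \right)$$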
Python Implementation
We use the Python library `cvxopt` to solve the quadratic program for $\mathbf{a}$ and $\hat{\mathbf{a}}$; the bias $b$ is then recovered from the support points.
The full code implementation of Support Vector Regression can be found at https://github.com/SonPhatTranDeveloper/support-vector-regression
import numpy as np
import cvxopt
from sklearn.metrics.pairwise import rbf_kernel


def support_vector_regression(X_train, t_train, epsilon, C, gamma):
    """
    Find the parameters of X_train and t_train using Support Vector Regression
    :param X_train: train features
    :param t_train: train labels
    :param epsilon: controls the sensitivity of the error function (width of the tube)
    :param C: controls the trade-off between flatness and the amount of slack
    :param gamma: controls the width of the RBF kernel function
    :return: the dual parameters a (stacked a and a_hat) and the bias b
    """
    # Get the number of training points
    N = t_train.shape[0]

    # Find the Gram matrix using the RBF kernel function
    gram = rbf_kernel(X_train, gamma=gamma)

    # Set up the quadratic program for the stacked vector z = [a, a_hat]
    P = np.hstack((gram, -gram))
    P = np.vstack((P, -P))
    P = cvxopt.matrix(P)
    q = epsilon * np.ones(N * 2) - np.concatenate((t_train, -t_train))
    q = cvxopt.matrix(q)

    # Box constraints 0 <= a_n <= C and 0 <= a_hat_n <= C, written as G z <= h
    G = np.vstack((np.eye(N * 2), -np.eye(N * 2)))
    G = cvxopt.matrix(G)
    h = np.array([C] * (N * 2) + [0.0] * (N * 2))
    h = cvxopt.matrix(h)

    # Solve for a
    sol = cvxopt.solvers.qp(P, q, G, h)
    a = np.array(sol['x']).reshape(-1)

    # Split a into the two dual vectors a and a_hat
    a_1 = a[:N]
    a_2 = a[N:]

    # Find the support points which have 0 < a < C
    x_specific = X_train[np.logical_and(a_1 > 0, a_1 < C), :]
    t_specific = t_train[np.logical_and(a_1 > 0, a_1 < C)]

    # Solve for b by averaging over the support points
    b = 0
    for i in range(x_specific.shape[0]):
        x_current = x_specific[i]
        t_current = t_specific[i]
        k = rbf_kernel(X_train, x_current.reshape(1, -1), gamma=gamma).reshape(-1)
        b += t_current - epsilon - np.dot(a_1 - a_2, k)
    b = b / x_specific.shape[0]

    return a, b
def predict(X_train, a, b, X_test, gamma):
    """
    Perform prediction using the found parameters
    :param X_train: train features
    :param a: dual parameters (stacked a and a_hat)
    :param b: bias parameter
    :param X_test: test points
    :param gamma: RBF kernel parameter
    :return: the predictions
    """
    # Split a into two vectors
    N = X_train.shape[0]
    a_1 = a[:N]
    a_2 = a[N:]

    # Make a prediction for each test point
    predictions = []
    for x_test in X_test:
        k = rbf_kernel(X_train, x_test.reshape(1, -1), gamma=gamma).reshape(-1)
        y = np.dot(a_1 - a_2, k) + b
        predictions.append(y)
    return np.array(predictions)
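As a quick sanity check, here is a minimal usage sketch on synthetic one-dimensional data, assuming the two functions above are defined in the same file; the hyperparameter values are illustrative rather than tuned.

# Noisy sine curve as a toy regression problem
rng = np.random.default_rng(0)
X_train = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
t_train = np.sin(X_train).ravel() + 0.1 * rng.standard_normal(50)

# Fit the dual variables and bias, then predict on a dense grid
a, b = support_vector_regression(X_train, t_train, epsilon=0.1, C=1.0, gamma=1.0)
X_test = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y_pred = predict(X_train, a, b, X_test, gamma=1.0)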