In the first few posts on this site, we discussed Gaussian Processes and how they can be applied to regression, following the book “Gaussian Processes for Machine Learning” by Rasmussen and Williams. In today’s post, as I work through the PRML book by Bishop, we will revisit this topic, covering both the formulation and a code implementation. The differences are mostly notational, but I will include a new code repository that uses a different dataset (this time a sinusoidal one) and a different kernel function to illustrate the idea.
Formulation
Firstly, let us consider a model defined in terms of a linear combination of fixed basis functions, given by the elements of the vector $\phi(\mathbf{x})$, such that $$y(\mathbf{x}) = \mathbf{w}^T \mathbf{\phi}(\mathbf{x})$$ Now, we turn our attention to the vector $\mathbf{y}$ whose elements are $y(\mathbf{x}_1), y(\mathbf{x}_2), …, y(\mathbf{x}_N)$. It can be seen that $$\mathbf{y} = \mathbf{\Phi} \mathbf{w}$$ where the rows of the design matrix $\mathbf{\Phi}$ are $\phi(\mathbf{x}_1)^T, \phi(\mathbf{x}_2)^T, …, \phi(\mathbf{x}_N)^T$.
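To make the mapping from a weight vector to a whole function concrete, here is a minimal sketch that builds a design matrix and evaluates $\mathbf{y} = \mathbf{\Phi} \mathbf{w}$ on a grid of inputs. The Gaussian-shaped basis functions, their centres, and the width are purely illustrative choices, not anything fixed by the model.

# A minimal sketch of y = Phi w for a fixed weight vector, using hypothetical
# Gaussian-shaped basis functions (the choice of basis is purely illustrative).
import numpy as np

def design_matrix(x, centers, width=0.5):
    # Row n of Phi is phi(x_n)^T, i.e. every basis function evaluated at x_n
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

x = np.linspace(-1.0, 1.0, 100)       # inputs x_1, ..., x_N
centers = np.linspace(-1.0, 1.0, 9)   # centres of the basis functions
Phi = design_matrix(x, centers)       # N x M design matrix

w = np.random.randn(len(centers))     # some weight vector
y = Phi @ w                           # y_n = w^T phi(x_n) for every input at once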
Let us assume that $\mathbf{w}$ has a Gaussian prior with mean $\mathbf{0}$ and covariance $\alpha^{-1} \mathbf{I}$. Since $\mathbf{w}$ is Gaussian and $\mathbf{y}$ is a linear function of $\mathbf{w}$, $\mathbf{y}$ is also Gaussian, with mean and covariance given by $$E[\mathbf{y}] = \mathbf{\Phi} E[\mathbf{w}] = \mathbf{0}$$ and $$\mathrm{Cov}(\mathbf{y}) = \mathbf{\Phi} E[\mathbf{w} \mathbf{w}^T] \mathbf{\Phi}^T = \cfrac{1}{\alpha} \mathbf{\Phi} \mathbf{\Phi}^T = \mathbf{K}$$ where $\mathbf{K}$ is the Gram matrix with elements $$K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) = \cfrac{1}{\alpha} \phi(\mathbf{x}_n)^T \phi(\mathbf{x}_m)$$ To apply Gaussian process models to regression, we also need to account for the noise on the observed target values, given by $$t_n = y_n + \epsilon_n$$ where $\epsilon_n$ is zero-mean Gaussian noise with precision $\beta$. The distribution of the target vector $\mathbf{t}$ conditioned on $\mathbf{y}$ is therefore an isotropic Gaussian of the form $$p(\mathbf{t} | \mathbf{y}) = N(\mathbf{t} | \mathbf{y}, \beta^{-1} \mathbf{I})$$ We can find the marginal of $\mathbf{t}$ as follows: $$p(\mathbf{t}) = \int p(\mathbf{t} | \mathbf{y}) p(\mathbf{y}) \, d\mathbf{y} = N(\mathbf{t} | \mathbf{0}, \mathbf{C})$$ where the elements of the matrix $\mathbf{C}$ are $$C(\mathbf{x}_n, \mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m) + \beta^{-1} \delta_{nm}$$
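As a quick illustration of how the marginal covariance $\mathbf{C}$ is built, the sketch below uses scikit-learn's RBF kernel as a stand-in for $k(\mathbf{x}_n, \mathbf{x}_m)$; the particular kernel, the value of gamma, and the noise precision are assumptions made only for this example.

# Building K and C = K + beta^{-1} I for a handful of training inputs
# (RBF kernel, gamma and beta are illustrative choices).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

beta = 25.0                                   # assumed noise precision
x_train = np.linspace(0.0, 1.0, 5).reshape(-1, 1)

K = rbf_kernel(x_train, gamma=10.0)           # K[n, m] = k(x_n, x_m)
C = K + (1.0 / beta) * np.eye(len(x_train))   # C[n, m] = k(x_n, x_m) + beta^{-1} delta_nm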
Similarly, the joint distribution over $t_1, \dots, t_{N+1}$ is given by $$p(\mathbf{t}_{N+1}) = N(\mathbf{t}_{N+1} | \mathbf{0}, \mathbf{C}_{N+1})$$ where $\mathbf{C}_{N+1}$ is the partitioned matrix $$\mathbf{C}_{N+1} = \begin{bmatrix} \mathbf{C}_{N} & \mathbf{k} \\ \mathbf{k}^T & c\end{bmatrix}$$ Here the vector $\mathbf{k}$ has elements $k(\mathbf{x}_n, \mathbf{x}_{N+1})$ for $n = 1, \dots, N$ and the scalar is $c = k(\mathbf{x}_{N+1}, \mathbf{x}_{N+1}) + \beta^{-1}$.
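The partitioned form of $\mathbf{C}_{N+1}$ translates directly into code; the sketch below assembles it with np.block, again with an arbitrary RBF kernel and noise precision chosen just for illustration.

# Assembling C_{N+1} from C_N, k and c (all values are placeholders).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

beta = 25.0
x_train = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
x_new = np.array([[0.3]])

C_N = rbf_kernel(x_train, gamma=10.0) + (1.0 / beta) * np.eye(len(x_train))
k = rbf_kernel(x_train, x_new, gamma=10.0)             # column vector with entries k(x_n, x_{N+1})
c = rbf_kernel(x_new, gamma=10.0)[0, 0] + 1.0 / beta   # scalar k(x_{N+1}, x_{N+1}) + beta^{-1}

C_N_plus_1 = np.block([[C_N, k],
                       [k.T, np.array([[c]])]])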
Finally, the conditional distribution $p(t_{N+1} | \mathbf{t})$ is also Gaussian, and can be shown to have mean and variance $$m_{N+1} = \mathbf{k}^T \mathbf{C}_N^{-1} \mathbf{t}$$ and $$\sigma^2_{N+1} = c - \mathbf{k}^T \mathbf{C}_N^{-1} \mathbf{k}$$
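Read literally, these two equations are just a pair of linear solves against $\mathbf{C}_N$. The sketch below is a direct, naive transcription of them, for intuition only; the Cholesky-based implementation in the next section is the numerically preferable way to compute the same quantities.

# Naive predictive mean and variance, straight from the formulas above.
import numpy as np

def gp_predict_naive(C_N, k, c, t):
    # m_{N+1} = k^T C_N^{-1} t
    mean = k @ np.linalg.solve(C_N, t)
    # sigma^2_{N+1} = c - k^T C_N^{-1} k
    var = c - k @ np.linalg.solve(C_N, k)
    return mean, var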
Code implementation
The code implementation for Gaussian process regression can be found at https://github.com/SonPhatTranDeveloper/gaussian-process-for-regression-updated. The core function is shown below.
import numpy as np
import scipy.linalg
from sklearn.metrics.pairwise import rbf_kernel


def gaussian_process_regression(X, Y, noise_var, gamma, test_points):
    """
    Calculate the Gaussian Process prediction
    :param X: 1-D array of training inputs
    :param Y: 1-D array of training targets
    :param noise_var: noise variance (beta^{-1} in the formulation above)
    :param gamma: RBF kernel parameter
    :param test_points: 1-D array of test inputs
    :return: mean and variance of the prediction at each test point
    """
    # Calculate the covariance (Gram) matrix of the training inputs
    K = rbf_kernel(X.reshape(-1, 1), gamma=gamma)

    # Perform Cholesky factorization of C_N = K + noise_var * I
    L = np.linalg.cholesky(K + noise_var * np.eye(K.shape[0]))

    # Calculate alpha = C_N^{-1} Y via two triangular solves
    beta = scipy.linalg.solve_triangular(L, Y, lower=True)
    alpha = scipy.linalg.solve_triangular(L.T, beta, lower=False)

    # Calculate the mean and variance at each test point
    means = []
    vars = []
    for test_point in test_points:
        # Vector k of covariances between the test point and the training inputs
        k_star = rbf_kernel(np.array([test_point]).reshape(-1, 1), X.reshape(-1, 1), gamma=gamma).reshape(-1)

        # Predictive mean: m = k^T C_N^{-1} t
        mean_f = np.dot(k_star, alpha)

        # Predictive variance: k(x*, x*) - k^T C_N^{-1} k
        # (variance of the noise-free function; add noise_var to get the variance of t)
        v = scipy.linalg.solve_triangular(L, k_star.reshape(-1), lower=True)
        var_f = (rbf_kernel(np.array([test_point]).reshape(-1, 1), np.array([test_point]).reshape(-1, 1), gamma=gamma)
                 - np.dot(v, v))[0][0]

        # Append to the mean and variance lists
        means.append(mean_f)
        vars.append(var_f)

    # Return the means and variances
    return np.array(means), np.array(vars)
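As a usage sketch, the snippet below applies the function to a noisy sinusoidal dataset similar in spirit to the one used in the repository; the sample size, noise level, and value of gamma are arbitrary choices for illustration, not necessarily the exact settings used there.

# Illustrative usage on a noisy sinusoidal dataset (all settings are assumptions).
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 2.0 * np.pi, 20))          # training inputs
Y = np.sin(X) + rng.normal(0.0, 0.1, size=X.shape)      # noisy sinusoidal targets
test_points = np.linspace(0.0, 2.0 * np.pi, 100)

means, variances = gaussian_process_regression(X, Y, noise_var=0.01, gamma=0.5, test_points=test_points)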