We have looked at how to apply Gaussian Processes to the problem of regression; let us now explore their application to classification. As we will discover, many of the integrals that arise in this problem are analytically intractable, so we will make heavy use of the Laplace approximation to approximate some distributions as Gaussians. We will also look at a code implementation of this model and apply it to classify data drawn from Gaussian mixtures.
Formulation
Consider first the two-class problem with a target variable $t \in \{0, 1\}$. If we define a Gaussian process over a function $a(\mathbf{x})$ and then transform it using a logistic sigmoid $y = \sigma(a)$, we obtain a non-Gaussian stochastic process over functions $y(\mathbf{x})$, where $y \in (0, 1)$.
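To get a feel for this construction, the following minimal sketch (not part of the classification model itself) draws a few sample functions $a(\mathbf{x})$ from a zero-mean Gaussian process prior with an RBF kernel and squashes them through the logistic sigmoid; the input grid, the kernel parameter and the small jitter term are illustrative choices.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Grid of one-dimensional input points at which the sampled functions are evaluated
x = np.linspace(-5, 5, 200).reshape(-1, 1)

# GP prior covariance over a(x): RBF kernel plus a small jitter for numerical stability
K = rbf_kernel(x, x, gamma=0.5) + 1e-8 * np.eye(len(x))

# Draw three sample functions a(x) from the zero-mean GP prior
a_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)

# Squash each sample through the logistic sigmoid to obtain y(x) in (0, 1)
y_samples = 1 / (1 + np.exp(-a_samples))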
Similar to Gaussian Processes for regression, the Gaussian process prior over $\mathbf{a}_{N+1}$ takes the form $$p(\mathbf{a}_{N+1}) = N(\mathbf{a}_{N+1} | \mathbf{0}, \mathbf{C})$$ However, unlike the regression case, the covariance matrix $\mathbf{C}$ no longer includes an explicit noise term because we assume that all of the training data points are correctly labelled. Nonetheless, it is convenient to introduce a noise-like term governed by a parameter $\nu$ to ensure that the covariance matrix is positive definite.
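Concretely, the elements of the resulting covariance matrix combine the kernel function with the $\nu$ term on the diagonal, $$C(\mathbf{x}_n, \mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m) + \nu\, \delta_{nm}$$ which is what the implementation below computes by adding $\nu$ times the identity matrix to the RBF kernel matrix.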
We can formulate the predictive distribution of $t_{N+1}$ as follows $$p(t_{N+1} = 1 | \mathbf{t}_N) = \int p(t_{N+1} = 1 | a_{N + 1}) p(a_{N+1} | \mathbf{t}_N)\, d a_{N+1}$$ where $p(t_{N+1} = 1 | a_{N + 1})$ is given by the logistic sigmoid, $p(t_{N+1} = 1 | a_{N + 1}) = \sigma (a_{N + 1})$.
As we discussed in the Bayesian Logistic Regression post, this integral is analytically intractable, but we can approximate the sigmoid function with a probit function. Furthermore, if we have a Gaussian approximation to the posterior distribution $p(a_{N+1} | \mathbf{t}_N)$, we can use the formula for the convolution of a probit with a Gaussian distribution from the previous post.
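For reference, once the posterior is approximated as a Gaussian, $p(a_{N+1} | \mathbf{t}_N) \approx N(a_{N+1} | \mu_a, \sigma_a^2)$, that convolution formula yields the approximation $$p(t_{N+1} = 1 | \mathbf{t}_N) \simeq \sigma \left( \cfrac{\mu_a}{\sqrt{1 + \pi \sigma_a^2 / 8}} \right)$$ which is exactly the expression evaluated at the end of the prediction code later in this post.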
Approximating $p(a_{N+1} | \mathbf{t}_N)$
We can calculate the posterior distribution of $a_{N+1}$ given $\mathbf{t}_N$ by conditioning on $\mathbf{a}_N$ and marginalising it out, such that $$p(a_{N+1} | \mathbf{t}_N) = \int p(a_{N+1}, \mathbf{a}_N | \mathbf{t}_N)\,d\mathbf{a}_N= \int p(a_{N+1} | \mathbf{a}_N) p(\mathbf{a}_N | \mathbf{t}_N)\,d\mathbf{a}_N$$ Similar to the regression case, the conditional distribution is Gaussian, $$p(a_{N+1} | \mathbf{a}_N) = N(a_{N+1} | \mathbf{k}^T \mathbf{C}_N^{-1} \mathbf{a}_N, c\,-\,\mathbf{k}^T \mathbf{C}_N^{-1} \mathbf{k})$$ where, as in the regression case, $\mathbf{k}$ is the vector with elements $k(\mathbf{x}_n, \mathbf{x}_{N+1})$ and $c = k(\mathbf{x}_{N+1}, \mathbf{x}_{N+1}) + \nu$. The goal now is to approximate the posterior distribution $p(\mathbf{a}_N | \mathbf{t}_N)$ with a Gaussian using the Laplace Approximation.
Firstly, we see that the prior $p(\mathbf{a}_N)$ is Gaussian with zero mean and covariance matrix $\mathbf{C}_N$. Also, the data likelihood, $p(\mathbf{t}_N | \mathbf{a}_N)$, is given by $$p(\mathbf{t}_N | \mathbf{a}_N) = \prod_{n=1}^N \sigma (a_n)^{t_n} (1\,-\,\sigma(a_n))^{1\,-\,t_n} = \prod_{n=1}^{N} e^{a_n t_n} \sigma(-a_n)$$ From these, we can calculate $\ln p(\mathbf{a}_N | \mathbf{t}_N)$. We have that $$\ln p(\mathbf{a}_N | \mathbf{t}_N) = \ln p(\mathbf{t}_N | \mathbf{a}_N) + \ln p(\mathbf{a}_N) + \text{const} = -\cfrac{1}{2} \mathbf{a}_N^T \mathbf{C}_N^{-1} \mathbf{a}_N\,-\,\cfrac{1}{2} \ln | \mathbf{C}_N | + \mathbf{t}_N^T\mathbf{a}_N\,-\,\sum_{n=1}^N \ln(1 + e^{a_n}) + \text{const}$$ To find the mode of this distribution, we first take the derivative of $\ln p(\mathbf{a}_N | \mathbf{t}_N)$ w.r.t. $\mathbf{a}_N$. We have that $$\nabla \ln p(\mathbf{a}_N | \mathbf{t}_N) = \mathbf{t}_N\,-\,\mathbf{\sigma}_N\,-\,\mathbf{C}_N^{-1} \mathbf{a}_{N}$$ where $\mathbf{\sigma}_N$ is the vector with elements $\sigma (a_n)$. Since $\mathbf{\sigma}_N$ depends non-linearly on $\mathbf{a}_N$, we cannot find the mode simply by setting this gradient to zero, so we use the iterative Newton-Raphson method instead.
Finding the Hessian of $\ln p(\mathbf{a}_N | \mathbf{t}_N)$ w.r.t. $\mathbf{a}_N$, we have that $$\nabla^2 \ln p(\mathbf{a}_N | \mathbf{t}_N) = -\,\mathbf{W}_N\,-\,\mathbf{C}_N^{-1}$$ where $\mathbf{W}_N$ is a diagonal matrix with diagonal elements $\sigma(a_n) (1\,-\,\sigma(a_n))$.
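To make the update explicit, the Newton-Raphson step combines the gradient and Hessian derived above, $$\mathbf{a}_N^{\text{new}} = \mathbf{a}_N\,-\,\left( \nabla^2 \ln p(\mathbf{a}_N | \mathbf{t}_N) \right)^{-1} \nabla \ln p(\mathbf{a}_N | \mathbf{t}_N) = \mathbf{a}_N + (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1} ( \mathbf{t}_N\,-\,\mathbf{\sigma}_N\,-\,\mathbf{C}_N^{-1} \mathbf{a}_N )$$ together with the matrix identity $(\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1} = \mathbf{C}_N (\mathbf{I} + \mathbf{W}_N \mathbf{C}_N)^{-1}$.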
Rearranging, the iterative update equation for $\mathbf{a}_N$ becomes $$\mathbf{a}_N^{\text{new}} = \mathbf{C}_N (\mathbf{I} + \mathbf{W}_N \mathbf{C}_N)^{-1} \{ \mathbf{t}_N\,-\,\mathbf{\sigma}_N + \mathbf{W}_N \mathbf{a}_N \}$$ At convergence, the mode $\mathbf{a}_N^*$ satisfies $\mathbf{a}_N^* = \mathbf{C}_N (\mathbf{t}_N\,-\,\mathbf{\sigma}_N)$. Once we have the mode, we can evaluate the (negative) Hessian $\mathbf{H} = \mathbf{W}_N + \mathbf{C}_N^{-1}$ at $\mathbf{a}_N^*$ to obtain the covariance matrix $\mathbf{H}^{-1}$. The approximate distribution is then $$q(\mathbf{a}_N) = N(\mathbf{a}_N | \mathbf{a}_N^*, \mathbf{H}^{-1})$$ We can use this approximation and the rule for the marginal of a Gaussian to evaluate the integral for $p(a_{N+1} | \mathbf{t}_N)$. It is also Gaussian, with mean and variance $$E[a_{N+1} | \mathbf{t}_N] = \mathbf{k}^T (\mathbf{t}_N\,-\,\mathbf{\sigma}_N)$$ and $$\text{var}(a_{N+1} | \mathbf{t}_N) = c\,-\,\mathbf{k}^T(\mathbf{W}_N^{-1} + \mathbf{C}_N)^{-1} \mathbf{k}$$ Now that we have a Gaussian approximation for $p(a_{N+1} | \mathbf{t}_N)$, we can readily approximate the convolution of the probit with this Gaussian using the formula from the Bayesian Logistic Regression post.
Code implementation
The full Python implementation of Gaussian Process classification can be found at https://github.com/SonPhatTranDeveloper/gaussian-process-classification. The main functions are shown below.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def sigmoid(a):
    """
    Define the sigmoid function
    :param a: input to the sigmoid function
    :return: value of the sigmoid function
    """
    return 1 / (1 + np.exp(-a))
def find_posterior_mean(X, t, nu, gamma, learning_rate, epochs):
    """
    Find the mode of the distribution p(a_N | t_N)
    :param X: train features
    :param t: train labels
    :param nu: added term to C, the covariance matrix
    :param gamma: the parameter of the RBF kernel function
    :param learning_rate: learning rate of Newton's method
    :param epochs: the number of epochs to update the mode in Newton's method
    :return: the mode of the posterior distribution and the cost at each epoch
    """
    # Get the dimension
    N = X.shape[0]

    # Calculate the covariance matrix C_N
    C_N = rbf_kernel(X, gamma=gamma) + nu * np.eye(N)

    # Find the eigenvalues of C_N (used for the log-determinant term of the cost)
    eigenvalues = np.linalg.eigvalsh(C_N)

    # Initialise a random mode
    mode = np.random.uniform(size=N)

    # Use the Newton-Raphson method to iteratively update the mode
    costs = []
    for epoch in range(epochs):
        # Calculate sigma_N and W_N at the current mode
        sigma_N = sigmoid(mode)
        W_N = np.diag(sigma_N * (1 - sigma_N))

        # Newton-Raphson step: a_new = C_N (I + W_N C_N)^{-1} (t - sigma_N + W_N a)
        mode_new = C_N @ np.linalg.solve(np.eye(N) + W_N @ C_N, t - sigma_N + W_N @ mode)

        # Damped update of the mode
        mode = mode + learning_rate * (mode_new - mode)

        # Calculate the cost (the log posterior up to an additive constant)
        cost = (-0.5 * np.dot(mode, np.linalg.solve(C_N, mode)) - 0.5 * np.sum(np.log(eigenvalues))
                + np.dot(t, mode)) - np.sum(np.log(1 + np.exp(mode)))

        # Append the cost and display
        print(f"EPOCHS: {epoch} - Cost: {cost}")
        costs.append(cost)
    return mode, costs
def find_posterior_covariance(X, t, mode, nu, gamma):
    """
    Find the covariance matrix of the distribution p(a_N | t_N)
    :param X: train features
    :param t: train labels
    :param mode: mode of the distribution
    :param nu: added term to C, the covariance matrix
    :param gamma: the parameter of the RBF kernel function
    :return: the covariance matrix of the distribution
    """
    # Get the dimension
    N = X.shape[0]

    # Calculate the covariance matrix C_N
    C_N = rbf_kernel(X, gamma=gamma) + nu * np.eye(N)

    # Calculate W_N
    sigma_N = sigmoid(mode)
    W_N = np.diag(sigma_N * (1 - sigma_N))

    # Calculate the (negative) Hessian H = W_N + C_N^{-1} and invert it
    H = W_N + np.linalg.inv(C_N)
    return np.linalg.inv(H)
def prediction_function(X_test, X_train, t, mode, nu, gamma):
    """
    Predict p(t_{N+1} | t_N)
    :param X_test: the test points
    :param X_train: the train features
    :param t: the train labels
    :param mode: the mode of p(a_N | t_N)
    :param nu: additive term to the covariance matrix C_N
    :param gamma: the parameter of the RBF kernel function
    :return: the prediction
    """
    # Get the dimension
    N = X_train.shape[0]

    # Calculate the covariance matrix C_N
    C_N = rbf_kernel(X_train, gamma=gamma) + nu * np.eye(N)

    # Calculate the matrix W_N and its inverse
    sigma_N = sigmoid(mode)
    W_N = np.diag(sigma_N * (1 - sigma_N))
    W_N_inv = np.linalg.inv(W_N)

    # Calculate the prediction for each test point
    predictions = []
    for x_test in X_test:
        # Calculate k, the kernel vector between the training points and the test point
        k = rbf_kernel(X_train, x_test.reshape(1, -1), gamma=gamma).reshape(-1)

        # Calculate c = k(x_test, x_test) + nu, then the predictive mean and variance
        c = rbf_kernel(x_test.reshape(1, -1), x_test.reshape(1, -1), gamma=gamma)[0][0] + nu
        mean = np.dot(k, t - sigma_N)
        var = c - np.dot(k, np.linalg.solve(W_N_inv + C_N, k))

        # Approximate the convolution of the sigmoid with the Gaussian (probit approximation)
        prediction = sigmoid(mean / np.sqrt(1 + np.pi * var / 8))
        predictions.append(prediction)
    return np.array(predictions)
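To show how these pieces fit together, here is a minimal usage sketch on synthetic two-class data drawn from a mixture of Gaussians; the dataset, the hyper-parameters and the train/test split below are illustrative choices rather than values taken from the repository.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Illustrative two-class dataset drawn from a mixture of two Gaussians
X, t = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.25, random_state=0)

# Illustrative hyper-parameters
nu, gamma, learning_rate, epochs = 1e-3, 0.5, 1.0, 20

# Fit the Laplace approximation q(a_N) on the training set
mode, costs = find_posterior_mean(X_train, t_train, nu, gamma, learning_rate, epochs)
S_N = find_posterior_covariance(X_train, t_train, mode, nu, gamma)

# Predict p(t_{N+1} = 1 | t_N) for each test point and threshold at 0.5
probabilities = prediction_function(X_test, X_train, t_train, mode, nu, gamma)
labels = (probabilities >= 0.5).astype(int)
accuracy = np.mean(labels == t_test)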