Bayesian Neural Networks (BNNs) represent a fusion of two powerful concepts in machine learning: neural networks and Bayesian inference. While traditional neural networks are deterministic and lack a natural way to account for uncertainty, Bayesian Neural Networks introduce a probabilistic framework that allows for the representation of uncertainty in both the model parameters and predictions.
In a Bayesian Neural Network, instead of having fixed weights as in a standard neural network, the weights are treated as probability distributions. This means that rather than providing a single point estimate for each weight, the model captures a range of possible values, expressing the inherent uncertainty in the parameters. We will look at a derivation for a simple Bayesian neural network, followed by a code implementation of the concept.
Derivation
Let us consider the regression problem of predicting a continuous target variable $t$ from a vector $\mathbf{x}$ of inputs. We will assume that the conditional distribution $p(t | \mathbf{x})$ is Gaussian, with the mean given by the output of a neural network model $y(\mathbf{x}, \mathbf{w})$, and with precision $\beta$. More concretely, we have that $$p(t | \mathbf{x}, \mathbf{w}, \beta) = N(t | y(\mathbf{x}, \mathbf{w}), \beta^{-1})$$
We choose a prior distribution on the weights $\mathbf{w}$, which is a multivariate Gaussian with precision matrix $\alpha \mathbf{I}$. More specifically, $$p(\mathbf{w} | \alpha) = N(\mathbf{w} | \mathbf{0}, \alpha^{-1} \mathbf{I})$$ The likelihood for the dataset is $$p(D | \mathbf{w}, \beta) = \prod_{n=1}^N N(t_n | y(\mathbf{x}_n, \mathbf{w}), \beta^{-1})$$ From the prior and the likelihood, we can calculate the posterior distribution of $\mathbf{w}$ $$p(\mathbf{w} | D, \alpha, \beta) \propto p(\mathbf{w} | \alpha) \, p(D | \mathbf{w}, \beta)$$ In the case of Bayesian linear regression, where $y(\mathbf{x}, \mathbf{w})$ is a linear function of $\mathbf{w}$, this quantity turns out to be Gaussian. However, in a neural network, where $y(\mathbf{x}, \mathbf{w})$ is an arbitrary nonlinear function of $\mathbf{w}$, the posterior will generally be non-Gaussian.
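Taking logarithms makes the structure of the posterior explicit: up to an additive normalization constant, $$\ln p(\mathbf{w} | D, \alpha, \beta) = \ln p(\mathbf{w} | \alpha) + \ln p(D | \mathbf{w}, \beta) + \text{const}$$ Substituting the Gaussian forms above yields the expression used in the next section.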
Laplace Approximation
We can approximate the posterior of $\mathbf{w}$ using the Laplace approximation, which involves finding a mode $\mathbf{w}_{MAP}$, i.e. a local maximum of the log posterior $$\ln p(\mathbf{w} | D) = -\cfrac{\alpha}{2} \mathbf{w}^T\mathbf{w} \, - \, \cfrac{\beta}{2} \sum_{n=1}^N \{ y(\mathbf{x}_n, \mathbf{w}) \, - \, t_n \}^2 + \text{const}$$ In general this is only a local maximum because, unlike in linear regression, the log posterior is not guaranteed to be concave.
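Maximizing the log posterior is equivalent to minimizing the regularized sum-of-squares error $$E(\mathbf{w}) = \cfrac{\beta}{2} \sum_{n=1}^N \{ y(\mathbf{x}_n, \mathbf{w}) \, - \, t_n \}^2 + \cfrac{\alpha}{2} \mathbf{w}^T \mathbf{w}$$ which is the objective the implementation below minimizes by gradient descent.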
Having found a mode $\mathbf{w}_{MAP}$, we can then build a local Gaussian approximation by evaluating the matrix of second derivatives of the negative log posterior distribution, $$\mathbf{A} = -\nabla_{\mathbf{w}}^2 \ln p(\mathbf{w} | D) = \alpha \mathbf{I} + \beta \mathbf{H}$$ where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of $\mathbf{w}$, evaluated at $\mathbf{w}_{MAP}$. We can approximate $\mathbf{H}$ using the diagonal approximation discussed below. Once we have found $\mathbf{A}$, the approximate posterior distribution of $\mathbf{w}$ is $$q(\mathbf{w} | D) = N(\mathbf{w} | \mathbf{w}_{MAP}, \mathbf{A}^{-1})$$ The predictive distribution is calculated by averaging over $\mathbf{w}$: $$p(t | \mathbf{x}, D, \alpha, \beta) = \int p(t | \mathbf{x}, \mathbf{w}, \beta) \, q(\mathbf{w} | D) \, d \mathbf{w}$$ Even with the Gaussian approximation to the posterior, this integral is still analytically intractable because $y(\mathbf{x}, \mathbf{w})$ is a nonlinear function of $\mathbf{w}$.
To handle this, we linearize $y(\mathbf{x}, \mathbf{w})$ around $\mathbf{w}_{MAP}$: $$y(\mathbf{x}, \mathbf{w}) \simeq y(\mathbf{x}, \mathbf{w}_{MAP}) + \mathbf{g}^T (\mathbf{w} \,-\, \mathbf{w}_{MAP})$$ where $\mathbf{g} = \nabla_{\mathbf{w}} y(\mathbf{x}, \mathbf{w})$ evaluated at $\mathbf{w}_{MAP}$.
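This reduces the integrand to the standard linear-Gaussian form, for which the marginal is known in closed form: for any vector $\mathbf{g}$, scalar $b$, mean $\mathbf{m}$ and covariance $\mathbf{A}^{-1}$, $$\int N(t \,|\, \mathbf{g}^T \mathbf{w} + b, \, \beta^{-1}) \, N(\mathbf{w} \,|\, \mathbf{m}, \mathbf{A}^{-1}) \, d\mathbf{w} = N(t \,|\, \mathbf{g}^T \mathbf{m} + b, \, \beta^{-1} + \mathbf{g}^T \mathbf{A}^{-1} \mathbf{g})$$ Applying this with $\mathbf{m} = \mathbf{w}_{MAP}$ and $b = y(\mathbf{x}, \mathbf{w}_{MAP}) - \mathbf{g}^T \mathbf{w}_{MAP}$ gives the result below.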
Using this property of Gaussian distributions, we see that the predictive distribution under the linearization becomes $$p(t | \mathbf{x}, D, \alpha, \beta) = N(t \,|\, y(\mathbf{x}, \mathbf{w}_{MAP}), \sigma^2(\mathbf{x}))$$ where $\sigma^2(\mathbf{x}) = \beta^{-1} + \mathbf{g}^T \mathbf{A}^{-1} \mathbf{g}$. The first term is the intrinsic noise on the target, and the second term reflects the remaining uncertainty in $\mathbf{w}$.
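The variance formula is cheap to evaluate numerically. As a quick illustration (all values below are made up for the example), with a full matrix $\mathbf{A}$ it can be computed with a linear solve; the implementation later simplifies this further by using a diagonal $\mathbf{A}$.

import numpy as np

# Illustrative values only (not taken from the model above)
beta = 25.0                            # noise precision
alpha = 0.1                            # prior precision
g = np.array([0.3, -1.2, 0.8])         # gradient of y(x, w) at w_MAP
H = np.diag([2.0, 0.5, 1.5])           # Hessian of the sum-of-squares error
A = alpha * np.eye(3) + beta * H       # A = alpha * I + beta * H

# sigma^2(x) = 1 / beta + g^T A^{-1} g
variance = 1.0 / beta + g @ np.linalg.solve(A, g)
print(variance)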
Diagonal approximation of the Hessian $\mathbf{H}$
In this section, we will look at one way to approximate the Hessian $\mathbf{H}$, called the diagonal approximation. This method simply sets the off-diagonal elements to zero, so that the inverse of the resulting matrix is trivial to compute.
Let us consider an error function that consists of a sum of terms, one for each pattern in the dataset; for the regression case, this is the squared error. The Hessian can then be obtained by considering one pattern at a time and summing the results over all patterns. For a single term $E_n$, we have $$\cfrac{\partial^2 E_n}{\partial w_{ji}^2} = \cfrac{\partial^2 E_n}{\partial a_{j}^2} z_i ^ 2$$ where $z_i$ is the input carried by the weight $w_{ji}$ and $a_j$ is the activation of unit $j$. To calculate the first factor on the right-hand side, first note that $$\cfrac{\partial E_n}{\partial a_j} = h'(a_j) \sum_{k} w_{kj} \cfrac{\partial E_n}{\partial a_k}$$ where the sum runs over the units $k$ in the next layer of the network. Taking the derivative again with respect to $a_j$ and neglecting the off-diagonal elements of the Hessian, we obtain $$\cfrac{\partial^2 E_n}{\partial a_{j}^2} = h'(a_j)^2 \sum_{k} w_{kj}^2 \cfrac{\partial^2 E_n}{\partial a_k^2} + h''(a_j) \sum_{k} w_{kj} \cfrac{\partial E_n}{\partial a_k}$$ With this recursion, we can obtain the approximate Hessian using an algorithm similar to back-propagation.
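For the network implemented in the next section, which has a single linear output unit and per-pattern squared error $E_n = \cfrac{1}{2} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2$, the recursion is seeded at the output unit with $$\cfrac{\partial E_n}{\partial a_{\text{out}}} = y(\mathbf{x}_n, \mathbf{w}) - t_n, \qquad \cfrac{\partial^2 E_n}{\partial a_{\text{out}}^2} = 1$$ since the output activation $a_{\text{out}}$ is itself the prediction. These are exactly the quantities used in the _get_hessian method below.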
Python Implementation
We will implement a simple two-layer Bayesian neural network with a sigmoid activation function to fit a sinusoidal function. The full code is available at https://github.com/SonPhatTranDeveloper/bayesian-neural-network
Below is the implementation of the Bayesian neural network with the diagonal Hessian approximation.
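The listing relies on NumPy and on three sigmoid helpers (sigmoid, sigmoid_derivative and sigmoid_second_derivative) that are not shown in it; a minimal version of these helpers, assumed here, is sketched first (the repository may define them differently).

import numpy as np


def sigmoid(x):
    """Logistic sigmoid activation h(x)."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_derivative(x):
    """First derivative h'(x) = h(x) (1 - h(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def sigmoid_second_derivative(x):
    """Second derivative h''(x) = h(x) (1 - h(x)) (1 - 2 h(x))."""
    s = sigmoid(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)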
class BayesianNeuralNetwork:
    def __init__(self, input_size, hidden_size, alpha, beta):
        """
        Create a two-layer neural network
        NOTE:
        - The network does not include any bias b
        - The network uses the sigmoid activation function
        :param input_size: size of the input vector
        :param hidden_size: size of the hidden layer
        :param alpha: the prior precision of W
        :param beta: the likelihood precision of t
        """
        # Cache the sizes
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Cache the prior and noise precisions
        self.alpha = alpha
        self.beta = beta

        # Create the layers (weights drawn from a standard normal)
        self.W1 = np.random.normal(size=(self.input_size, self.hidden_size))
        self.W2 = np.random.normal(size=(self.hidden_size, 1))

        # Create a cache for intermediate values used in back-propagation
        self.cache = {}
    def forward(self, x_train, y_train):
        """
        Perform the forward pass of the neural network
        :param x_train: the training inputs of the neural network
        :param y_train: the training targets
        :return: the output of the neural network and the value of the objective
        """
        # Calculate the output of the first layer
        a1 = x_train @ self.W1
        z1 = sigmoid(a1)

        # Calculate the output of the second (linear) layer
        a2 = z1 @ self.W2

        # Cache the values
        self.cache = {
            "x_train": x_train,
            "y_train": y_train,
            "a1": a1,
            "z1": z1,
            "a2": a2
        }

        # Calculate the regularized error (the negative log posterior up to a constant)
        score = (self.beta / 2) * np.sum((y_train.reshape(-1) - a2.reshape(-1)) ** 2) + \
            (self.alpha / 2) * (np.sum(self.W1 ** 2) + np.sum(self.W2 ** 2))
        return a2, score
    def predict(self, x_test_points, x_train, y_train):
        """
        Make a full prediction (mean and variance) for each test point
        :param x_test_points: the test data points
        :param x_train: the training inputs (used to evaluate the Hessian)
        :param y_train: the training targets (used to evaluate the Hessian)
        :return: the predictive means and variances
        """
        # Hold the means and variances
        means, variances = [], []

        # Get the diagonal Hessian w.r.t W1 and W2 and form the diagonal of A = alpha * I + beta * H
        h_W1, h_W2 = self._get_hessian(x_train, y_train)
        h_W = np.append(h_W1, h_W2)
        A = self.alpha * np.ones(h_W1.shape[0] + h_W2.shape[0]) + self.beta * h_W

        # Go through each test point
        for x_test in x_test_points:
            # Treat the test point as a single-row batch
            x_test = x_test.reshape(1, -1)

            # Calculate the mean and the gradient g of the output w.r.t the weights
            y, d_W1, d_W2 = self._get_gradients_w_map(x_test)

            # Calculate the variance: sigma^2 = 1 / beta + g^T A^{-1} g (A is diagonal)
            d_W = np.append(d_W1, d_W2)
            variance = (1 / self.beta) + np.dot(d_W, d_W / A)

            # Append the mean and variance
            means.append(y[0])
            variances.append(variance)
        return np.array(means).reshape(-1), np.array(variances).reshape(-1)
    def _get_gradients_w_map(self, x_test):
        """
        Get the gradients of the network output with respect to the weights,
        evaluated at the current (MAP) weights
        :param x_test: the test data point
        :return: the prediction, and the gradients with respect to W1 and W2
        """
        # Calculate the output of the first layer
        a1 = x_test @ self.W1
        z1 = sigmoid(a1)

        # Calculate the output of the second layer
        a2 = z1 @ self.W2

        # The derivative of the output y = a2 with respect to a2 is one
        d_a2 = np.ones_like(a2)

        # Calculate the gradient w.r.t z1
        d_z1 = d_a2 @ self.W2.T

        # Calculate the gradient w.r.t W2
        d_W2 = z1.T @ d_a2

        # Calculate the gradient w.r.t a1
        d_a1 = d_z1 * sigmoid_derivative(a1)

        # Calculate the gradient w.r.t W1
        d_W1 = x_test.T @ d_a1
        return a2, d_W1, d_W2
    def _get_hessian(self, x_train, y_train):
        """
        Get the diagonal approximation of the Hessian of the error w.r.t W1 and W2
        :param x_train: training features
        :param y_train: training labels
        :return: the flattened diagonal Hessian w.r.t W1 and W2
        """
        # Perform forward propagation to populate the cache
        self.forward(x_train, y_train)

        # Get the (MAP) weights
        W1, W2 = self.W1, self.W2

        # Get the cached variables
        x_train, y_train, a1, z1, a2 = self.cache["x_train"], self.cache["y_train"], \
            self.cache["a1"], self.cache["z1"], self.cache["a2"]

        # Get the number of training examples
        N = x_train.shape[0]

        # Diagonal Hessian w.r.t W2: d^2 E_n / da2^2 = 1, so the entries are z1^2 summed over patterns
        h_W2 = np.sum(z1 ** 2, axis=0)

        # Diagonal Hessian w.r.t W1 via the back-propagation recursion
        derivative_a = sigmoid_derivative(a1)
        second_derivative_a = sigmoid_second_derivative(a1)
        W2_repeat = np.repeat(W2.reshape(1, -1), N, axis=0)
        h_a1 = ((derivative_a ** 2) * (W2_repeat ** 2) +
                second_derivative_a * W2_repeat * (a2.reshape(-1) - y_train.reshape(-1)).reshape(-1, 1))
        h_W1 = (x_train ** 2).T @ h_a1

        # Return the flattened diagonal Hessians w.r.t W1 and W2
        return h_W1.reshape(-1), h_W2.reshape(-1)
    def backward(self, learning_rate):
        """
        Perform back-propagation and one gradient-descent update of the weights
        NOTE: forward() must be called first to populate the cache
        :param learning_rate: learning rate of the gradient-descent step
        :return: None
        """
        # Get cached values
        x_train, y_train, a1, z1, a2 = self.cache["x_train"], self.cache["y_train"], \
            self.cache["a1"], self.cache["z1"], self.cache["a2"]

        # Calculate the gradient w.r.t a2
        d_a2 = self.beta * (a2 - y_train.reshape(-1, 1)).reshape(-1, 1)

        # Calculate the gradient w.r.t z1
        d_z1 = d_a2 @ self.W2.T

        # Calculate the gradient w.r.t W2 (including the prior term)
        d_W2 = z1.T @ d_a2 + self.alpha * self.W2

        # Calculate the gradient w.r.t a1
        d_a1 = d_z1 * sigmoid_derivative(a1)

        # Calculate the gradient w.r.t W1 (including the prior term)
        d_W1 = x_train.T @ d_a1 + self.alpha * self.W1

        # Gradient-descent update
        self.W1 -= learning_rate * d_W1
        self.W2 -= learning_rate * d_W2
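Finally, here is a minimal usage sketch. The data generation, hyperparameters and number of gradient-descent steps are illustrative assumptions and will likely need tuning.

# Noisy sinusoidal training data (a single input feature)
rng = np.random.default_rng(0)
x_train = np.linspace(-np.pi, np.pi, 50).reshape(-1, 1)
y_train = np.sin(x_train) + rng.normal(scale=0.1, size=x_train.shape)

# Fit w_MAP by gradient descent on the regularized sum-of-squares error
model = BayesianNeuralNetwork(input_size=1, hidden_size=10, alpha=0.1, beta=25.0)
for step in range(50000):
    _, score = model.forward(x_train, y_train)
    model.backward(learning_rate=1e-5)

# Predictive mean and one-standard-deviation band on a test grid
x_test = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
means, variances = model.predict(x_test, x_train, y_train)
std = np.sqrt(variances)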