Convolutional Neural Networks From Scratch

May 8, 2023


Understand Convolutional Neural Networks and the math behind Machine Learning (ML) and Artificial Intelligence (AI) using plain Python (code posted below). For this, we only need the NumPy library! This is a continuation of the previous article, Neural Networks From Scratch, which describes how to obtain the MNIST dataset and get started with neural networks. For your convenience, you can download the MNIST dataset ("mndata.npz") below.

mndata.npz

The machine learning coding process has been mostly automated through Python libraries such as TensorFlow and PyTorch. These libraries automate the backpropagation process, which allows for a higher-level view of machine learning models; instead of coding the math behind machine learning, these libraries let people think more about the techniques, and combinations of techniques, they would like to use in their model. For certain tasks it is common to use around a hundred layers.

First, let's go through common techniques and the mathematics of convolutional neural networks, which are much more common in practice than basic neural networks. The basic idea of convolutions in machine learning is to iterate through the image rows and columns with multiple filters, which are the weights. Active pixels at a given image location produce a higher score for a filter that matches them, and each filter tracks its scores across all locations of the image. We will be coding this multiplication from scratch: it is simply a series of multiplications between corresponding positions in two matrices, followed by a sum across the correct dimensions or axes.

Suppose a single 3 x 3 filter contains values of 1, straight diagonally from the top left of the filter all the way to the bottom right of the filter. All the other values remain 0. 

Filter:

[[1, 0, 0],

 [0, 1, 0],

 [0, 0, 1]]

This filter is superimposed on a 3 x 3 portion of the input image. Starting at the 0th row and column of the input image, the multiplication is applied: a multiplication between shape [3, 3] and shape [3, 3] gives shape [3, 3], and a sum across all 9 values then gives 1 value. This value represents a score for that diagonal shape being present in that portion of the input image. The filter is then multiplied again starting at the 0th row and 1st column of the input image.
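
A minimal numeric sketch of one filter position, assuming a made-up 3 x 3 image patch:

import numpy as np
filt = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])
patch = np.array([[0.9, 0.1, 0.0],            # hypothetical pixel intensities
                  [0.2, 0.8, 0.1],
                  [0.0, 0.1, 0.7]])
score = (filt * patch).sum()                  # multiply corresponding positions, then sum all 9 products
print(score)                                  # about 2.4 -> high score: the diagonal shape is present in this patch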

So the output of this filter iterated over the image can be thought of as a 2D array which represents the score of that diagonal shape being present at each location in the image. This is just one filter; in practice, multiple filters are iterated over the image. Each handwritten digit classification has a pattern of these shapes and shape locations across the input image. This is the pattern that convolutional neural networks learn in order to classify inputs. It's common to stack additional convolutional layers. A stacked layer functions the same as the first convolutional filter layer, except that it scores combinations of shape scores across the input, instead of the raw shapes of an image.

It is common to use a technique called “padding”, which adds a border of zero values around the input image. The benefit is that shapes present at the edge of the raw input may not be centered well within any filter position; padding allows the filters to see the edges of the input as consistently as the rest of the image. An additional benefit of padding is that it can allow the input and output shapes to be equivalent. It is also important to note that rows and columns can be skipped when iterating through the image; instead of shifting the filter to the next row or column, the filter can jump two or three rows or columns over. This is called “stride”.
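
A small illustration of both ideas, assuming NumPy's np.pad and a 28 x 28 input:

import numpy as np
image = np.zeros((28, 28))
padded = np.pad(image, 1)                     # adds a one-pixel border of zeros on every side
print(padded.shape)                           # (30, 30)
print((padded.shape[0] - 3) // 1 + 1)         # 28 -> with a 3 x 3 filter and stride 1, the output matches the input size
print((padded.shape[0] - 3) // 2 + 1)         # 14 -> a stride of 2 skips every other position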

These filters are the weights that will be trained and will have corresponding biases. The model therefore figures out which shapes are relevant to the classification task just by using math. We only need to define the parameters for the model to use: the size of the filters, how many filters we want, and the step size or “stride”.
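
As a preview, these choices mirror the parameter setup for the first kernel in the full code at the end of the article; the exact sizes are choices, not requirements (the second kernel is declared the same way with an extra filter dimension):

import numpy as np
filts, fsh, fsw = 16, 8, 8                    # number of filters and filter height/width for the first kernel
kern = np.random.rand(filts, fsh, fsw) - .5   # 16 randomly initialized 8 x 8 filters (the trainable weights)
kb = np.zeros(filts)                          # one trainable bias per filter
step = 1                                      # stride of the convolution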

The forward propagation is quite self-explanatory since the example code is written from scratch, though the process is fairly involved given all the filters. In summary, the first layer is composed of a nested for loop over the output image rows and columns, whose sizes are determined by the formula: output row/column size = (input row/column size - filter row/column size) // step size + 1. This applies to both rows and columns. The “//” refers to integer division, where the quotient is rounded down to the nearest whole number. We are able to multiply the entire kernel, including all of its filters, against one image portion in a single step. This results in one value for each of the filters; the filters are then iterated over all other portions of the image.
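
Applying this formula to the first convolutional layer in the code below (28 x 28 MNIST images, 8 x 8 filters, step size 1):

imh, fsh, step = 28, 8, 1
rsh = (imh - fsh) // step + 1                 # integer division rounds down
print(rsh)                                    # 21 output rows (and likewise 21 output columns)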

The output of the first layer is 3D due to the filters of the first kernel. It is easiest to think about this output as multiple images of the same size, with the number of images corresponding to the number of filters specified for the first kernel. A single filter of the second kernel must have the same depth as the first convolutional layer output. For example, if the first kernel has 16 filters, a single filter in the second kernel may be of shape [16, 3, 3]. This means there are 16 filters of shape [3, 3] inside a single filter of the second kernel. This single [16, 3, 3] filter is multiplied element-wise across [16, 3, 3] portions of the output of the first layer. This [16, 3, 3] filter can be repeated 32 times to create a 4D kernel of shape [32, 16, 3, 3]. Luckily, because the input is the same across all 32 filters in the second kernel, all 32 values can be computed in one step: NumPy broadcasts an element-wise multiplication when both arrays have the same number of dimensions and matching sizes in every dimension, except where one array has size 1. After this multiplication, a sum across the [16, 3, 3] axes is computed for each of the 32 filters, giving 32 values for one portion of the image. The resulting output row and column sizes are calculated the same way as in the first layer: output row/column size = (input row/column size - filter row/column size) // step size + 1. Though the kernel of the second layer is 4D, the output is 3D, consisting of [filter, row, column] dimensions. An activation function such as ReLU is commonly applied after each convolutional layer, though the bare-bones code below leaves it out to keep the backpropagation as simple as possible.
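
A minimal sketch of one convolution step for the second kernel, using the 32-filter example from this paragraph (the full code below instead uses 8 filters of shape [16, 4, 4]):

import numpy as np
kern2 = np.random.rand(32, 16, 3, 3)          # 32 filters, each of shape [16, 3, 3]
patch = np.random.rand(16, 3, 3)              # one [16, 3, 3] portion of the first-layer output
scores = (kern2 * patch.reshape(1, 16, 3, 3)).sum(axis=(1, 2, 3))
print(scores.shape)                           # (32,) -> one score per filter for this position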

After the two convolutional layers, we apply a fully connected layer in order to transform the convolutional output into the classification labels. This process is much the same as the base example in Neural Networks From Scratch, except that the input has 3 dimensions. Once the output is converted to the same shape as the labels, a softmax output activation is applied. This represents the predicted probability score for each of the label classifications for a given input image.
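
A minimal sketch of that final step, assuming x holds the 10 fully connected output scores for one image (this mirrors the softmax line in the code below):

import numpy as np
x = np.random.rand(10)                        # one score per digit class
y = np.exp(x) / np.sum(np.exp(x))             # softmax: predicted probability for each class
print(y.sum())                                # probabilities sum to 1.0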

For this article, we will be using cross-entropy loss as described by Ahlad Kumar. This treats the gradient of the loss differently for the incorrect label classifications than for the correct label classification. Although less intuitive, it outperforms the more mathematically intuitive derivative used in the previous article. This brings up a good point: results matter more than intuitive procedures, which is why it is used in this example. The fully connected layer is backpropagated much the same as in the Neural Networks From Scratch article.
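
As a quick sanity check of the derivative used in the code below: combining the softmax derivative with dLdy = 1 / y[label] collapses to the familiar (one-hot label) - y form. A small sketch, assuming a 10-class output x and correct label c:

import numpy as np
x, c = np.random.rand(10), 3
y = np.exp(x) / np.sum(np.exp(x))
dydx = -np.exp(x)[c] * np.exp(x) / np.sum(np.exp(x))**2                             # derivative of y[c] w.r.t. every x
dydx[c] = np.exp(x)[c] * (np.sum(np.exp(x)) - np.exp(x)[c]) / np.sum(np.exp(x))**2
dLdx = (1 / y[c]) * dydx                                                            # chain rule with dLdy = 1 / y[c]
onehot = np.zeros(10); onehot[c] = 1
print(np.allclose(dLdx, onehot - y))                                                # True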

After the fully connected layer is backpropagated, we move to the convolutional layers. We must distinguish between the output of a convolutional layer and the kernel itself. It can be thought of like this:

inputimage * kern1 = kern1output;

kern1output * kern2 = kern2output;

Where the multiplication symbol signifies a convolution, which is itself a series of multiplications and additions. Each step is written out in the code, so this article will only cover the highlights of the backpropagation process.

Backpropagation of the derivative of the loss with respect to the second kernel is relatively straightforward. “dLdkk” is convolved over “dkkdkern2”, which is simply the output of the first convolutional layer. Though we forward propagated the convolutional layers using a given step size, the backpropagation here does not apply that step size again: the step is already incorporated into the “dLdkk” term, since we are convolving the kern2output-shaped “dLdkk” over the kern1output-shaped “dkkdkern2”. If we use the output dimension formula without the step size we get:

output row/column size = input row/column size - filter row/column size + 1.

Here, the input is “dkkdkern2” and the filter is “dLdkk”, and we use the row and column sizes from those matrices to calculate the output rows and columns. The difference between the kern2output and kern1output sizes corresponds to the kern2 filter row and column size. We will be iterating over the number of filters in the second convolutional layer output in addition to iterating over the rows and columns. For each filter dimension in “dLdkern2”, we element-wise multiply the corresponding filter dimension of “dLdkk”, which is of shape [1, rsh2, rsw2], with the convolutional portion of “dkkdkern2”, which is of shape [kern1filts, rsh2, rsw2]. Once this multiplication is complete, we do a summation over the rsh2 and rsw2 dimensions (axis = 1, 2). This gives us “kern1filts” values for the first convolution position of the first filter dimension in kern2filts. This process is repeated across all portions of the input and all kern2filts dimensions to produce a shape of [kern2filts, kern1filts, fsh2, fsw2].
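
A shape-only sketch of one such position, assuming the sizes from the code below (filts = 16 first-kernel filters and rsh2 = rsw2 = 9 second-layer output rows and columns):

import numpy as np
filts, rsh2, rsw2 = 16, 9, 9
dLdkk_f = np.random.rand(rsh2, rsw2)                 # backpropagated loss for one second-kernel filter
portion = np.random.rand(filts, rsh2, rsw2)          # matching window of the first-layer output (dkkdkern2)
grad = (dLdkk_f.reshape(1, rsh2, rsw2) * portion).sum(axis=(1, 2))
print(grad.shape)                                    # (16,) -> one gradient value per first-kernel filter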

Due to the chain rule, in order to calculate dLdkern we must first calculate the derivative of the loss with respect to the kern1output (“dLdk”). For this step, we must essentially reconstruct an array of the input's size from matrices with the same sizes as the convolutional filter and the convolutional output, which are both smaller matrices. This is done by reverse engineering the convolutional layer. Recall that in the forward propagation step, a kernel or filter is convolved across portions of an input image. When the kernel is first convolved over the first portion of the image, certain input pixels activate certain kernel pixels. Yet some of those same input pixels activate different portions of the kernel when the kernel is moved to the next portion of the image. In this step we are trying to calculate every input pixel's contribution to the convolutional output. Therefore, we must keep track of every input pixel and every interaction it had with any portion of the kernel. We essentially sum up all of these input pixel contributions after multiplying them by the backpropagated loss which each contribution led to for a given filter dimension.
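
A minimal one-dimensional sketch of this idea, assuming a made-up length-5 input, a length-3 filter, and a stride of 1 (the code below does the same accumulation in 2D with its += line):

import numpy as np
filt = np.random.rand(3)
dLdout = np.random.rand(3)                    # backpropagated loss at each of the 3 output positions
dLdinput = np.zeros(5)
for o in range(3):                            # output position o was produced from input[o : o + 3]
    dLdinput[o:o + 3] += dLdout[o] * filt     # add that position's loss contribution back onto those input pixels
print(dLdinput)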

Once “dLdk” is computed, “dLdkern” can be computed much the same way as “dLdkern2”. For each convolutional layer, the biases take the shape of a vector whose length is the number of filters in that layer. These biases are added at each convolution step through the image. Since the biases are stand-alone variables introduced by an addition operation, the backpropagation process for the biases is the same as in the Neural Networks From Scratch article. Our chain rule derivations only need to go as far as the trainable variables, which in this case are the first kernel and its biases, the second kernel and its biases, and the fully connected weights and bias (kern, kb, kern2, kb2, w, and b in the code).


# this code is a bare-bones walk-through of 2-layer CNN forward and backward propagation from scratch

import numpy as np

mnistlocation = "INSERT MNIST FILE LOCATION" # For example: "/Users/enrichmentcap/Downloads/mndata.npz"

trains, tests = 1000, 1000 #depends on computer processing speed. If computer isn't fast, reduce size of training and test dataset

trainimages, trainlabels, testimages, testlabels = np.load(mnistlocation)['trainimages'][:trains] / 255, np.load(mnistlocation)['trainlabels'][:trains], np.load(mnistlocation)['testimages'][:tests] / 255, np.load(mnistlocation)['testlabels'][:tests]


##from tensorflow.keras.datasets import mnist #uncomment if using tensorflow library to retrieve dataset

##(trainimages, trainlabels), (testimages, testlabels) = mnist.load_data()

##trainimages, testimages = trainimages[:trains]/255, testimages[:tests]/255


np.random.seed(0)

classes = len(np.unique(trainlabels))

imw = trainimages.shape[2] 

imh = trainimages.shape[1]

lr = .0003


fsw = 8 #kernel1 width

fsh = 8 #kernel1 height

fsw2 = 4 #kernel2 width

fsh2 = 4 #kernel2 height

filts =  16 #kernel1 number of filters

filts2 = 8 #kernel2 number of filters

kern = np.random.rand(filts, fsh, fsw) - .5

kb = np.zeros(filts)

kern2 = np.random.rand(filts2, filts, fsh2, fsw2) - .5

kb2 = np.zeros(filts2)

step = 1 #step sizes of kernel convolutions

step2 = 2

rsw = (imw - fsw) // step + 1 #resulting shape width of kernel1 convolution

rsh = (imh - fsh) // step + 1 #resulting shape height of kernel1 convolution

rsw2 = (rsw - fsw2) // step2 + 1 #resulting shape width of kernel2 convolution

rsh2 = (rsh - fsh2) // step2 + 1 #resulting shape height of kernel2 convolution

w = np.random.rand(classes, filts2, rsh2, rsw2) - .5

b = np.random.rand(classes) - .5


for i in range(trains):

    xx, label, label[trainlabels[i]] = trainimages[i], np.zeros(classes), 1

    k = np.zeros((filts, rsh, rsw))

    kk = np.zeros((filts2, rsh2, rsw2))

    for j in range(rsh):

        for jj in range(rsw):

            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb #k.shape = [filts,rsize,rsize]

    for j0 in range(rsh2):

        for jj0 in range(rsw2):

            kk[:,j0,jj0] = (kern2 * k[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsize2, rsize2)

    x = (w * kk.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b


    y = np.exp(x) / np.sum(np.exp(x))

    dydx = -np.exp(x)[trainlabels[i]] * np.exp(x) / np.sum(np.exp(x))**2

    dydx[trainlabels[i]] = np.exp(x)[trainlabels[i]] * (np.sum(np.exp(x)) - np.exp(x)[trainlabels[i]]) / np.sum(np.exp(x))**2

    dLdy = 1 / y[trainlabels[i]]

    

    dLdx = dLdy * dydx

    dxdw, dxdb = kk, 1

    dLdw = dLdx.reshape(classes,1,1,1) * dxdw.reshape(1, filts2, rsh2, rsw2)

    dLdb = dLdx * dxdb

    w = w + dLdw * lr

    b = b + dLdb * lr


    dxdkk = w       #(classes, filts2*rsize2**2)

    dLdkk = (dLdx.reshape(classes, 1, 1, 1) * dxdkk).sum(axis=0)#dLdkk represents all class loss integrated into appropriate positions in kk output; meaning which positions in kk led to more loss or error

    dLdkern2 =  np.zeros((filts2,filts,fsh2,fsw2))

    dkkdkern2 = k

    dkkdkb2 = 1

    

    # (filts * rsize2**2,) x (filts * rsize2**2,) = (filts, fsize**2);
    # k.size is backpropagated with k.size - fsize,
    # therefore the step cannot be backpropagated; the step should be incorporated in the dLdkk term, even if the step is bigger than the filter,
    # because the dLdkk term must cover all pixels in k since filter size + dLdkk size = k size regardless of step size.
    # dkkdkern2 = k, where every value of k already incorporates the step size; when backpropagating there is no need to apply the same stride as in forward propagation.


    for f in range(filts2):

        for j000 in range(fsh2):

            for jj000 in range(fsw2):

                dLdkern2[f, :, j000, jj000] = (dLdkk[f].reshape(1,rsh2,rsw2) * dkkdkern2[:,j000:j000+rsh2,jj000:jj000+rsw2]).sum(axis=(1,2))     

    kern2 = kern2 + dLdkern2 * lr

    dLdkb2 = dLdkk.sum(axis=(1,2)) * dkkdkb2

    kb2 = kb2 + dLdkb2 * lr


    dkkdk = kern2 # shape [filts2, filts * fsize2**2]

    dLdk = np.zeros((filts, rsh, rsw))


    # for every pixel in k, sum the loss over every kern2 interaction and the corresponding loss output;
    # start with an input pixel, determine which output positions that pixel participated in, and find the corresponding kern2 position of each interaction

    for ooo in range(filts):

        for o in range(rsh2):

            for oo in range(rsw2):

                dLdk[ooo,o*step2:o*step2+fsh2,oo*step2:oo*step2+fsw2] += (dLdkk[:,o,oo].reshape(filts2,1,1) * dkkdk[:,ooo,:,:]).sum(axis=0) #[filts2,] @ [filts2,fsize2,fsize2]

    dkdkern = xx

    dkdkb = 1

    dLdkern = np.zeros((filts,fsh,fsw))

    for j00 in range(fsh):

        for jj00 in range(fsw):

            dLdkern[:,j00,jj00] = (dkdkern[j00:j00+rsh, jj00:jj00+rsw].reshape(1,rsh,rsw) * dLdk).sum(axis=(1,2))

    kern = kern + dLdkern * lr

    dLdkb = dLdk.sum(axis=(1,2)) * dkdkb

    kb = kb + dLdkb * lr


checke = np.zeros(tests)               

for i in range(tests):

    xx = testimages[i]

    k = np.zeros((filts, rsh, rsw))

    kk = np.zeros((filts2, rsh2, rsw2))

    for j in range(rsh):

        for jj in range(rsw):

            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw]).sum(axis=(1,2)) + kb #k.shape = [filts,rsize,rsize] = [16,21,21]

    for j0 in range(rsh2):

        for jj0 in range(rsw2):

            kk[:,j0,jj0] = (kern2 * k[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsize2, rsize2)   

    x = (w * kk).sum(axis=(1,2,3)) + b 

    if testlabels[i] == np.argmax(x):

        checke[i] = 1

print(len(np.flatnonzero(checke==1))/tests)