ResNet with Backpropagation from Scratch

May 10, 2023


In this article, we are going to code a ResNet and its backpropagation from scratch using only NumPy in Python. ResNets have become widely used in mainstream ML projects. Previously, we discussed solutions to Exploding Gradients. However, Vanishing Gradients become an issue as more layers are added to ML models. Such is the case in object detection, where it is common to come across models with over a hundred layers. The issue is that "loss" or output "error" information can be lost during backpropagation.

Say that an ML model has 100 layers, where the first layer is the input image and the 100th layer is the object detection and classification output layer. The trainable convolution kernel in the second layer must be trained based on the "loss" or output "error" information derived in the 100th layer. In a basic CNN, this information has to travel through all the intermediate layers. If the values of the weights are normalized between -1 and 1 or 0 and 1, then fractions get multiplied by fractions, continuously shrinking the backpropagated values toward zero (e.g. 0.9 * 0.9 * 0.9 * ...). This is how gradients can eventually vanish, or tend toward 0. Residual Neural Networks (ResNets) solve this problem by using skip connections, which allow a layer's output to skip over intermediate layers and be added to a later layer. For example, the output of the convolution in the second layer of the 100-layer example can skip all the intermediate layers and be added to the 99th layer. It can be added through a trainable or untrainable weight matrix that transforms the second layer's output into a size compatible with the 99th layer. The good thing about this is that skip connections are fully differentiable and can therefore be backpropagated through.
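
To make the shrinking-product effect concrete, here is a minimal NumPy sketch (separate from the model below, using a made-up per-layer derivative of 0.9) that multiplies 100 local derivatives together and then shows how the additive form of a skip connection, y = f(x) + x, keeps a direct gradient path of 1 around the wrapped layers:

import numpy as np

layergrads = np.full(100, 0.9) #hypothetical local derivative of 0.9 at each of 100 layers

print(np.prod(layergrads)) #about 2.7e-05: the backpropagated signal has nearly vanished

fprime = 0.9 #hypothetical local derivative of the layer wrapped by a skip connection

print(fprime + 1) #1.9: since y = f(x) + x gives dy/dx = f'(x) + 1, the "+ 1" bypasses the shrinking product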

Though ResNets are meant for models with many layers, the code in this example has only 2 convolution layers: the output of the first convolution skips the 2nd convolution layer and is added to the fully connected dense layer output, right before the softmax activation function that produces the final classification output. In order for this to happen, the output of the first convolution needs to be transformed into a shape compatible with the fully connected dense layer output. We accomplish this with a constant weight matrix built from np.ones() and scaled by 0.001. This transformation matrix keeps constant values throughout training, since the purpose here is to understand how backpropagation works when using ResNet skip connections. This ResNet was inspired by this article and incorporates techniques used in prior articles such as convolutions, batching, random number initialization, activation functions, and the Adam optimizer. The commented code below should explain it all.
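
Before diving in, here is a rough, shape-only sketch of the forward pass built below, assuming the 28x28 MNIST inputs and the filter sizes defined in the code; conv1, conv2, relu, and softmax are shorthand for the explicit loops and formulas that follow, not actual functions:

#k   = conv1(image)              -> (16, 21, 21)  first convolution (8x8 filters, stride 1)
#kr  = relu(k)
#xs  = (ws * kr).sum()           -> (10,)         skip path: constant ws maps conv1 output to class scores
#kk  = conv2(kr)                 -> (8, 9, 9)     second convolution (4x4 filters, stride 2)
#kkr = relu(kk)
#x   = (w * kkr).sum() + b       -> (10,)         fully connected dense layer
#xo  = x + xs                                     skip connection added to the logits
#y   = softmax(xo)                                final class probabilities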

import numpy as np

mnistlocation = "INSERT MNIST FILE LOCATION" #you can download the file here

# For example: "/Users/enrichmentcap/Downloads/mndata.npz"

trains, tests = 60000, 10000 #depends on your computer's processing speed; if it isn't fast, reduce the size of the training and test datasets

mnistdata = np.load(mnistlocation)

trainimages, trainlabels = mnistdata['trainimages'][:trains] / 255, mnistdata['trainlabels'][:trains]

testimages, testlabels = mnistdata['testimages'][:tests] / 255, mnistdata['testlabels'][:tests]


##from tensorflow.keras.datasets import mnist #uncomment if using tensorflow library to retrieve dataset

##(trainimages, trainlabels), (testimages, testlabels) = mnist.load_data()

##trainimages, testimages = trainimages[:trains]/255, testimages[:tests]/255


np.random.seed(0)

classes = len(np.unique(trainlabels))

imw = trainimages.shape[2]

imh = trainimages.shape[1]

lr = .001


fsw = 8

fsh = 8

fsw2 = 4

fsh2 = 4

filts =  16

filts2 = 8

step = 1

step2 = 2

rsw = (imw - fsw) // step + 1

rsh = (imh - fsh) // step + 1

rsw2 = (rsw - fsw2) // step2 + 1

rsh2 = (rsh - fsh2) // step2 + 1
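
#rsw, rsh, rsw2, and rsh2 above follow the valid-convolution output size formula: (input size - filter size) // stride + 1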

##kern = np.random.normal(0, np.sqrt(1 / (filts*fsw*fsh)), (filts,fsw,fsh)) #Lecun Normal Initializer;

kern = np.random.normal(0, np.sqrt(2 / (filts*fsw*fsh + filts*rsw*rsh)), (filts,fsw,fsh)) #Glorot (Xavier) initializer

kb = np.zeros(filts)

##kern2 = np.random.normal(0, np.sqrt(1 / (filts2*filts*fsw2*fsh2)), (filts2,filts,fsw2,fsh2))

kern2 = np.random.normal(0, np.sqrt(2 / (filts2*filts*fsw2*fsh2 + filts2*rsw2*rsh2)), (filts2,filts,fsw2,fsh2))

kb2 = np.zeros(filts2)

##w = np.random.normal(0, np.sqrt(1 / (filts2*rsw2*rsh2)), (classes,filts2,rsw2,rsh2))

w = np.random.normal(0, np.sqrt(2 / (filts2*rsw2*rsh2 + classes)), (classes,filts2,rsw2,rsh2))

b = np.zeros(classes)


ws = np.ones((classes, filts, rsh, rsw)) * .001 #constant (untrainable) weights for the skip connection; transforms the first convolution's output to the class dimension


b1 = .9

b2 = .999

eps = 1e-7

mw = 0

vw = 0

mb = 0

vb = 0

mk = 0

vk = 0

mk2 = 0

vk2 = 0

mkb = 0

vkb = 0

mkb2 = 0

vkb2 = 0


epochs = 5

bs = 100

nbatch = trains // bs

loss = np.zeros(epochs*nbatch*bs)
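#note: the loss array above is allocated but never written to in the code below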


lowerclip = 0 #for 'clip' function

upperclip = 1


for e in range(epochs):

    print('epoch: %d'%e)

    np.random.seed(e)

    ibatch = np.random.permutation(trainimages[:trains])

    np.random.seed(e)

    lbatch = np.random.permutation(trainlabels[:trains]) #resetting the seed gives the labels the same shuffle order as the images

    for nb in range(nbatch):

        bdLdw = np.zeros((bs, classes, filts2, rsh2, rsw2))

        bdLdb = np.zeros((bs, classes))

        bdLdkern2 = np.zeros((bs, filts2, filts, fsh2, fsw2))

        bdLdkb2 = np.zeros((bs, filts2))

        bdLdkern = np.zeros((bs, filts, fsh, fsw))

        bdLdkb = np.zeros((bs, filts))

        for i in range(bs):

            xx = ibatch[nb*bs + i] #input image for this training sample

            label = np.zeros(classes)

            label[lbatch[nb*bs + i]] = 1 #one-hot encoded target for this sample

            

            k = np.zeros((filts, rsh, rsw))

            for j in range(rsh):

                for jj in range(rsw):

                    k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb


            afk = {'none': k, 'sig' : 1/(1+np.exp(-k)), 'tanh': (np.exp(k)-np.exp(-k))/(np.exp(k)+np.exp(-k)), 'relu': np.maximum(0, k), 'clip': np.minimum(np.maximum(lowerclip, k), upperclip)}     #leak a=1 is the same as none

            kr = afk['relu']


            xs = (ws * kr.reshape(1,filts,rsh,rsw)).sum(axis = (1,2,3)) #xs.shape = (classes,)


            kk = np.zeros((filts2, rsh2, rsw2))

            for j0 in range(rsh2):

                for jj0 in range(rsw2):

                    kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsh2, rsw2)


            afkk = {'none': kk, 'sig' : 1/(1+np.exp(-kk)), 'tanh': (np.exp(kk)-np.exp(-kk))/(np.exp(kk)+np.exp(-kk)), 'relu': np.maximum(0, kk), 'clip': np.minimum(np.maximum(lowerclip, kk), upperclip)}    #leak a=1 is the same as none

            kkr = afkk['relu']


            x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b #x.shape = (classes,)


            xo = x + xs #skip connection occurs here; information from first convolution gets added here; this differentiable equation gets backpropped

            

            y = np.exp(xo) / np.sum(np.exp(xo))
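            #y above is the softmax of the logits xo; for larger models, subtracting xo.max() from xo before exponentiating is a common way to avoid overflow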

            

            dydxo = -np.exp(xo)[lbatch[nb*bs + i]] * np.exp(xo) / np.sum(np.exp(xo))**2

            dydxo[lbatch[nb*bs + i]] = np.exp(xo)[lbatch[nb*bs + i]] * (np.sum(np.exp(xo)) - np.exp(xo)[lbatch[nb*bs + i]]) / np.sum(np.exp(xo))**2

            dLdy = 1 / y[lbatch[nb*bs + i]] #loss function: ln(y[lbatch[nb*bs + i]])


            dLdxo = dLdy * dydxo


            dxodx = 1 #derivative of skip connection (xo = x + xs) with respect to x

            dxodxs = 1 #derivative of skip connection (xo = x + xs) with respect to xs


            dLdxs = dLdxo * dxodxs #backpropagation path1: chain rule from calculus / differential equations; backpropagates through the xs path from (xo = x + xs)

            dxsdkr = ws


            dLdx = dLdxo * dxodx #backpropagation path2: chain rule from calculus / differential equations; backpropagates through the x path from (xo = x + xs)

            dxdw, dxdb = kkr, 1


            dLdw = dLdx.reshape(classes,1,1,1) * dxdw.reshape(1, filts2, rsh2, rsw2)

            dLdb = dLdx * dxdb

            bdLdw[i] = dLdw

            bdLdb[i] = dLdb


            dxdkkr = w       #w.shape = (classes, filts2, rsh2, rsw2)

            dLdkkr = (dLdx.reshape(classes, 1, 1, 1) * dxdkkr).sum(axis=0) #dLdkkr integrates the loss from every class into the corresponding positions of the kk output; meaning which positions in kk led to more loss or error


            afdkk = {'none': 1, 'sig' : kkr*(1-kkr), 'tanh': 1-kkr**2, 'relu': np.array(kk > 0, dtype = float), 'clip': np.array((kk > lowerclip) & (kk < upperclip), dtype = float)}

            dkkrdkk = afdkk['relu']

            

            dLdkk = dLdkkr * dkkrdkk


            dkkdkern2 = kr

            dLdkern2 =  np.zeros((filts2,filts,fsh2,fsw2))

            for f in range(filts2):

                for j000 in range(fsh2):

                    for jj000 in range(fsw2):

                        dLdkern2[f, :, j000, jj000] = (dLdkk[f].reshape(1,rsh2,rsw2) * dkkdkern2[:,j000:j000+rsh2,jj000:jj000+rsw2]).sum(axis=(1,2))

            bdLdkern2[i] = dLdkern2

            

            dkkdkb2 = 1

            dLdkb2 = dLdkk.sum(axis=(1,2)) * dkkdkb2

            bdLdkb2[i] = dLdkb2


            dkkdkr = kern2 #kern2.shape = (filts2, filts, fsh2, fsw2)

            dLdkr = np.zeros((filts, rsh, rsw))

            for ooo in range(filts):

                for o in range(rsh2):

                    for oo in range(rsw2):

                        dLdkr[ooo,o*step2:o*step2+fsh2,oo*step2:oo*step2+fsw2] += (dLdkk[:,o,oo].reshape(filts2,1,1) * dkkdkr[:,ooo,:,:]).sum(axis=0) #[filts2,] @ [filts2,fsize2,fsize2]


            dLdkr = dLdkr + (dLdxs.reshape(classes,1,1,1) * dxsdkr).sum(axis=0) #the beauty of ResNet backpropagation is right here!!!

            #dLdkr is composed of both backpropagation path1 (dLdx * dxdkkr * dkkrdkk * dkkdkr) + backpropagation path2 (dLdxs * dxsdkr)

            #the information from backpropagation path2 does not get backpropagated through path1, and does not suffer diminishing or vanishing gradients.

            #all the Loss information from (dLdxs * dxsdkr) gets passed more directly to dLdkern which is used for training the first convolution (kern)


            afdk = {'none': 1, 'sig' : kr*(1-kr), 'tanh': 1-kr**2, 'relu': np.array(k > 0, dtype = float), 'clip': np.array((k > lowerclip) & (k < upperclip), dtype = float)}

            dkrdk = afdk['relu']

            

            dLdk = dLdkr * dkrdk

            

            dkdkern = xx

            dLdkern = np.zeros((filts,fsh,fsw))

            for j00 in range(fsh):

                for jj00 in range(fsw):

                    dLdkern[:,j00,jj00] = (dkdkern[j00:j00+rsh, jj00:jj00+rsw].reshape(1,rsh,rsw) * dLdk).sum(axis=(1,2))

            bdLdkern[i] = dLdkern

            

            dkdkb = 1

            dLdkb = dLdk.sum(axis=(1,2)) * dkdkb

            bdLdkb[i] = dLdkb


        lrt = lr * (1 - b2**(e*nbatch + nb + 1))**.5 / (1 - b1**(e*nbatch + nb + 1))        
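        #the lrt line above folds Adam's bias-correction factors, (1 - b2**t)**0.5 / (1 - b1**t) with t = e*nbatch + nb + 1, into the learning rate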

        

        mk = b1 * mk + (1 - b1) * bdLdkern.mean(axis=0)

        vk = b2 * vk + (1 - b2) * bdLdkern.mean(axis=0)**2

        kern = kern + lrt * mk / (vk**.5 + eps)


        mkb = b1 * mkb + (1 - b1) * bdLdkb.mean(axis=0)

        vkb = b2 * vkb + (1 - b2) * bdLdkb.mean(axis=0)**2

        kb = kb + lrt * mkb / (vkb**.5 + eps)


        mk2 = b1 * mk2 + (1 - b1) * bdLdkern2.mean(axis=0)

        vk2 = b2 * vk2 + (1 - b2) * bdLdkern2.mean(axis=0)**2

        kern2 = kern2 + lrt * mk2 / (vk2**.5 + eps)


        mkb2 = b1 * mkb2 + (1 - b1) * bdLdkb2.mean(axis=0)

        vkb2 = b2 * vkb2 + (1 - b2) * bdLdkb2.mean(axis=0)**2

        kb2 = kb2 + lrt * mkb2 / (vkb2**.5 + eps)


        mw = b1 * mw + (1 - b1) * bdLdw.mean(axis=0)                                    

        vw = b2 * vw + (1 - b2) * bdLdw.mean(axis=0)**2

        w = w + lrt * mw / (vw**.5 + eps)    


        mb = b1 * mb + (1 - b1) * bdLdb.mean(axis=0)

        vb = b2 * vb + (1 - b2) * bdLdb.mean(axis=0)**2

        b = b + lrt * mb / (vb**.5 + eps)



checke = np.zeros(tests)               

for i in range(tests):

    xx = testimages[i]

    k = np.zeros((filts, rsh, rsw))

    for j in range(rsh):

        for jj in range(rsw):

            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb

    afk = {'none': k, 'sig' : 1/(1+np.exp(-k)), 'tanh': (np.exp(k)-np.exp(-k))/(np.exp(k)+np.exp(-k)), 'relu': np.maximum(0, k), 'clip': np.minimum(np.maximum(lowerclip, k), upperclip)}     #leak a=1 is the same as none

    kr = afk['relu']

    kk = np.zeros((filts2, rsh2, rsw2))

    for j0 in range(rsh2):

        for jj0 in range(rsw2):

            kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsh2, rsw2)

    afkk = {'none': kk, 'sig' : 1/(1+np.exp(-kk)), 'tanh': (np.exp(kk)-np.exp(-kk))/(np.exp(kk)+np.exp(-kk)), 'relu': np.maximum(0, kk), 'clip': np.minimum(np.maximum(lowerclip, kk), upperclip)}     #leak a=1 is the same as none

    kkr = afkk['relu']

    x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b
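    #the skip term xs is omitted here; because ws is a constant matrix (all entries equal), xs adds the same value to every class score and does not change the argmax below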

    if testlabels[i] == np.argmax(x):

        checke[i] = 1

print(len(np.flatnonzero(checke==1))/tests)

The output of this code is the accuracy on the MNIST test dataset as a decimal fraction. The accuracy will not be as high as that of the CNNs we have coded from scratch, because ResNets are not meant for such small ML models. However, despite the skip connection, this model still demonstrates learning. If you can understand all the techniques used in this model from scratch, it may be a great time to move on to automated ML platforms that automate backpropagation and allow us to focus more on ML architectures.