ResNet with Backpropagation from Scratch
May 10, 2023
In this article, we are going to code a ResNet and its backpropagation from scratch using only NumPy in Python. ResNets have become widely used in mainstream ML projects. Previously, we discussed solutions to exploding gradients. However, vanishing gradients become an issue when adding more layers to ML models. Such is the case in object detection, where it is common to come across models with over a hundred layers. The issue is that "loss" or output "error" information can be lost during backpropagation.
Say an ML model has 100 layers, where the first layer is the input image and the 100th layer is the object detection and classification output layer. The trainable convolution kernel in the second layer must be trained based on the "loss" or output "error" information derived in the 100th layer. In a basic CNN, this information has to travel through all intermediate layers. If the weights are normalized between -1 and 1 or 0 and 1, then fractions get multiplied by fractions, continuously shrinking the backpropagated values towards zero (e.g. 0.9 * 0.9 * 0.9 * ...). This is how gradients can eventually vanish, or tend towards 0. Residual Neural Networks (ResNets) solve this problem with skip connections, which allow a layer's output to skip over intermediate layers and be added to a later layer. For example, the output of the convolution in the second layer in the 100 layer example can skip all the intermediate layers and be added to the 99th layer. It can be added through a trainable or untrainable weight matrix that transforms the second layer's output into a size compatible with the 99th layer. The good thing about this is that the skip connection is fully differentiable and can therefore be backpropagated through.
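To get a feel for the scale of the problem, here is a minimal NumPy sketch; the 0.9 per-layer factor and the 100-layer depth are illustrative assumptions, not values taken from the model below.
import numpy as np
factors = np.full(100, 0.9) #hypothetical per-layer gradient factors, all below 1 in magnitude
chained = np.prod(factors) #plain chain rule through 100 layers: the product collapses towards zero
print(chained) #about 2.7e-05, so the earliest layers receive almost no error signal
#with a skip connection xo = x + xs, the derivative of xo with respect to the skipped-from activation is 1,
#so part of the gradient bypasses the long product entirely
with_skip = chained + 1.0
print(with_skip) #about 1.0, dominated by the direct skip path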
Though ResNets are meant for models with many layers, the code in this example has only 2 convolution layers: the output of the first convolution skips the 2nd convolution layer and is added to the fully connected dense layer output, right before the softmax activation function that produces the final classification output. For this to work, the output of the first convolution needs to be transformed into a shape compatible with the fully connected dense layer output. We accomplish this by creating a constant weight matrix with np.ones() (scaled by a small constant in the code below). This transformation matrix keeps its values throughout training, since the purpose here is to understand how backpropagation works through ResNet skip connections. This ResNet was inspired by this article and incorporates techniques used in prior articles such as convolutions, batching, random number initialization, activation functions, and the Adam optimizer. The commented code below should explain it all.
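As a quick shape check before the full program, here is a minimal sketch of just the skip path. The shapes match the model below (a 28x28 MNIST image convolved with an 8x8 kernel at stride 1 gives a 16x21x21 output, and there are 10 classes), but the snippet itself is illustrative and uses random stand-ins for the real activations.
import numpy as np
classes, filts, rsh, rsw = 10, 16, 21, 21 #10 classes; first convolution output is (16, 21, 21)
kr = np.random.rand(filts, rsh, rsw) #stand-in for the relu'd first-convolution output
ws = np.ones((classes, filts, rsh, rsw)) * .001 #constant skip-connection weights
xs = (ws * kr.reshape(1, filts, rsh, rsw)).sum(axis=(1, 2, 3)) #xs.shape = (classes,)
x = np.random.rand(classes) #stand-in for the dense-layer logits
xo = x + xs #the skip connection: added right before the softmax
print(xs.shape, xo.shape) #(10,) (10,)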
import numpy as np
mnistlocation = "INSERT MNIST FILE LOCATION" #you can download the file here
# For example: "/Users/enrichmentcap/Downloads/mndata.npz"
trains, tests = 60000, 10000 #depends on computer processing speed. If computer isn't fast, reduce size of training and test dataset
mndata = np.load(mnistlocation)
trainimages, trainlabels, testimages, testlabels = mndata['trainimages'][:trains] / 255, mndata['trainlabels'][:trains], mndata['testimages'][:tests] / 255, mndata['testlabels'][:tests]
##from tensorflow.keras.datasets import mnist #uncomment if using tensorflow library to retrieve dataset
##(trainimages, trainlabels), (testimages, testlabels) = mnist.load_data()
##trainimages, testimages = trainimages[:trains]/255, testimages[:tests]/255
np.random.seed(0)
classes = len(np.unique(trainlabels))
imw = trainimages.shape[2]
imh = trainimages.shape[1]
lr = .001
fsw = 8
fsh = 8
fsw2 = 4
fsh2 = 4
filts = 16
filts2 = 8
step = 1
step2 = 2
rsw = (imw - fsw) // step + 1
rsh = (imh - fsh) // step + 1
rsw2 = (rsw - fsw2) // step2 + 1
rsh2 = (rsh - fsh2) // step2 + 1
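#with 28x28 MNIST images these work out to rsh = rsw = (28-8)//1 + 1 = 21 and rsh2 = rsw2 = (21-4)//2 + 1 = 9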
##kern = np.random.normal(0, np.sqrt(1 / (filts*fsw*fsh)), (filts,fsw,fsh)) #Lecun Normal Initializer;
kern = np.random.normal(0, np.sqrt(2 / (filts*fsw*fsh + filts*rsw*rsh)), (filts,fsw,fsh)) #Glorot (Xavier) Normal Initializer
kb = np.zeros(filts)
##kern2 = np.random.normal(0, np.sqrt(1 / (filts2*filts*fsw2*fsh2)), (filts2,filts,fsw2,fsh2))
kern2 = np.random.normal(0, np.sqrt(2 / (filts2*filts*fsw2*fsh2 + filts2*rsw2*rsh2)), (filts2,filts,fsw2,fsh2))
kb2 = np.zeros(filts2)
##w = np.random.normal(0, np.sqrt(1 / (filts2*rsw2*rsh2)), (classes,filts2,rsw2,rsh2))
w = np.random.normal(0, np.sqrt(2 / (filts2*rsw2*rsh2 + classes)), (classes,filts2,rsw2,rsh2))
b = np.zeros(classes)
ws = np.ones((classes, filts, rsh, rsw)) * .001 #constant, untrainable weights for the skip connection
b1 = .9
b2 = .999
eps = 1e-7
mw = 0
vw = 0
mb = 0
vb = 0
mk = 0
vk = 0
mk2 = 0
vk2 = 0
mkb = 0
vkb = 0
mkb2 = 0
vkb2 = 0
epochs = 5
bs = 100
nbatch = trains // bs
loss = np.zeros(epochs*nbatch*bs)
lowerclip = 0 #for 'clip' function
upperclip = 1
for e in range(epochs):
    print('epoch: %d'%e)
    np.random.seed(e)
    ibatch = np.random.permutation(trainimages[:trains])
    np.random.seed(e)
    lbatch = np.random.permutation(trainlabels[:trains])
    for nb in range(nbatch):
        bdLdw = np.zeros((bs, classes, filts2, rsh2, rsw2))
        bdLdb = np.zeros((bs, classes))
        bdLdkern2 = np.zeros((bs, filts2, filts, fsh2, fsw2))
        bdLdkb2 = np.zeros((bs, filts2))
        bdLdkern = np.zeros((bs, filts, fsh, fsw))
        bdLdkb = np.zeros((bs, filts))
        for i in range(bs):
            xx, label, label[lbatch[nb*bs + i]] = ibatch[nb*bs + i], np.zeros(classes), 1 #pick the shuffled sample and build a one-hot label
            k = np.zeros((filts, rsh, rsw))
            for j in range(rsh):
                for jj in range(rsw):
                    k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb
            afk = {'none': k, 'sig' : 1/(1+np.exp(-k)), 'tanh': (np.exp(k)-np.exp(-k))/(np.exp(k)+np.exp(-k)), 'relu': np.maximum(0, k), 'clip': np.minimum(np.maximum(lowerclip, k), upperclip)} #leak a=1 is the same as none
            kr = afk['relu']
            xs = (ws * kr.reshape(1,filts,rsh,rsw)).sum(axis = (1,2,3)) #xs.shape = (classes,)
            kk = np.zeros((filts2, rsh2, rsw2))
            for j0 in range(rsh2):
                for jj0 in range(rsw2):
                    kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsh2, rsw2)
            afkk = {'none': kk, 'sig' : 1/(1+np.exp(-kk)), 'tanh': (np.exp(kk)-np.exp(-kk))/(np.exp(kk)+np.exp(-kk)), 'relu': np.maximum(0, kk), 'clip': np.minimum(np.maximum(lowerclip, kk), upperclip)} #leak a=1 is the same as none
            kkr = afkk['relu']
            x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b #x.shape = (classes,)
            xo = x + xs #skip connection occurs here; information from first convolution gets added here; this differentiable equation gets backpropped
            y = np.exp(xo) / np.sum(np.exp(xo)) #softmax over the combined logits
            dydxo = -np.exp(xo)[lbatch[nb*bs + i]] * np.exp(xo) / np.sum(np.exp(xo))**2
            dydxo[lbatch[nb*bs + i]] = np.exp(xo)[lbatch[nb*bs + i]] * (np.sum(np.exp(xo)) - np.exp(xo)[lbatch[nb*bs + i]]) / np.sum(np.exp(xo))**2
            dLdy = 1 / y[lbatch[nb*bs + i]] #loss function: ln(y[lbatch[nb*bs + i]])
            dLdxo = dLdy * dydxo
            dxodx = 1 #derivative of skip connection (xo = x + xs) with respect to x
            dxodxs = 1 #derivative of skip connection (xo = x + xs) with respect to xs
            dLdxs = dLdxo * dxodxs #backpropagation path1: chain rule from calculus / differential equations; backpropagates through the xs path from (xo = x + xs)
            dxsdkr = ws
            dLdx = dLdxo * dxodx #backpropagation path2: chain rule from calculus / differential equations; backpropagates through the x path from (xo = x + xs)
            dxdw, dxdb = kkr, 1
            dLdw = dLdx.reshape(classes,1,1,1) * dxdw.reshape(1, filts2, rsh2, rsw2)
            dLdb = dLdx * dxdb
            bdLdw[i] = dLdw
            bdLdb[i] = dLdb
            dxdkkr = w #shape (classes, filts2, rsh2, rsw2)
            dLdkkr = (dLdx.reshape(classes, 1, 1, 1) * dxdkkr).sum(axis=0) #dLdkkr represents all class loss integrated into appropriate positions in kk output; meaning which positions in kk led to more loss or error
            afdkk = {'none': 1, 'sig' : kkr*(1-kkr), 'tanh': 1-kkr**2, 'relu': np.array(kk > 0, dtype = float), 'clip': np.array((kk > lowerclip) & (kk < upperclip), dtype = float)}
            dkkrdkk = afdkk['relu']
            dLdkk = dLdkkr * dkkrdkk
            dkkdkern2 = kr
            dLdkern2 = np.zeros((filts2,filts,fsh2,fsw2))
            for f in range(filts2):
                for j000 in range(fsh2):
                    for jj000 in range(fsw2):
                        dLdkern2[f, :, j000, jj000] = (dLdkk[f].reshape(1,rsh2,rsw2) * dkkdkern2[:,j000:j000+step2*rsh2:step2,jj000:jj000+step2*rsw2:step2]).sum(axis=(1,2)) #strided slice matches the stride (step2) used in the forward convolution
            bdLdkern2[i] = dLdkern2
            dkkdkb2 = 1
            dLdkb2 = dLdkk.sum(axis=(1,2)) * dkkdkb2
            bdLdkb2[i] = dLdkb2
            dkkdkr = kern2 #shape (filts2, filts, fsh2, fsw2)
            dLdkr = np.zeros((filts, rsh, rsw))
            for ooo in range(filts):
                for o in range(rsh2):
                    for oo in range(rsw2):
                        dLdkr[ooo,o*step2:o*step2+fsh2,oo*step2:oo*step2+fsw2] += (dLdkk[:,o,oo].reshape(filts2,1,1) * dkkdkr[:,ooo,:,:]).sum(axis=0) #(filts2,) broadcast over (filts2,fsh2,fsw2)
            dLdkr = dLdkr + (dLdxs.reshape(classes,1,1,1) * dxsdkr).sum(axis=0) #the beauty of ResNet backpropagation is right here!!!
            #dLdkr is composed of both backpropagation path1 (dLdx * dxdkkr * dkkrdkk * dkkdkr) + backpropagation path2 (dLdxs * dxsdkr)
            #the information from backpropagation path2 does not get backpropagated through path1, and does not suffer diminishing or vanishing gradients.
            #all the Loss information from (dLdxs * dxsdkr) gets passed more directly to dLdkern which is used for training the first convolution (kern)
            afdk = {'none': 1, 'sig' : kr*(1-kr), 'tanh': 1-kr**2, 'relu': np.array(k > 0, dtype = float), 'clip': np.array((k > lowerclip) & (k < upperclip), dtype = float)}
            dkrdk = afdk['relu']
            dLdk = dLdkr * dkrdk
            dkdkern = xx
            dLdkern = np.zeros((filts,fsh,fsw))
            for j00 in range(fsh):
                for jj00 in range(fsw):
                    dLdkern[:,j00,jj00] = (dkdkern[j00:j00+rsh, jj00:jj00+rsw].reshape(1,rsh,rsw) * dLdk).sum(axis=(1,2))
            bdLdkern[i] = dLdkern
            dkdkb = 1
            dLdkb = dLdk.sum(axis=(1,2)) * dkdkb
            bdLdkb[i] = dLdkb
        lrt = lr * (1 - b2**(e*nbatch + nb + 1))**.5 / (1 - b1**(e*nbatch + nb + 1)) #Adam bias-corrected learning rate for this update step
        mk = b1 * mk + (1 - b1) * bdLdkern.mean(axis=0)
        vk = b2 * vk + (1 - b2) * bdLdkern.mean(axis=0)**2
        kern = kern + lrt * mk / (vk**.5 + eps)
        mkb = b1 * mkb + (1 - b1) * bdLdkb.mean(axis=0)
        vkb = b2 * vkb + (1 - b2) * bdLdkb.mean(axis=0)**2
        kb = kb + lrt * mkb / (vkb**.5 + eps)
        mk2 = b1 * mk2 + (1 - b1) * bdLdkern2.mean(axis=0)
        vk2 = b2 * vk2 + (1 - b2) * bdLdkern2.mean(axis=0)**2
        kern2 = kern2 + lrt * mk2 / (vk2**.5 + eps)
        mkb2 = b1 * mkb2 + (1 - b1) * bdLdkb2.mean(axis=0)
        vkb2 = b2 * vkb2 + (1 - b2) * bdLdkb2.mean(axis=0)**2
        kb2 = kb2 + lrt * mkb2 / (vkb2**.5 + eps)
        mw = b1 * mw + (1 - b1) * bdLdw.mean(axis=0)
        vw = b2 * vw + (1 - b2) * bdLdw.mean(axis=0)**2
        w = w + lrt * mw / (vw**.5 + eps)
        mb = b1 * mb + (1 - b1) * bdLdb.mean(axis=0)
        vb = b2 * vb + (1 - b2) * bdLdb.mean(axis=0)**2
        b = b + lrt * mb / (vb**.5 + eps)
checke = np.zeros(tests)
for i in range(tests):
    xx = testimages[i]
    k = np.zeros((filts, rsh, rsw))
    for j in range(rsh):
        for jj in range(rsw):
            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb
    afk = {'none': k, 'sig' : 1/(1+np.exp(-k)), 'tanh': (np.exp(k)-np.exp(-k))/(np.exp(k)+np.exp(-k)), 'relu': np.maximum(0, k), 'clip': np.minimum(np.maximum(lowerclip, k), upperclip)} #leak a=1 is the same as none
    kr = afk['relu'] #use the same relu activation as in training
    kk = np.zeros((filts2, rsh2, rsw2))
    for j0 in range(rsh2):
        for jj0 in range(rsw2):
            kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsh2, rsw2)
    afkk = {'none': kk, 'sig' : 1/(1+np.exp(-kk)), 'tanh': (np.exp(kk)-np.exp(-kk))/(np.exp(kk)+np.exp(-kk)), 'relu': np.maximum(0, kk), 'clip': np.minimum(np.maximum(lowerclip, kk), upperclip)} #leak a=1 is the same as none
    kkr = afkk['relu']
    x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b
    #the skip term xs is omitted here: ws is identical for every class, so xs adds the same constant to every logit and does not change the argmax
    if testlabels[i] == np.argmax(x):
        checke[i] = 1
print(len(np.flatnonzero(checke==1))/tests)
The output of this code is the accuracy on the MNIST test dataset, printed as a decimal fraction. The accuracy will not be as high as that of the CNNs we have coded from scratch, because ResNets are not meant for such small ML models. However, despite the skip connection, this model still demonstrates learning. If you can understand all the techniques used in this model from scratch, it may be a great time to move on to automated ML platforms, which handle backpropagation for us and let us focus more on ML architectures.