Addressing Exploding Gradients in CNNs

May 8, 2023


When coding ML models from scratch, a major problem to deal with is NumPy runtime errors from invalid values and numerical overflows. Since ML models commonly use exponential functions, it is all too easy for the weights and layer values in the neural network to become too large or too small. Additionally, as the training dataset gets larger, layer values can compound or decay beyond what the computer can handle. This is why setting the correct learning rate is so important in ML models built from scratch: when a model learns too fast, the gradients can compound or decay too much.

This article is a continuation of the Convolutional Neural Networks From Scratch article. As we come to understand the mechanics and mathematics of ML, we can add features that make our ML models perform better. In the case of exploding gradients, we can clip the values of a layer to ensure the numbers stay within a range our computers can handle. This is known as gradient clipping. Although similar to an activation function, gradient clipping gives more control over the numerical range of a given layer's output. With the right parameters (especially the learning rate), this allows more thorough training over the dataset, which in turn produces better results in general.

The lower gradient clipping boundary can be enforced with the numpy.maximum() function and the upper gradient clipping boundary with the numpy.minimum() function. The clipping function would be something like:

if f(x) < lowerclipvalue, the clipped output is lowerclipvalue;

else if f(x) > upperclipvalue, the clipped output is upperclipvalue;

else the clipped output is f(x).

The derivative of the clipped output with respect to f(x) would be:

0 if f(x) < lowerclipvalue;

0 if f(x) > upperclipvalue;

1 otherwise.
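
As a quick illustration, here is a minimal NumPy sketch of the clip and its derivative mask, using made-up example values; the names lowerclip and upperclip mirror the variables in the full program below:

import numpy as np

lowerclip, upperclip = -1, 1

z = np.array([-2.5, -0.3, 0.0, 0.7, 3.1]) # example layer output before clipping

clipped = np.minimum(np.maximum(lowerclip, z), upperclip) # forward pass: values forced into [lowerclip, upperclip]

mask = np.array((z > lowerclip) & (z < upperclip), dtype=float) # backward pass: derivative is 1 inside the range, 0 outside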

Gradient clipping is very useful for avoiding NumPy overflows and invalid values, which allows the model to train longer. Though gradient clipping doesn't immediately produce better results on its own, it can prove very useful in combination with other techniques. Now that the problem of exploding gradients is partially solved, we can move on to other techniques that improve the effectiveness of ML models.


# this code is a walk-through of 2-layer CNN forward and backward propagation from scratch, using the Adam optimizer and gradient clipping

import numpy as np

mnistlocation = "INSERT MNIST FILE LOCATION" #you can download the file here

# For example: "/Users/enrichmentcap/Downloads/mndata.npz"

trains, tests = 1000, 1000 #size of the training and test sets; if your computer isn't fast, reduce these

mnist = np.load(mnistlocation)

trainimages, trainlabels = mnist['trainimages'][:trains] / 255, mnist['trainlabels'][:trains]

testimages, testlabels = mnist['testimages'][:tests] / 255, mnist['testlabels'][:tests]


##from tensorflow.keras.datasets import mnist #uncomment if using tensorflow library to retrieve dataset

##(trainimages, trainlabels), (testimages, testlabels) = mnist.load_data()

##trainimages, testimages = trainimages[:trains]/255, testimages[:tests]/255


np.random.seed(0)

classes = len(np.unique(trainlabels)) # number of output classes (10 for MNIST digits)

imw = trainimages.shape[2] # image width in pixels

imh = trainimages.shape[1] # image height in pixels


lr = .001 # learning rate


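# convolution hyperparameters: filter height/width, number of filters, and stride for each of the two layers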
fsw = 8

fsh = 8

fsw2 = 4

fsh2 = 4

filts =  16

filts2 = 8

step = 1

step2 = 2

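# output height/width of each convolution layer (valid convolution, no padding)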
rsw = (imw - fsw) // step + 1

rsh = (imh - fsh) // step + 1

rsw2 = (rsw - fsw2) // step2 + 1

rsh2 = (rsh - fsh2) // step2 + 1

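# initialize the convolution kernels, their biases, and the fully connected weights and bias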
kern = np.random.rand(filts, fsh, fsw) - .5

kb = np.zeros(filts)

kern2 = np.random.rand(filts2, filts, fsh2, fsw2) - .5

kb2 = np.zeros(filts2)

w = np.random.rand(classes, filts2, rsh2, rsw2) - .5

b = np.random.rand(classes) - .5


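# Adam hyperparameters (b1, b2, eps) and first/second moment accumulators for each parameter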
b1 = .9

b2 = .999

eps = 1e-7

mw = 0

vw = 0

mb = 0

vb = 0

mk = 0

vk = 0

mk2 = 0

vk2 = 0

mkb = 0

vkb = 0

mkb2 = 0

vkb2 = 0


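# clipping bounds applied to the output of each convolution layer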
upperclip = 1

lowerclip = -1


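# training loop: one image per iteration, with an Adam update after each example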
for i in range(trains):

    xx, label, label[trainlabels[i]] = trainimages[i], np.zeros(classes), 1 # current image and its one-hot label (assignments run left to right)

    

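    # forward pass, layer 1: stride-1 valid convolution, then clip the output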
    k = np.zeros((filts, rsh, rsw))

    for j in range(rsh):

        for jj in range(rsw):

            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb

    kr = np.minimum(np.maximum(lowerclip, k), upperclip)


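    # forward pass, layer 2: stride-2 convolution over the clipped layer-1 output, then clip again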
    kk = np.zeros((filts2, rsh2, rsw2))

    for j0 in range(rsh2):

        for jj0 in range(rsw2):

            kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsh2, rsw2)

    kkr = np.minimum(np.maximum(lowerclip, kk), upperclip)

    

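    # fully connected layer and softmax; the gradients below are of the log-probability of the true class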
    x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b

    y = np.exp(x) / np.sum(np.exp(x))

    dydx = -np.exp(x)[trainlabels[i]] * np.exp(x) / np.sum(np.exp(x))**2

    dydx[trainlabels[i]] = np.exp(x)[trainlabels[i]] * (np.sum(np.exp(x)) - np.exp(x)[trainlabels[i]]) / np.sum(np.exp(x))**2

    dLdy = 1 / y[trainlabels[i]]

    

    dLdx = dLdy * dydx

    dxdw, dxdb = kkr, 1


    dLdw = dLdx.reshape(classes,1,1,1) * dxdw.reshape(1, filts2, rsh2, rsw2)

    dLdb = dLdx * dxdb


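    # Adam update for the fully connected weights and bias (bias correction folded into the step size lrt)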
    lrt = lr * (1 - b2**(i+1))**.5 / (1 - b1**(i+1))        #as training increases lrt starts high goes low real fast then gradually heads towards 1

    mw = b1 * mw + (1 - b1) * dLdw                                              #takes large ratio of a small ratio of the previous slope,  and adds to a small ratio of current slope

    vw = b2 * vw + (1 - b2) * dLdw**2                                           #takes really large ratio of a really small ratio of the previous slope squared, and adds really small ratio of current slope squared

    w = w + lrt * mw / (vw**.5 + eps)                                           #better prelim results with abs(dLdw) in vw and removing square root of vw in w; independent from prior mw and vw, this results is a tanh shape more intuitive for what one would like as slope increases or decreases


    mb = b1 * mb + (1 - b1) * dLdb

    vb = b2 * vb + (1 - b2) * dLdb**2

    b = b + lrt * mb / (vb**.5 + eps)

    

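    # backpropagate through the fully connected layer, then through the second clip (zero gradient outside the clip range)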
    dxdkkr = w       # w has shape (classes, filts2, rsh2, rsw2)

    dLdkkr = (dLdx.reshape(classes, 1, 1, 1) * dxdkkr).sum(axis=0) #dLdkkr aggregates each class's loss gradient into the corresponding positions of the clipped layer-2 output, i.e. which positions led to more loss or error

    dkkrdkk = np.array((kk > lowerclip) & (kk < upperclip), dtype = float) # derivative of the clip: 1 inside the range, 0 outside

    dLdkk = dLdkkr * dkkrdkk


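    # gradient of the loss with respect to the second convolution kernel and its bias, followed by their Adam updates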
    dkkdkern2 = kr

    dLdkern2 =  np.zeros((filts2,filts,fsh2,fsw2))

    for f in range(filts2):

        for j000 in range(fsh2):

            for jj000 in range(fsw2):

                dLdkern2[f, :, j000, jj000] = (dLdkk[f].reshape(1,rsh2,rsw2) * dkkdkern2[:, j000:j000+step2*rsh2:step2, jj000:jj000+step2*rsw2:step2]).sum(axis=(1,2)) # strided slice matches the stride-2 forward convolution


    mk2 = b1 * mk2 + (1 - b1) * dLdkern2

    vk2 = b2 * vk2 + (1 - b2) * dLdkern2**2

    kern2 = kern2 + lrt * mk2 / (vk2**.5 + eps)


    dkkdkb2 = 1

    dLdkb2 = dLdkk.sum(axis=(1,2)) * dkkdkb2


    mkb2 = b1 * mkb2 + (1 - b1) * dLdkb2

    vkb2 = b2 * vkb2 + (1 - b2) * dLdkb2**2

    kb2 = kb2 + lrt * mkb2 / (vkb2**.5 + eps)


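    # backpropagate through the second convolution into the clipped layer-1 output, then through the first clip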
    dkkdkr = kern2 # kern2 has shape (filts2, filts, fsh2, fsw2)

    dLdkr = np.zeros((filts, rsh, rsw))

    for ooo in range(filts):

        for o in range(rsh2):

            for oo in range(rsw2):

                dLdkr[ooo,o*step2:o*step2+fsh2,oo*step2:oo*step2+fsw2] += (dLdkk[:,o,oo].reshape(filts2,1,1) * dkkdkr[:,ooo,:,:]).sum(axis=0) # broadcast (filts2,1,1) against (filts2,fsh2,fsw2), then sum over the layer-2 filters

    dkrdk = np.array((k > lowerclip) & (k < upperclip), dtype = float) # derivative of the first clip

    dLdk = dLdkr * dkrdk

    

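    # gradient of the loss with respect to the first convolution kernel and its bias, followed by their Adam updates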
    dkdkern = xx

    dLdkern = np.zeros((filts,fsh,fsw))

    for j00 in range(fsh):

        for jj00 in range(fsw):

            dLdkern[:,j00,jj00] = (dkdkern[j00:j00+rsh, jj00:jj00+rsw].reshape(1,rsh,rsw) * dLdk).sum(axis=(1,2)) # slice of shape (rsh, rsw) matches dLdk


    mk = b1 * mk + (1 - b1) * dLdkern

    vk = b2 * vk + (1 - b2) * dLdkern**2

    kern = kern + lrt * mk / (vk**.5 + eps)

    

    dkdkb = 1

    dLdkb = dLdk.sum(axis=(1,2)) * dkdkb


    mkb = b1 * mkb + (1 - b1) * dLdkb

    vkb = b2 * vkb + (1 - b2) * dLdkb**2

    kb = kb + lrt * mkb / (vkb**.5 + eps)

         


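# evaluation: forward pass over the test set, counting images whose argmax prediction matches the label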
checke = np.zeros(tests)               

for i in range(tests):

    xx = testimages[i]

    k = np.zeros((filts, rsh, rsw))

    for j in range(rsh):

        for jj in range(rsw):

            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb

    kr = np.minimum(np.maximum(lowerclip, k), upperclip) # use the same clipped activation as in training


    kk = np.zeros((filts2, rsh2, rsw2))

    for j0 in range(rsh2):

        for jj0 in range(rsw2):

            kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsh2, rsw2)

    kkr = np.minimum(np.maximum(lowerclip, kk), upperclip) # use the same clipped activation as in training

    

    x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b

    if testlabels[i] == np.argmax(x):

        checke[i] = 1

print(len(np.flatnonzero(checke==1))/tests) # test-set accuracy: fraction of correctly classified images