Value of Random Numbers in ML
May 9, 2023
In this article, we are going to code the Glorot Initializer and the He Initializer to demonstrate how valuable random numbers can be in ML.
In the previous articles, we coded a neural network, a two-layer convolutional neural network, the Adam optimizer, activation functions, and gradient clipping, all from scratch. We were able to reach an accuracy of over 90% with a simple neural network, yet the CNNs we have produced haven't been able to surpass that basic network. This is because CNNs require an arsenal of techniques before they start to generate meaningful results. One valuable technique is weight initialization, which not only improves the performance of an ML model but also helps with exploding gradients (the biggest nuisance so far in programming ML from scratch). So far, we have used a uniform distribution of random numbers between -1 and 1, or to be exact between -.5 and .5, which is just simpler to code. After all, they're just random numbers, right?
One might think that random numbers are only useful in ML because they allow the individual weights to differentiate naturally as the model trains; if the weights all started at zero, then across multiple layers, matrix multiplication would also produce zeros and pass along no useful information. However, random numbers can do much more than that. Researchers have found ways to draw sets of random numbers that are generally close to optimal for a given model. These initializers set the variance of the trainable weights based on the sizes of the layers surrounding them. The math behind most of these initializers is very similar, and the bottom line is that they work. In general, the He Initializer is for weights whose input has just been ReLU activated, as explained in the original paper, while the Glorot Initializer is for weights whose input has been sigmoid or tanh activated.
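To make those formulas concrete before the full CNN code below, here is a minimal sketch (my own illustration, with made-up layer sizes, not code from the papers) of the main variants for a generic dense weight matrix, where fan_in and fan_out are the sizes of the layers on either side of the weights:
import numpy as np
np.random.seed(0)
fan_in, fan_out = 784, 128 #hypothetical layer sizes; replace with your own
w_glorot_normal = np.random.normal(0, np.sqrt(2 / (fan_in + fan_out)), (fan_out, fan_in)) #Glorot scales the variance by both fan_in and fan_out
w_glorot_uniform = np.random.uniform(-1, 1, (fan_out, fan_in)) * np.sqrt(6 / (fan_in + fan_out))
w_he_normal = np.random.normal(0, np.sqrt(2 / fan_in), (fan_out, fan_in)) #He scales the variance by fan_in only, intended for ReLU activations
w_he_uniform = np.random.uniform(-1, 1, (fan_out, fan_in)) * np.sqrt(6 / fan_in)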
ML models are very diverse and have a lot of parameters that change their dynamics, so it's nice to analyze what works and fine-tune accordingly. Keras, which is built on top of TensorFlow, provides source code showing how the popular initializers are implemented. Once you move to more automated machine learning frameworks, it becomes much harder to customize and try out ideas. Below is my take on the Glorot and He Initializers for a two-layer CNN that uses ReLU activation functions and the Adam optimizer.
# this code is a walk-through of 2-layer CNN forward and backward propagation from scratch, using the Adam optimizer
import numpy as np
mnistlocation = "INSERT MNIST FILE LOCATION" #you can download the file here
# For example: "/Users/enrichmentcap/Downloads/mndata.npz"
trains, tests = 1000, 1000 #depends on computer processing speed. If computer isn't fast, reduce size of training and test dataset
mnistfile = np.load(mnistlocation)
trainimages, trainlabels = mnistfile['trainimages'][:trains] / 255, mnistfile['trainlabels'][:trains]
testimages, testlabels = mnistfile['testimages'][:tests] / 255, mnistfile['testlabels'][:tests]
##from tensorflow.keras.datasets import mnist #uncomment if using tensorflow library to retrieve dataset
##(trainimages, trainlabels), (testimages, testlabels) = mnist.load_data()
##trainimages, testimages = trainimages[:trains]/255, testimages[:tests]/255
np.random.seed(0)
classes = len(np.unique(trainlabels))
imw = trainimages.shape[2]
imh = trainimages.shape[1]
lr = .001
fsw = 8
fsh = 8
fsw2 = 4
fsh2 = 4
filts = 16
filts2 = 8
step = 1
step2 = 2
rsw = (imw - fsw) // step + 1
rsh = (imh - fsh) // step + 1
rsw2 = (rsw - fsw2) // step2 + 1
rsh2 = (rsh - fsh2) // step2 + 1
## I found that for best results with convolutional layers, the input layer (aka fan_in) should be regarded as the size of the convolutional kernel, and the output layer (aka fan_out) as the size of the convolution layer's output
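## For reference (standard textbook formulas, not tied to this particular code): Glorot uses std = sqrt(2 / (fan_in + fan_out)) or uniform limit sqrt(6 / (fan_in + fan_out));
## He uses std = sqrt(2 / fan_in) or uniform limit sqrt(6 / fan_in); Lecun uses std = sqrt(1 / fan_in) or uniform limit sqrt(3 / fan_in)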
kern = np.random.normal(0, np.sqrt(2 / (filts*fsw*fsh + filts*rsw*rsh)), (filts,fsw,fsh)) #using Glorot Normal Initializer, as the input layer is not ReLU activated
##kern = np.random.uniform(-1,1,(filts,fsw,fsh)) * np.sqrt(6 / (filts*fsw*fsh + filts*rsw*rsh)) #Glorot Uniform Initializer
##kern = np.random.normal(0, np.sqrt(2 / (filts*fsw*fsh)), (filts,fsw,fsh)) #He Normal Initializer
##kern = np.random.uniform(-1,1,(filts,fsw,fsh)) * np.sqrt(6 / (filts*fsw*fsh)) #He Uniform Initializer
##kern = np.random.normal(0, np.sqrt(1 / (filts*fsw*fsh)), (filts,fsw,fsh)) #Lecun Normal Initializer; meant for SELU activation function
##kern = np.random.uniform(-1,1,(filts,fsw,fsh)) * np.sqrt(3 / (filts*fsw*fsh)) #Lecun Uniform Initializer; meant for SELU activation function
kb = np.zeros(filts) #all biases are initialized as zeros for simplicity
kern2 = np.random.normal(0, np.sqrt(2 / (filts2*filts*fsw2*fsh2)), (filts2,filts,fsw2,fsh2)) #using He Normal Initializer, as the input is ReLU activated
##kern2 = np.random.uniform(-1,1, (filts2,filts,fsw2,fsh2)) * np.sqrt(6 / (filts2*filts*fsw2*fsh2)) #He Uniform Initializer
##kern2 = np.random.normal(0, np.sqrt(2 / (filts2*filts*fsw2*fsh2 + filts2*rsw2*rsh2)), (filts2,filts,fsw2,fsh2)) #Glorot Normal Initializer
##kern2 = np.random.uniform(-1,1, (filts2,filts,fsw2,fsh2)) * np.sqrt(6 / (filts2*filts*fsw2*fsh2 + filts2*rsw2*rsh2)) #Glorot Uniform Initializer
##kern2 = np.random.normal(0, np.sqrt(1 / (filts2*filts*fsw2*fsh2)), (filts2,filts,fsw2,fsh2)) #Lecun Normal Initializer
##kern2 = np.random.uniform(-1,1, (filts2,filts,fsw2,fsh2)) * np.sqrt(3 / (filts2*filts*fsw2*fsh2)) #Lecun Uniform Initializer
kb2 = np.zeros(filts2)
## For fully connected, dense layers: the input layer (fan_in) is the size of the preceding layer; the output layer (fan_out) is the size of the output of the fully connected, dense layer.
w = np.random.normal(0, np.sqrt(2 / (filts2*rsw2*rsh2)), (classes,filts2,rsw2,rsh2)) #using He Normal Initializer, as the input is ReLU activated
##w = np.random.uniform(-1,1, (classes,filts2,rsw2,rsh2)) * np.sqrt(6 / (filts2*rsw2*rsh2)) #He Uniform Initializer
##w = np.random.normal(0, np.sqrt(2 / (filts2*rsw2*rsh2 + classes)), (classes,filts2,rsw2,rsh2)) #Glorot Normal Initializer
##w = np.random.uniform(-1,1, (classes,filts2,rsw2,rsh2)) * np.sqrt(6 / (filts2*rsw2*rsh2 + classes)) #Glorot Uniform Initializer
##w = np.random.normal(0, np.sqrt(1 / (filts2*rsw2*rsh2)), (classes,filts2,rsw2,rsh2)) #Lecun Normal Initializer
##w = np.random.uniform(-1,1, (classes,filts2,rsw2,rsh2)) * np.sqrt(3 / (filts2*rsw2*rsh2)) #Lecun Uniform Initializer
b = np.zeros(classes)
b1 = .9
b2 = .999
eps = 1e-7
mw = 0
vw = 0
mb = 0
vb = 0
mk = 0
vk = 0
mk2 = 0
vk2 = 0
mkb = 0
vkb = 0
mkb2 = 0
vkb2 = 0
for i in range(trains):
    xx, label, label[trainlabels[i]] = trainimages[i], np.zeros(classes), 1 #build the one-hot label (assignment targets are evaluated left to right)
    k = np.zeros((filts, rsh, rsw))
    for j in range(rsh):
        for jj in range(rsw):
            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb
    kr = np.maximum(0, k) #ReLU activation function
    kk = np.zeros((filts2, rsh2, rsw2))
    for j0 in range(rsh2):
        for jj0 in range(rsw2):
            kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsh2, rsw2)
    kkr = np.maximum(0, kk) #ReLU activation function
    x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b
    y = np.exp(x) / np.sum(np.exp(x)) #softmax output
    dydx = -np.exp(x)[trainlabels[i]] * np.exp(x) / np.sum(np.exp(x))**2
    dydx[trainlabels[i]] = np.exp(x)[trainlabels[i]] * (np.sum(np.exp(x)) - np.exp(x)[trainlabels[i]]) / np.sum(np.exp(x))**2
    dLdy = 1 / y[trainlabels[i]] #derivative of log(y[correct class]); the updates below add the gradient (ascent on the log-likelihood) rather than subtract it
    dLdx = dLdy * dydx
    dxdw, dxdb = kkr, 1
    dLdw = dLdx.reshape(classes,1,1,1) * dxdw.reshape(1, filts2, rsh2, rsw2)
    dLdb = dLdx * dxdb
    lrt = lr * (1 - b2**(i+1))**.5 / (1 - b1**(i+1)) #bias-corrected step size: the correction factor starts high, drops quickly, then gradually heads towards 1, so lrt approaches lr
    mw = b1 * mw + (1 - b1) * dLdw #first moment: exponentially weighted average of the gradient (mostly the previous value plus a small fraction of the current slope)
    vw = b2 * vw + (1 - b2) * dLdw**2 #second moment: exponentially weighted average of the squared gradient
    w = w + lrt * mw / (vw**.5 + eps) #better preliminary results came from using abs(dLdw) in vw and dropping the square root of vw here; independent of the prior mw and vw, that results in a tanh-like shape, which is more intuitive for what one would want as the slope increases or decreases
    mb = b1 * mb + (1 - b1) * dLdb
    vb = b2 * vb + (1 - b2) * dLdb**2
    b = b + lrt * mb / (vb**.5 + eps)
    dxdkkr = w #shape (classes, filts2, rsw2, rsh2)
    dLdkkr = (dLdx.reshape(classes, 1, 1, 1) * dxdkkr).sum(axis=0) #dLdkkr aggregates every class's loss gradient into the corresponding positions of the kkr output, i.e. which positions led to more loss or error
    dkkrdkk = np.array(kk > 0, dtype = float) #derivative of ReLU activation function
    dLdkk = dLdkkr * dkkrdkk
    dkkdkern2 = kr
    dLdkern2 = np.zeros((filts2,filts,fsh2,fsw2))
    for f in range(filts2):
        for j000 in range(fsh2):
            for jj000 in range(fsw2):
                dLdkern2[f, :, j000, jj000] = (dLdkk[f].reshape(1,rsh2,rsw2) * dkkdkern2[:, j000:j000+step2*rsh2:step2, jj000:jj000+step2*rsw2:step2]).sum(axis=(1,2)) #slice kr with stride step2 so the positions match the strided forward convolution
    mk2 = b1 * mk2 + (1 - b1) * dLdkern2
    vk2 = b2 * vk2 + (1 - b2) * dLdkern2**2
    kern2 = kern2 + lrt * mk2 / (vk2**.5 + eps)
    dkkdkb2 = 1
    dLdkb2 = dLdkk.sum(axis=(1,2)) * dkkdkb2
    mkb2 = b1 * mkb2 + (1 - b1) * dLdkb2
    vkb2 = b2 * vkb2 + (1 - b2) * dLdkb2**2
    kb2 = kb2 + lrt * mkb2 / (vkb2**.5 + eps)
    dkkdkr = kern2 #shape (filts2, filts, fsh2, fsw2)
    dLdkr = np.zeros((filts, rsh, rsw))
    for ooo in range(filts):
        for o in range(rsh2):
            for oo in range(rsw2):
                dLdkr[ooo,o*step2:o*step2+fsh2,oo*step2:oo*step2+fsw2] += (dLdkk[:,o,oo].reshape(filts2,1,1) * dkkdkr[:,ooo,:,:]).sum(axis=0) #scatter each output position's gradient back onto the strided input patch it came from
    dkrdk = np.array(k > 0, dtype = float) #derivative of ReLU activation function
    dLdk = dLdkr * dkrdk
    dkdkern = xx
    dLdkern = np.zeros((filts,fsh,fsw))
    for j00 in range(fsh):
        for jj00 in range(fsw):
            dLdkern[:,j00,jj00] = (dkdkern[j00:j00+step*rsh:step, jj00:jj00+step*rsw:step].reshape(1,rsh,rsw) * dLdk).sum(axis=(1,2)) #height slice first, then width, strided by step to match the forward convolution
    mk = b1 * mk + (1 - b1) * dLdkern
    vk = b2 * vk + (1 - b2) * dLdkern**2
    kern = kern + lrt * mk / (vk**.5 + eps)
    dkdkb = 1
    dLdkb = dLdk.sum(axis=(1,2)) * dkdkb
    mkb = b1 * mkb + (1 - b1) * dLdkb
    vkb = b2 * vkb + (1 - b2) * dLdkb**2
    kb = kb + lrt * mkb / (vkb**.5 + eps)
checke = np.zeros(tests)
for i in range(tests):
    xx = testimages[i]
    k = np.zeros((filts, rsh, rsw))
    for j in range(rsh):
        for jj in range(rsw):
            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb
    kr = np.maximum(0, k)
    kk = np.zeros((filts2, rsh2, rsw2))
    for j0 in range(rsh2):
        for jj0 in range(rsw2):
            kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsh2, rsw2)
    kkr = np.maximum(0, kk)
    x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b
    if testlabels[i] == np.argmax(x):
        checke[i] = 1
print(len(np.flatnonzero(checke==1))/tests)
Notice how each of the initializers uses a square root. That is because these square-root terms are standard deviations. Inside the square roots, you can see that the researchers are manipulating the variance based on the sizes of the surrounding layers. I found that for convolutional layers, using the size of the kernel as the input size produces the best results. For fully connected, dense layers, however, using the size of the weight tensor itself doesn't produce the best results; there I found it best to use the size of the input layer. The research papers show that scaling a layer's variance by the inverse of the input size (and the output size as well, for Glorot) stabilizes the variance of the next layer. This prevents exploding gradients and lets information pass efficiently between layers during both forward and backward propagation.
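As a rough illustration of that variance argument, the toy experiment below (my own sketch, with arbitrary layer width and depth) pushes random data through a stack of ReLU layers and compares what happens to the activation variance under the naive uniform initialization we used before versus He initialization:
import numpy as np
np.random.seed(0)
size = 256 #width of each toy layer
x = np.random.normal(0, 1, (1000, size)) #batch of random inputs with unit variance
def depth_variance(init):
    h = x
    for _ in range(10): #ten stacked linear + ReLU layers
        h = np.maximum(0, h @ init((size, size)))
    return h.var()
naive = lambda shape: np.random.uniform(-.5, .5, shape) #the arbitrary initialization used in earlier articles
he = lambda shape: np.random.normal(0, np.sqrt(2 / shape[0]), shape) #He Normal with fan_in = shape[0]
print(depth_variance(naive), depth_variance(he)) #the naive variance blows up by orders of magnitude; He keeps it near the input scale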
On the short tests that I ran, the Normal initializers gave better results than their Uniform counterparts. I also found that using Glorot for ReLU-activated inputs and He for linear or non-activated inputs gave better results than the recommended pairing. Strangely, I got the best results with the Lecun Normal Initializer. These results don't surprise me: having read the research papers, I wasn't fully convinced by the reasoning behind the constants used in the formulas, and the logic for adding the input-layer size (fan_in) to the output-layer size (fan_out) seemed almost arbitrary and not well supported. In these tests, Glorot initialization didn't produce the best results. Nonetheless, all of these initializers produced significant improvements, far better than an arbitrary assignment of random values.