ML Activation Functions From Scratch

May 10, 2023


Activation functions are a useful tool in machine learning. Here, we are going to code activation functions and their derivatives from scratch. The activation functions covered in this article are RELU, sigmoid, and tanh. Full code and comments are below. This is a continuation of the Neural Networks from Scratch and CNNs from Scratch articles, which give a better intuition for the code.

Activation functions are simple tools that can normalize any layer. The sigmoid function takes an input layer and normalizes the output values to between 0 and 1. The tanh function takes an input layer and normalizes the output values to between -1 and 1. The beauty of these functions is that they are continuous and differentiable, which helps in the backpropagation process.
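As a minimal sketch in plain NumPy (the function names here are my own, not from the article's code), both functions and their derivatives can be written directly from their definitions. A convenient property used later in backpropagation is that each derivative can be computed from the forward-pass output alone: the sigmoid derivative is s(1 - s) and the tanh derivative is 1 - t².

import numpy as np

def sigmoid(x):
    # squashes any real-valued input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    # s is the sigmoid output from the forward pass
    return s * (1 - s)

def tanh(x):
    # squashes any real-valued input into the range (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_derivative(t):
    # t is the tanh output from the forward pass
    return 1 - t**2

layer = np.array([-2.0, 0.0, 3.0])
print(sigmoid(layer))                       # values between 0 and 1
print(sigmoid_derivative(sigmoid(layer)))
print(tanh(layer))                          # values between -1 and 1
print(tanh_derivative(tanh(layer)))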

Another very popular activation function is the rectified linear unit, or RELU. In the base example, Neural Networks From Scratch, the softmax activation function was used to normalize the numbers as probability scores between zero and one. A linear activation function is one that simply multiplies all inputs by “1”, which maintains all values; it is the same as not using an activation function at all. The RELU activation function clips the output to non-negative numbers: any value in the convolutional layer output that is negative is converted to zero, and all positive values remain the same. Compared to other activation functions that rely on exponentials with a high computational cost, RELU is much faster to compute. Despite its simplicity, RELU can still be differentiated, which is very important in the backpropagation process. Luckily, the backpropagation step for RELU is easy: the derivative is “0” for all values of the input that are not positive, and “1” for all values that are positive.
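A minimal sketch of RELU and its derivative in NumPy (again, the function names are my own):

import numpy as np

def relu(x):
    # negative values become zero, positive values pass through unchanged
    return np.maximum(0, x)

def relu_derivative(x):
    # 1 where the input was positive, 0 everywhere else
    return (x > 0).astype(float)

layer = np.array([-2.0, 0.0, 3.0])
print(relu(layer))             # [0. 0. 3.]
print(relu_derivative(layer))  # [0. 0. 1.]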

It may be counterintuitive to clip all values that are negative after the convolutional layer. After all, these negative values still contain information about the image. While this is a valid concern, adding more filters can compensate for the loss: a negative output value clipped for one filter can be a positive output for another filter.

Positive Filter:

[[1, 0, 0],

[0, 1, 0],

[0, 0, 1]]

Negative Filter:

[[-1, 0, 0],

[0, -1, 0],

[0, 0, -1]]

The negative filter has the same shape as the positive filter but contains negated values. Though the output of a given filter may have its negative values clipped, additional filters can make use of those same values, as sketched below. Since RELU does not use an exponential, it is computationally cheap, so using RELU and adding more filters can be more effective and practical than other activation functions with a higher computational cost.
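As a rough illustration (the patch values below are made up for the example, not taken from the dataset), convolving the same 3x3 patch with the positive filter and the negative filter produces outputs of opposite sign, so information clipped by RELU on one filter survives on the other:

import numpy as np

patch = np.array([[ 0.2,  0.0, 0.0],
                  [ 0.0, -0.9, 0.0],
                  [ 0.0,  0.0, 0.1]])

pos_filter = np.eye(3)    # the positive diagonal filter above
neg_filter = -np.eye(3)   # same shape, negative multiples

pos_out = (pos_filter * patch).sum()   # -0.6
neg_out = (neg_filter * patch).sum()   #  0.6

print(np.maximum(0, pos_out))  # 0.0 -- clipped by RELU
print(np.maximum(0, neg_out))  # 0.6 -- the same information survives on the negative filter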


# this code is a walk-through of 2-layer CNN forward and backward propagation from scratch, using the Adam optimizer

import numpy as np

mnistlocation = "INSERT MNIST FILE LOCATION" #you can download the file here

# For example: "/Users/enrichmentcap/Downloads/mndata.npz"

trains, tests = 1000, 1000 #depends on processing speed; if your computer isn't fast, reduce the size of the training and test datasets

trainimages, trainlabels, testimages, testlabels = np.load(mnistlocation)['trainimages'][:trains] / 255, np.load(mnistlocation)['trainlabels'][:trains], np.load(mnistlocation)['testimages'][:tests] / 255, np.load(mnistlocation)['testlabels'][:tests]


##from tensorflow.keras.datasets import mnist #uncomment if using tensorflow library to retrieve dataset

##(trainimages, trainlabels), (testimages, testlabels) = mnist.load_data()

##trainimages, testimages = trainimages[:trains]/255, testimages[:tests]/255


np.random.seed(0)

classes = len(np.unique(trainlabels))

imw = trainimages.shape[2] 

imh = trainimages.shape[1]


lr = .001


fsw = 8

fsh = 8

fsw2 = 4

fsh2 = 4

filts =  16

filts2 = 8

step = 1

step2 = 2

rsw = (imw - fsw) // step + 1

rsh = (imh - fsh) // step + 1

rsw2 = (rsw - fsw2) // step2 + 1

rsh2 = (rsh - fsh2) // step2 + 1

kern = np.random.rand(filts, fsh, fsw) - .5

kb = np.zeros(filts)

kern2 = np.random.rand(filts2, filts, fsh2, fsw2) - .5

kb2 = np.zeros(filts2)

w = np.random.rand(classes, filts2, rsh2, rsw2) - .5

b = np.random.rand(classes) - .5


b1 = .9

b2 = .999

eps = 1e-7

mw = 0

vw = 0

mb = 0

vb = 0

mk = 0

vk = 0

mk2 = 0

vk2 = 0

mkb = 0

vkb = 0

mkb2 = 0

vkb2 = 0


for i in range(trains):

    xx, label, label[trainlabels[i]] = trainimages[i], np.zeros(classes), 1

    

    k = np.zeros((filts, rsh, rsw))

    for j in range(rsh):

        for jj in range(rsw):

            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb

    afk = {'none': k, 'sig' : 1/(1+np.exp(-k)), 'tanh': (np.exp(k)-np.exp(-k))/(np.exp(k)+np.exp(-k)), 'relu': np.maximum(0, k)}     #leak a=1 is the same as none

    kr = afk['sig'] #make sure function chosen here is the same for dkrdk(ex. dkrdk = afdk['sig']) and testing loop towards bottom of code(ex. kr = afk['sig'])


    kk = np.zeros((filts2, rsh2, rsw2))

    

    for j0 in range(rsh2):

        for jj0 in range(rsw2):

            kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsize2, rsize2)

    afkk = {'none': kk, 'sig' : 1/(1+np.exp(-kk)), 'tanh': (np.exp(kk)-np.exp(-kk))/(np.exp(kk)+np.exp(-kk)), 'relu': np.maximum(0, kk)}

    kkr = afkk['sig'] #make sure function chosen here is the same for dkkrdkk(ex. dkkrdkk = afdkk['sig']) and testing loop towards bottom of code(ex. kkr = afkk['sig'])

    

    x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b

    y = np.exp(x) / np.sum(np.exp(x))

    dydx = -np.exp(x)[trainlabels[i]] * np.exp(x) / np.sum(np.exp(x))**2

    dydx[trainlabels[i]] = np.exp(x)[trainlabels[i]] * (np.sum(np.exp(x)) - np.exp(x)[trainlabels[i]]) / np.sum(np.exp(x))**2

    dLdy = 1 / y[trainlabels[i]]

    

    dLdx = dLdy * dydx

    dxdw, dxdb = kkr, 1


    dLdw = dLdx.reshape(classes,1,1,1) * dxdw.reshape(1, filts2, rsh2, rsw2)

    dLdb = dLdx * dxdb


    lrt = lr * (1 - b2**(i+1))**.5 / (1 - b1**(i+1))        #as training increases lrt starts high goes low real fast then gradually heads towards 1

    mw = b1 * mw + (1 - b1) * dLdw                                              #takes large ratio of a small ratio of the previous slope,  and adds to a small ratio of current slope

    vw = b2 * vw + (1 - b2) * dLdw**2                                           #takes really large ratio of a really small ratio of the previous slope squared, and adds really small ratio of current slope squared

    w = w + lrt * mw / (vw**.5 + eps)                                           #better prelim results with abs(dLdw) in vw and removing the square root of vw in w; independent from prior mw and vw, this results in a tanh shape more intuitive for what one would like as slope increases or decreases


    mb = b1 * mb + (1 - b1) * dLdb

    vb = b2 * vb + (1 - b2) * dLdb**2

    b = b + lrt * mb / (vb**.5 + eps)

    

    dxdkkr = w       #(classes, filts2*rsize2**2)

    dLdkkr = (dLdx.reshape(classes, 1, 1, 1) * dxdkkr).sum(axis=0) #dLdkkr represents all class loss integrated into the appropriate positions of the kk output; meaning which positions in kk led to more loss or error

    afdkk = {'none': 1, 'sig' : kkr*(1-kkr), 'tanh': 1-kkr**2, 'relu': np.array(kk > 0, dtype = float)}

    dkkrdkk = afdkk['sig']

    dLdkk = dLdkkr * dkkrdkk


    dkkdkern2 = kr

    dLdkern2 =  np.zeros((filts2,filts,fsh2,fsw2))

    for f in range(filts2):

        for j000 in range(fsh2):

            for jj000 in range(fsw2):

                dLdkern2[f, :, j000, jj000] = (dLdkk[f].reshape(1,rsh2,rsw2) * dkkdkern2[:,j000:j000+rsh2,jj000:jj000+rsw2]).sum(axis=(1,2))     


    mk2 = b1 * mk2 + (1 - b1) * dLdkern2

    vk2 = b2 * vk2 + (1 - b2) * dLdkern2**2

    kern2 = kern2 + lrt * mk2 / (vk2**.5 + eps)


    dkkdkb2 = 1

    dLdkb2 = dLdkk.sum(axis=(1,2)) * dkkdkb2


    mkb2 = b1 * mkb2 + (1 - b1) * dLdkb2

    vkb2 = b2 * vkb2 + (1 - b2) * dLdkb2**2

    kb2 = kb2 + lrt * mkb2 / (vkb2**.5 + eps)


    dkkdkr = kern2 # shape [filts2, filts * fsize2**2]

    dLdkr = np.zeros((filts, rsh, rsw))

    for ooo in range(filts):

        for o in range(rsh2):

            for oo in range(rsw2):

                dLdkr[ooo,o*step2:o*step2+fsh2,oo*step2:oo*step2+fsw2] += (dLdkk[:,o,oo].reshape(filts2,1,1) * dkkdkr[:,ooo,:,:]).sum(axis=0) #[filts2,] @ [filts2,fsize2,fsize2]

    afdk = {'none': 1, 'sig' : kr*(1-kr), 'tanh': 1-kr**2, 'relu': np.array(k > 0, dtype = float)}

    dkrdk = afdk['sig']

    dLdk = dLdkr * dkrdk

    

    dkdkern = xx

    dLdkern = np.zeros((filts,fsh,fsw))

    for j00 in range(fsh):

        for jj00 in range(fsw):

            dLdkern[:,j00,jj00] = (dkdkern[j00:j00+rsh, jj00:jj00+rsw].reshape(1,rsh,rsw) * dLdk).sum(axis=(1,2)) #index rows with rsh and columns with rsw so the slice matches dLdk's shape


    mk = b1 * mk + (1 - b1) * dLdkern

    vk = b2 * vk + (1 - b2) * dLdkern**2

    kern = kern + lrt * mk / (vk**.5 + eps)

    

    dkdkb = 1

    dLdkb = dLdk.sum(axis=(1,2)) * dkdkb


    mkb = b1 * mkb + (1 - b1) * dLdkb

    vkb = b2 * vkb + (1 - b2) * dLdkb**2

    kb = kb + lrt * mkb / (vkb**.5 + eps)

         


checke = np.zeros(tests)               

for i in range(tests):

    xx = testimages[i]

    k = np.zeros((filts, rsh, rsw))

    for j in range(rsh):

        for jj in range(rsw):

            k[:,j,jj] = (kern * xx[step*j:step*j+fsh, step*jj:step*jj+fsw].reshape(1,fsh,fsw)).sum(axis=(1,2)) + kb

    afk = {'none': k, 'sig' : 1/(1+np.exp(-k)), 'tanh': (np.exp(k)-np.exp(-k))/(np.exp(k)+np.exp(-k)), 'relu': np.maximum(0, k)}     #leak a=1 is the same as none

    kr = afk['sig']


    kk = np.zeros((filts2, rsh2, rsw2))

    for j0 in range(rsh2):

        for jj0 in range(rsw2):

            kk[:,j0,jj0] = (kern2 * kr[:,step2*j0:step2*j0+fsh2, step2*jj0:step2*jj0+fsw2].reshape(1,filts,fsh2,fsw2)).sum(axis=(1,2,3)) + kb2 #kk.shape = (filts2, rsize2, rsize2)

    afkk = {'none': kk, 'sig' : 1/(1+np.exp(-kk)), 'tanh': (np.exp(kk)-np.exp(-kk))/(np.exp(kk)+np.exp(-kk)), 'relu': np.maximum(0, kk)}

    kkr = afkk['sig']

    

    x = (w * kkr.reshape(1,filts2,rsh2,rsw2)).sum(axis=(1,2,3)) + b

    if testlabels[i] == np.argmax(x):

        checke[i] = 1

print(len(np.flatnonzero(checke==1))/tests)


As we continue to understand machine learning from scratch, we will inevitably have to deal with exploding gradients. This is when the gradients, and therefore the trained weights, grow into numbers the computer can no longer represent. It would appear that activation functions like sigmoid and tanh should prevent the problem, since they keep output values in a fixed range. Yet exploding gradients and numeric overflow can still occur, because these activation functions use exponentials and because matrix multiplication sums many products, so intermediate values can still become very large or very small. So on one hand, exponentials are great because of their differentiability, but they can be a big nuisance when numbers become too large, or even too small. In conclusion, activation functions can be useful but do not solve everything. In fact, if it is not necessary to normalize outputs to a certain range, then an activation function is not necessary.
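As a quick sketch of the overflow issue (not part of the original code; the values here are chosen only to trigger the problem), np.exp overflows to infinity for large inputs, so a softmax computed directly produces nan. A common workaround is to subtract the maximum before exponentiating, which leaves the softmax result mathematically unchanged:

import numpy as np

x = np.array([10.0, 900.0, 1000.0])

# naive softmax: np.exp(1000.0) overflows to inf, and inf/inf gives nan
naive = np.exp(x) / np.sum(np.exp(x))
print(naive)   # [ 0. nan nan] plus an overflow warning

# subtracting the max keeps every exponent <= 0, so nothing overflows
shifted = x - np.max(x)
stable = np.exp(shifted) / np.sum(np.exp(shifted))
print(stable)  # [~0.0, ~3.7e-44, ~1.0]

Here are some more activation functions from scratch on a simple neural network: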

import numpy as np

mnistlocation = "INSERT MNIST FILE LOCATION" #you can download the file here

# For example: "/Users/enrichmentcap/Downloads/mndata.npz"

trains, tests = 1000, 1000 #depends on processing speed; if your computer isn't fast, reduce the size of the training and test datasets

trainimages, trainlabels, testimages, testlabels = np.load(mnistlocation)['trainimages'][:trains] / 255, np.load(mnistlocation)['trainlabels'][:trains], np.load(mnistlocation)['testimages'][:tests] / 255, np.load(mnistlocation)['testlabels'][:tests]


##from tensorflow.keras.datasets import mnist #uncomment if using tensorflow library to retrieve dataset

##(trainimages, trainlabels), (testimages, testlabels) = mnist.load_data()

##trainimages, testimages = trainimages[:trains]/255, testimages[:tests]/255


classes = len(np.unique(trainlabels))

imw = trainimages.shape[2] 

imh = trainimages.shape[1]


lambd = 1.0507009873554804934193349852946

alpha = 1.6732632423543772848170429916717


def mlfun(actf, lr = .04, a = .1): #"a" is the slope for leaky RELU; setting a=0 makes it plain RELU (note: RELU is not useful for a 1-layer NN; with so few nodes, a=0 loses too much information)

    w = np.zeros((imw*imh, classes))

    b = np.zeros(classes)

    for i in range(trains):

        xx, label = trainimages[i].flatten(), trainlabels[i]

        x = np.dot(xx, w) + b #matrix multiplication

        af = {'none': x, 'sig': 1/(1+np.exp(-x)), 'tanh': (np.exp(x)-np.exp(-x))/(np.exp(x)+np.exp(-x)), 'relu': np.maximum(0, x), 'leakyrelu': np.maximum(a*x, x), 'selu': np.array([(lambd * alpha * (np.exp(v) - 1)) if v < 0 else lambd*v for v in x]), 'elu': np.array([a*(np.exp(v)-1) if v < 0 else v for v in x])}     #leaky RELU with a=1 is the same as 'none'

        rl = af[actf]

        y = np.exp(rl) / np.sum(np.exp(rl))

        dLdy = 1 / y[label]

        dydrl = -np.exp(rl)[label] * np.exp(rl) / np.sum(np.exp(rl))**2

        dydrl[label] = np.exp(rl)[label] * (np.sum(np.exp(rl)) - np.exp(rl)[label]) / np.sum(np.exp(rl))**2

        afd = {'none': 1, 'sig': rl*(1-rl), 'tanh': 1-rl**2, 'relu': np.array([0 if v <= 0 else 1 for v in x]), 'leakyrelu': np.array([a if v <= 0 else 1 for v in x]), 'selu': np.array([(lambd * alpha * np.exp(v)) if v < 0 else lambd for v in x]), 'elu': np.array([a*np.exp(v) if v < 0 else 1 for v in x])}

        drldx = afd[actf]

        dydx = dydrl * drldx

        dLdx = dLdy * dydx

        dxdw, dxdb = xx, 1

        dLdw, dLdb = np.dot(dxdw[np.newaxis].T, dLdx[np.newaxis]), dLdx * dxdb

        w , b = w + dLdw * lr, b + dLdb * lr


    checke = np.zeros(tests)               

    for i in range(tests): #testing derived weights and biases

        xx, label = testimages[i].flatten(), testlabels[i]

        x = np.dot(xx, w) + b

        af = {'none': x, 'sig': 1/(1+np.exp(-x)), 'tanh': (np.exp(x)-np.exp(-x))/(np.exp(x)+np.exp(-x)), 'relu': np.maximum(0, x), 'leakyrelu': np.maximum(a*x, x), 'selu': np.array([(lambd * alpha * (np.exp(v) - 1)) if v < 0 else lambd*v for v in x]), 'elu': np.array([a*(np.exp(v)-1) if v < 0 else v for v in x])}     #leaky RELU with a=1 is the same as 'none'

        rl = af[actf]

        y = np.exp(rl) / np.sum(np.exp(rl))

        if label == np.argmax(y):

            checke[i] = 1

    return len(np.flatnonzero(checke==1))/len(checke)


print(mlfun(actf = 'leakyrelu', lr = .01, a = .1)) #try any of the activation functions in the dictionary: 'none', 'sig', 'tanh', 'relu', 'leakyrelu', 'selu', 'elu'