Neural Networks From Scratch
May 8, 2023
Understand Neural Networks and the math behind Machine Learning (ML) and Artificial Intelligence (AI) simply using python (code posted below). You can download the mnist dataset ("mndata.npz") below if you would like to make the code as simple as possible. For this, we only need the Numpy library from python. Numpy is my personal favorite for intuitively manipulating matrices, also known as arrays.

ML and AI use neural networks as the underlying architecture to solve real world problems. Let’s examine the technology by building a neural network from scratch. These networks incorporate calculus and matrix multiplication so brace yourselves.
The first step is to retrieve the data that will be used to train a model. We’re going to start with the classic mnist dataset. The machine learning model will learn to classify the inputs (images of handwritten digits) and produce a desired output (the predicted digit). Retrieving and organizing the data can be quite time consuming, so we’re going to expedite this process as much as possible. I’ve uploaded the data into numpy arrays so you can download them in an organized way with a one-liner (if you’re more comfortable using an additional python library to access the data, I’ve got you covered; skip down to the tensorflow addendum).
The mnist dataset consists of a set of 60,000 training images and labels and another set of 10,000 testing images and labels. Each grayscale image is a 28x28 array of numbers between 0-255, where the number represents the brightness of the pixel (0 is black and 255 is white for grayscale images). It is conventional to normalize the images to values between 0-1 by dividing every value of the array by 255, the maximum color value. The practical reason is that unnormalized values can produce numbers that are too large for python’s exponential function to compute later on.
Once you download the above numpy “mndata.npz” dataset, copy the file location of the dataset and paste it into the np.load() command.
import numpy as np
mnistlocation = "INSERT MNIST FILE LOCATION"
# For example: "/Users/enrichmentcap/Downloads/mndata.npz"
trainimages, trainlabels, testimages, testlabels = (
    np.load(mnistlocation)["trainimages"] / 255,
    np.load(mnistlocation)["trainlabels"],
    np.load(mnistlocation)["testimages"] / 255,
    np.load(mnistlocation)["testlabels"],
)
There are many ways to get this dataset in python. Here’s another way which requires the tensorflow library to be installed:
##import requests
##requests.packages.urllib3.disable_warnings()
##import ssl
##try:
## _create_unverified_https_context = ssl._create_unverified_context
##except AttributeError:
## # Legacy Python that doesn't verify HTTPS certificates by default
## pass
##else:
## # Handle target environment that doesn't support HTTPS verification
## ssl._create_default_https_context = _create_unverified_https_context
##
### uncomment the above code if you get a certificate error on mac. No issue for windows.
## The following code does not work for Python 3.7 on Windows, but does work on Python 3.8+ on Windows and Mac; it also works on Google Colab (and perhaps Jupyter notebooks too). If you're a Linux user, I know you can figure it out on your own.
from tensorflow.keras.datasets import mnist
(trainimages, trainlabels), (testimages, testlabels) = mnist.load_data()
It’s crucial to understand the format, content, and shape of the loaded dataset arrays. First, let’s start with the difference between scalars, lists or vectors, and arrays or matrices as used in python. A scalar is just a single number all by itself. A vector, or list as it’s called in python, is one series or string of numbers. A matrix, or array as it’s called in the python numpy library, is a multidimensional series of numbers; for neural networks the terms can be used interchangeably. In mathematics, these sets of numbers or values are analogous to a point, a line, and a traditional 3D object. In the physical world, 0D is a singular point, 1D is a line, 2D is a flat plane, 3D is space frozen in time, 4D is the dimension we live in, which is space moving through time, and 5D is where comic books live.
In the python world, it is often required that an object be a scalar, list, or array. It is important to note that a python numpy array object can hold a single value or one series of values. This is often important for compatibility reasons when combining or doing mathematical operations among scalars, lists, and arrays, which may only compute if they are all converted into specific compatible formats. This can get really complicated when using tuples vs lists and tensors vs arrays.
The shape of the training images array is [60000,28,28]. This means that there are 60000 different 28 x 28 pixel images. The first element in the array’s shape is the total number of images. Similarly, the shape of the testing images array is [10000,28,28]. For these grayscale images, the last element in the shape signifies columns and the second to last element signifies rows. The first column represents the leftmost pixel column and the last column represents the rightmost column.
The first row represents the uppermost pixel row of the image and the last row represents the bottom pixel row. I know it can be annoying because it may feel backwards and it’s unlike graphing x,y coordinates in traditional math classes, but you get used to it once you start working with arrays and tensors of multiple dimensions. While we’re talking about things you eventually get used to, did I mention python uses base 0 indexing, meaning the first value in a vector or array is actually the 0th value?
To examine how these numerical pixel values represent a handwritten digit, we must understand indexing, subscripting, and slicing, which use brackets “[]” and colons “:” to specify which rows, columns, or dimensions we are referring to. For example, to see what the first image in the mnist dataset looks like, we could enter “trainimages[0]”. This subscripts into the trainimages array and slices out the first or 0th 28x28 image. A single number inside the brackets indexes into the first or leftmost element of the array’s shape, which is [60000, 28, 28]. To access the last image in the dataset, we would enter “trainimages[-1]”, and “trainimages[-2]” for the second to last image. Keep in mind “trainimages[-0]” is the same as “trainimages[0]”.
Similarly, “trainimages[0, :, :]” or “trainimages[0, ...]” can be entered and accomplishes the same thing; the single colon “:” means to slice and return all values in that particular row, column, or dimension. The three values inside the bracket separated by commas correspond to the specific dimensions being subscripted or called. The “...” must still be separated by a comma “,” and means to return all values of all dimensions following, or even preceding, the specified dimension, depending on where it is placed in the subscripting brackets.
If we want to refer to the fourth and fifth images, their 7 bottom rows, and all columns up to but excluding the rightmost column, we would enter “trainimages[3:5, -7:, :-1]”. The python index number on the left of the colon is included in the slice, and the python index number on the right of the colon is not included in the slice. A number followed by a single colon means to start retrieving at that number and return all following values. A single colon followed by a python index number retrieves up to but not including that python index number.
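To make these rules concrete, here is a short sketch of what these slices return, assuming trainimages was loaded as shown above (the shapes are what matter here):
# assuming trainimages was loaded as above, with shape (60000, 28, 28)
print(trainimages[0].shape)              # (28, 28): the first image
print(trainimages[-1].shape)             # (28, 28): the last image
print(trainimages[0, :, :].shape)        # (28, 28): same as trainimages[0]
print(trainimages[0, ...].shape)         # (28, 28): "..." fills in the remaining dimensions
print(trainimages[3:5, -7:, :-1].shape)  # (2, 7, 27): images 4-5, bottom 7 rows, all but the rightmost column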
Now’s a great time to examine the images and get a general intuition of how the pixel dimensions work. The numpy library outputs arrays at a default line width of 75 characters, so the arrays don’t display as informatively as they would in a spreadsheet. To fix this, set the numpy array print options to a value that can accommodate all the columns in a single output line using the following code:
“np.set_printoptions(linewidth=200)”
The value can be changed according to arrays with more columns or values of extended character length.
The python index number of the training and testing images corresponds to the python index number of the correct training and testing labels. Hence, there are 60000 correct labels for the 60000 training images and 10000 correct labels for the 10000 testing images. The shape of the training label array is [60000,] and the shape of the testing label array is [10000,]. This means that this numpy array object is a one dimensional vector, without what could be referred to as rows or columns. In certain use cases, this can produce different results than a [1,60000] shape array or [60000,1] shape array, which are two dimensional numpy arrays consisting of rows and columns. Sometimes numpy will automatically reconcile inconsistencies between operations on objects of different shapes and accurately infer what the user is trying to do.
The number of label classifications, or classes, is 10: one for each digit in the traditional numbering system. We can convert the individual labels from scalars to one-hot encodings, which drastically helps in understanding how neural networks classify inputs. Each label consists of 10 values, one for each digit classification, with each classification represented by its python index number. A value of 1 at a specific python index number indicates the correct label; a value of 0 indicates an incorrect label for the particular image. Here are the one-hot encoded labels for each digit (a short code sketch follows the list):
0: [1,0,0,0,0,0,0,0,0,0]
1: [0,1,0,0,0,0,0,0,0,0]
2: [0,0,1,0,0,0,0,0,0,0]
3: [0,0,0,1,0,0,0,0,0,0]
4: [0,0,0,0,1,0,0,0,0,0]
5: [0,0,0,0,0,1,0,0,0,0]
6: [0,0,0,0,0,0,1,0,0,0]
7: [0,0,0,0,0,0,0,1,0,0]
8: [0,0,0,0,0,0,0,0,1,0]
9: [0,0,0,0,0,0,0,0,0,1]
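Here is a minimal sketch of that conversion in numpy; it mirrors the label lines inside the training loop of the full code at the end of the article (trainlabels is assumed to be loaded as shown above):
import numpy as np
classifications = 10
label = np.zeros(classifications)      # start with ten zeros
label[trainlabels[0]] = 1              # set a 1 at the index of the correct digit
print(label)                           # e.g. if trainlabels[0] is 5: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]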
Our neural network will transform the 28x28 image into 10 values that can be compared against the 10x1 one-hot encoded label. This can be accomplished by use of matrix multiplication. In this case, we will be using one simple layer of weights and biases.
In order to accomplish this matrix multiplication, we must first flatten the image. This means that we will convert the 28 x 28 image into a vector of 784 pixel values using numpy’s flatten() function. This is the input layer. Starting with the first or 0th index row, the flatten() function concatenates the following 27 rows in order, creating one row of 784 pixels, which is equivalent to the product of the row and column lengths: 28 * 28 = 784. Python treats a flattened 1D vector like an array of 1 row; in this case, effectively [1, 784]. In computational programming, if the array shapes aren’t aligned according to the default specifications, the operation will not compute. So when we attempt to create neural networks from scratch, we must keep track of the shapes we are performing mathematical transformations on.
Remember how we were talking about dimensions. In python, 0 dimensions is a scalar, 1 dimension is a vector, 2 dimensions is a flat grid, 3 dimensions is 3D and so on. However, when we look at numpy array shapes using “trainimages.shape”, it shows 3 dimensions with a shape of: [60000,28,28]. The 0th dimension is of size 60000, and the 1st and 2nd are of size 28. In the case of the flattened image [784,] the 0th dimension has a size of 784. When we enter it into the numpy matrix multiplication functions, it automatically registers as a [1,784] shape, where the 0th dimension is of size 1 and the 1st dimension is 784, according to python index numbering.
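A quick sketch of how these shapes behave when flattening and reshaping, assuming trainimages is loaded as shown above:
print(trainimages.shape)        # (60000, 28, 28)
image = trainimages[0]          # shape (28, 28)
flat = image.flatten()          # shape (784,): the 28 rows concatenated in order
print(flat.shape)
row = flat.reshape(1, 784)      # explicitly one row of 784 columns
print(row.shape)                # (1, 784)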
In order to transform the input image into a 10x1 one-hot encoded predicted label, we must create an arbitrary array filled with weights of shape [10,28,28] or [classifications, imageheight, imagewidth]. The reason the values can be somewhat arbitrary is because each of the values in the weights will be independently adjusted and fine tuned. Since there is only one training layer in this model, these values can all be zero using “numpy.zeros()”; normally weights are composed of random numbers using a random number generator. Keep in mind that python shape, size, or length numbering indicates the actual number of values, not its index value. The numpy “dot” and “matmul” functions can be used interchangeably for inputs of two dimensions each. Matrix multiplication or dot products work by taking a row of one input and multiplying it by a column of the other, adding all of those products, and assigning that value to the corresponding output row and column. This is why the length of the first input’s rows (its number of columns) must equal the length of the second input’s columns (its number of rows).
The first row of the image, consisting of 784 pixel values, will be multiplied by the first column of the weights, which also consists of 784 values. Specifically, this means that the corresponding 0th values of the row and column will be multiplied, then the corresponding 1st values, and so on for all values in the matching row and column. The 784 products of the 0th row of the first input and the 0th column of the other input will be summed into 1 value (a sum of products); this value will be assigned to the 0th row and 0th column of the output array. This mathematical operation between one row and one column, or two vectors, is called a dot product. Performing this across a series of rows and columns of an array or matrix is matrix multiplication. Since there is only 1 row for the image, this same 0th row of 784 values will be multiplied by the next or 1st column of the weights array, also consisting of 784 values, and then summed to create 1 value assigned to the 0th row and 1st column of the matrix multiplication output. This process occurs 10 times, one for each column corresponding to the number of digit classifications. The output will have a shape of [1,10], which python may automatically convert into a [10,] shape 1D vector. A simplified way to approach the matrix shape complexity is to think about it as an inner product. You pair the two array shapes side-by-side in a specific order, [1,784] and [784,10], such that if the inner dimensions of the two inputs correspond (784 and 784), they can be matrix multiplied and the resulting shape will be the outer dimensions (1 and 10). When using “dot” or “matmul” operations in python, it will not compute if the inputs are not ordered in the correct way.
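As a minimal sketch of this shape rule, the array flatweights below is a hypothetical stand-in for a [784,10] weights array (not the [10,28,28] weights used later in the article):
import numpy as np
flatimage = trainimages[0].flatten().reshape(1, 784)  # shape (1, 784)
flatweights = np.zeros((784, 10))                     # hypothetical weights of shape (784, 10)
scores = np.dot(flatimage, flatweights)               # inner dimensions match: 784 and 784
print(scores.shape)                                   # (1, 10): the outer dimensions
# np.dot(flatweights, flatimage) would fail: inner dimensions 10 and 1 do not match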
It is easiest to think of the shapes from right to left. The rows and columns make up the last two elements in the dimensions. For example, [60000,3, 28(rows or height),28(width or columns)] would mean there are 60000 colored images of 28 pixel height and 28 pixel width. Colored images are of three channels which correspond to RGB or red, green, and blue color values. Therefore, each image has three channels for color, and there are 60000 of these colored images. The dimensions can be in any order as long as they correspond to the correct mathematical matrix operations. Conceptually, it is simplest to structure these matrices from right to left.
As described above, the weights array of shape [10,28,28] represents 10 classifications of a 28 x 28 image. The input image [28,28] is multiplied by each of these 10 classifications of size [28,28]. Matrices of exactly the same size can be multiplied element by element, each value multiplied by the corresponding value in the other matrix, producing an output of the same shape. When this multiplication is completed, we sum all the values of the entire [28,28] output. This is done for each of the 10 classifications and produces 10 values or scores, one for each label classification. In python, in order to multiply the input image by the weights, we must ensure they have the same number of dimensions. We can do this by reshaping the input image into a [1,28,28] array using numpy’s “reshape” function. Afterwards, the input image of shape [1,28,28] and the weights of shape [10,28,28] can be multiplied even though they do not have the same shape. This multiplication creates an output of shape [10,28,28]. Python automatically matches similar dimensions and multiplies the dimension of size 1 item-wise or element-wise across the other matrix’s corresponding dimension; this is called broadcasting. Sometimes python will automatically infer this and the reshaping step is not necessary. To sum the values across each of the 10 classifications, we can use numpy’s “sum” function. This function is very useful in matrix operations because we can specify which dimensions or “axis” we would like to sum across. Once the multiplication between the image and the weights is complete, we sum across two axes because the image has two dimensions for rows and columns. A sum on axes of index numbers 1 and 2 across a matrix of shape [10,28,28] produces an output of shape [10,]. This process demonstrates how to do a dot product or matrix multiplication manually. As input dimensions increase, this can be helpful in keeping track of multiplications and summations, especially in the backpropagation process.
Each hidden layer of weights (1 layer in this simple neural network) comes with a set of biases, which are also trained and fine-tuned. However, instead of being matrix multiplied with the input, the biases are simply added to the output of the matrix multiplication, which is automatically converted to shape [10,]. These biases also have the shape [10,] and consist of arbitrary values, which can be zero or random values. Generally speaking, consistency across a model is good practice. Each element of the matrix multiplication output will be added to the corresponding element of the biases, producing a hidden layer output of shape [10,].
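Putting the element-wise multiplication, the axis sums, and the biases together, here is a sketch of the forward pass for a single image; it matches the hidden layer line in the full code at the end (trainimages is assumed to be loaded as above):
import numpy as np
classifications, imageheight, imagewidth = 10, 28, 28
weights = np.zeros((classifications, imageheight, imagewidth))  # shape (10, 28, 28)
biases = np.zeros(classifications)                              # shape (10,)
image = trainimages[0]                                          # shape (28, 28)
# reshape to (1, 28, 28) so it broadcasts across the 10 classifications,
# multiply element-wise, sum over the row and column axes, then add the biases
hiddenlayer = (image.reshape(1, imageheight, imagewidth) * weights).sum(axis=(1, 2)) + biases
print(hiddenlayer.shape)                                        # (10,)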
Interestingly, this looks quite similar to slope-intercept form, the equation for a line, typically written as y = m*x + b or y = [slope] * x + [y-intercept], where the y-intercept or b value orients a line in its correct position at a given slope value. We can think of the weights and biases as being trained to best predict or approximate a real world linear equation (output = weights * input + biases) that produces optimal classification results for this dataset. These equations are called first degree equations because the input variable “x” is of power 1, or x^1 or x**1 as written in python. First degree equations become advantageous when dealing with differentiation during the backpropagation step.
The array of weights, matrix multiplied with the input image and then added to the [10,] biases, forms a set of 7840 different first degree equations (one for each weight), where each group of 784 equations is summed and added to a bias. This summation can occur because they all take their inputs or “x” values from the same image. Machine learning attempts to find the optimal slope and bias for each of the 7840 first degree equations.
To be clear, an equation made up of optimal weights and biases for this classification task does in fact exist in the physical world, and the goal of machine learning is to arrive as close as possible to this equation by tuning weights and biases in a first degree equation. We can think about it as though we are trying to guess the equations of 7840 lines that exist in the real world. If all the equations in our model had no b value or were fixed at b = 0, then we would be assuming that all 7840 lines cross the y-axis at 0, which in the real world would be quite improbable. Adding the biases allows the model to better approximate the lines or equations we are “guessing” while still allowing the possibility of b = 0 in the equations. It should be noted that the standard use of first degree equations in machine learning works, but may be somewhat arbitrary. In 3D rendering applications, any shape can be rendered by a series of triangles, which are composed of lines. Perhaps the more weights that are used, the more equations of lines there are to better approximate the optimal set of equations for the task. It may well be that the optimal set of equations for predicting outputs on a given task is not even a first degree equation.
Once we complete the matrix multiplication and add in the bias values to each of the 10 classifications, we must normalize the output of the hidden layer. The labels are one-hot encoded and consist of values 0 or 1. Therefore, we must normalize the predictions to values between 0 and 1. This can be done through various normalization functions which are generally called activation functions. A popular activation function to use in this classification task is the softmax function, which is written as “np.exp(x) / np.sum(np.exp(x))” in python. The first advantage of using the exponential function of base “e” is that it converts all the previous layer outputs of shape [10,] to positive numbers. By dividing each of these values by the sum of all values, we get a probability distribution of what the model believes the input to be. If the exponential function was not applied in our model, then this would create negative values in the probability distribution if the outputs of the hidden layer were also negative. If this occurred, the output values may not fall between 0 and 1, which would be detrimental to our loss function and ensuing backpropagation.
If we started with all weight and bias values set to 0, the initial hidden layer output of shape [10,] would all be 0. When we apply the softmax function to this hidden layer, the output will be of shape [10,] with all values of 0.1. This means that the model predicts that all classes have the same probability of being the correct label. This makes sense because the model has not been trained yet. As with any probability distribution, the sum of the values of the predicted probabilities will always equal 1. A value towards 1 for a classification in the output layer signifies that the model predicts a high probability of that classification being the correct label. A value towards 0 for a classification signifies a low probability of that classification being correct.
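A short sketch of the softmax applied to the untrained, all-zero hidden layer:
import numpy as np
hiddenlayer = np.zeros(10)    # untrained model: every classification score is 0
softmax = np.exp(hiddenlayer) / np.sum(np.exp(hiddenlayer))
print(softmax)                # ten values of 0.1: a uniform probability distribution
print(softmax.sum())          # 1.0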
So far we have created a model that can predict an output. However, we must be able to train the model and make its predictions more and more accurate. We start by defining a loss function. For this article, we will be using squared-error loss, which is the most intuitive. This is simply the difference between the one-hot encoded label of shape [10,] and the output layer predictions of shape [10,], with these differences squared, making the loss values of shape [10,] all positive. Once again, the benefit of the squared-error loss function is the property of its derivative and the resulting gradients for backpropagation. It is important to note that for this neural network, all 10 probability predictions and corresponding loss values are significant. The model will be trained such that for a given input image, the model will produce an output where all incorrect labels converge to 0 and the correct label converges to 1. Using a loss function that accommodates this allows the use of normalizations or activations other than the softmax function. In this case, the softmax function produces a probability distribution which sums to a value of 1; therefore, as long as the correct label converges to a value of 1, the incorrect labels will in aggregate automatically converge to 0. However, not all activations create a probability distribution that sums to a value of 1. When it comes to tasks with more real-world applicability, like object detection, a probability distribution in the output layer will not be useful. Therefore, the loss is structured in such a way that we can substitute the softmax function with another activation function like the sigmoid function, which also normalizes the input to values between 0 and 1, yet without the properties of a probability distribution.
Popular loss functions are often referred to as “mean” or “sum” squared error. However, when computing from scratch, if we combine these loss values into a single value through a mean or sum, we would lose class specific gradients in the backpropagation process. This is because each of the loss values is mathematically independent of the others. To combine them into a single sum or mean term as a final loss value to be backpropagated would be mathematically inaccurate. The loss values used for backpropagation must have the same shape as the output layer in order to adjust the 10 classification specific weight and bias values according to the corresponding 10 loss values. Keep in mind the weights are organized into 10 independent classifications (columns) of 784 rows.
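As a sketch, here is the per-class squared error for the untrained model’s uniform prediction against a hypothetical label for the digit 5; note the loss keeps its [10,] shape instead of being reduced to a single mean or sum:
import numpy as np
label = np.zeros(10)
label[5] = 1                      # hypothetical one-hot label for the digit 5
softmax = np.full(10, 0.1)        # the untrained model's uniform prediction
loss = (label - softmax) ** 2     # shape (10,): one loss value per classification
print(loss)                       # 0.81 for the correct class, 0.01 for each incorrect class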
The training process requires a basic understanding of calculus and differential equations, as the derivatives of each of the equations used for model prediction must be determined. The general idea for optimization or minimization is to find a combination of weights and biases that produces slopes or gradients of 0 for all 10 of the loss values. A slope or gradient of 0 means that the loss has been minimized or optimized. This is accomplished by subtracting a fraction of the derivative of the loss with respect to the weights and biases from the current weights and biases. The way we find the derivative of the loss with respect to the weights and biases is through the chain rule. If we think about it in terms of algebra it can be pretty simple. Let’s say we have the equation: b( c(x) ). The derivative of the function “b” with respect to “x” (which we will label “db/dx”) is the derivative of “b” with respect to “c(x)” (which we will label “db/dc”) multiplied by the derivative of “c” with respect to “x” (which we will label “dc/dx”). Therefore, db/dx = (db/dc) * (dc/dx), which makes perfect sense in algebraic terms. This is exactly what occurs in a neural network. In this case the equation is:
Loss(image, weights, biases) = ( label - softmax( hiddenlayer(image, weights, biases) ) )**2
The hidden layer has three inputs: the image, the weights, and the biases. The derivatives of the loss with respect to the input image, weights, and biases are:
dLoss/dimage = (dLoss/dsoftmax) * (dsoftmax/dhiddenlayer) * (dhiddenlayer/dimage)
dLoss/dweights = (dLoss/dsoftmax) * (dsoftmax/dhiddenlayer) * (dhiddenlayer/dweights)
dLoss/dbiases = (dLoss/dsoftmax) * (dsoftmax/dhiddenlayer) * (dhiddenlayer/dbiases)
The derivative of the loss function ( (label - softmax)**2 ) with respect to the softmax function is:
dLoss / dsoftmax = -1 * 2 * (label - softmax)
The softmax derivative is derived using the quotient rule. Unlike the loss function, the softmax does involve a sum over all the hidden layer values, since each exponentiated value is divided by the sum of all the exponentiated values. Even so, the softmax layer maintains the same shape as the hidden layer. To understand what is occurring in this softmax derivative applied to the 10 values of the hidden layer, we can use an example of three values (a, b, c) passed through the softmax function:
Softmax(a) = exp(a) / ( exp(a) + exp(b) + exp(c) )
Softmax(b) = exp(b) / ( exp(a) + exp(b) + exp(c) )
Softmax(c) = exp(c) / ( exp(a) + exp(b) + exp(c) )
The quotient rule requires the derivative of the numerator and the denominator. In this example, the denominators are all the same:
Denominator = ( exp(a) + exp(b) + exp(c) ).
However, the values of the derivatives of this denominator will be different according to which variable (a, b, c) we are determining the derivative. The derivative of exp(x) equates to exp(x), which can keep things simple when it comes to computing. Therefore, the derivative of the denominator with respect to each variable is:
dDenominator/da = exp(a) + 0 + 0 = exp(a)
dDenominator/db = 0 + exp(b) + 0 = exp(b)
dDenominator/dc = 0 + 0 + exp(c) = exp(c)
When taking a derivative with “respect” to a specified variable, we treat the other input variables as constants. When taking the derivative of the denominator with respect to “a”, the terms not containing “a” are treated as constants, and the derivative of a constant is always 0, so those terms drop out. Applying the quotient rule to Softmax(a) with respect to “a” (the derivative of the numerator times the denominator, minus the numerator times the derivative of the denominator, all divided by the denominator squared) gives:
dSoftmax(a)/da = ( exp(a) * (exp(a) + exp(b) + exp(c)) - exp(a) * exp(a) ) / ( exp(a) + exp(b) + exp(c) )**2
Written in python, the derivative of the softmax function with respect to the hidden layer is:
dsoftmax/dhiddenlayer = (np.exp(hiddenlayer) * np.sum(np.exp(hiddenlayer)) - np.exp(hiddenlayer) * np.exp(hiddenlayer)) / (np.sum(np.exp(hiddenlayer)) * np.sum(np.exp(hiddenlayer)))
It should be noted that popular classification loss functions, like cross-entropy loss, and their corresponding derivatives are much less intuitive. They treat the correct label classification mathematically differently from the incorrect label classifications and can produce significantly better results. For tasks with more real-world applicability, however, these classification-focused losses can be inferior and even useless. When it comes to research, development, and practicality, it is always better to use the model that produces the best results for a given task, regardless of a model’s intuitiveness. However, the intent of this article is focused more on intuition and insights for further research and development. This is why the code below is written for readability and mathematical intuition.
The derivative of the loss with respect to the input image (“dloss/dimage”) and also “dhiddenlayer/dimage” are unnecessary for training or backpropagation because the image will not be trained or adjusted. However, in models with multiple hidden layers, derivatives of successive layers and each of their inputs will need to be calculated. The hidden layer consisting of 3 inputs or variables can be written as hiddenlayer = weights * image + biases. The derivatives of the inputs are as follows:
dhiddenlayer /dimage = weights + 0 = weights
dhiddenlayer /dweights = image + 0 = image
dhiddenlayer /dbiases = 0 + 1 = 1
The derivative of a constant multiplied by an input variable (constant * x) equals the constant. Similar to the softmax denominator, when taking the derivative with respect to a specified variable in a function, the other variables or inputs in the function are treated as constants, whose derivatives are 0. Therefore,
dloss/dweights[10,28,28] = dloss/dsoftmax[10,] * dsoftmax/dhiddenlayer[10,1,1] * dhiddenlayer/dweights[1,28,28]
dloss/dbiases[10,] = dloss/dsoftmax[10,] * dsoftmax/dhiddenlayer[10,] * dhiddenlayer/dbiases[10,]
Since the “dloss/dweights” array must be subtracted from the weights array in order to fine tune or train the weights, it must contain the same shape [10,28,28]. This is accomplished by first multiplying “dloss/dsoftmax” of shape [10,] and “dsoftmax/dhiddenlayer” of shape [10,] which produces “dloss/dhiddenlayer” of shape [10,].
To complete the chain rule, we must multiply “dloss/dhiddenlayer” of shape [10,] with “dhiddenlayer/dweights” of shape [28,28]. In order to end up with the same shape as the weights when conducting this multiplication, we must first convert the inputs into arrays with the same number of dimensions. We can easily reshape “dhiddenlayer/dweights” to shape [1,28,28] and “dloss/dhiddenlayer” to shape [10,1,1] using numpy’s reshape() function. We can then apply element-wise multiplication between the shapes [1,28,28] and [10,1,1] to produce “dloss/dweights” of shape [10,28,28]. Although none of the dimension sizes match, when the number of dimensions is the same, python automatically broadcasts the dimensions of size 1 across the corresponding dimensions of the other array. It can be thought of as multiplying 1 value across all 28x28 pixels (dhiddenlayer/dweights) and doing this process 10 times, resulting in a shape of [10,28,28].
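Here is a sketch of that broadcasting step; the gradient values are hypothetical placeholders, and trainimages is assumed to be loaded as above:
import numpy as np
dlossdhiddenlayer = np.full(10, 0.5)    # hypothetical [10,] loss-derived values
image = trainimages[0]                  # shape (28, 28); this is dhiddenlayer/dweights
dlossdweights = image.reshape(1, 28, 28) * dlossdhiddenlayer.reshape(10, 1, 1)
print(dlossdweights.shape)              # (10, 28, 28): the same shape as the weights
# each of the 10 slices is the input image scaled by that classification's loss-derived value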
Let’s examine what is actually occurring. First, the derivative of the loss values is multiplied by the derivative of the softmax layer. Each resulting value represents a certain magnitude of loss for each of the 10 classifications and the direction of the loss. A positive or negative value illustrates whether the predicted value is too high or too low relative to the corresponding one-hot encoded label classification. From an intuitive standpoint, the softmax layer derivative is a pass-through layer and simply connects the derivative of the loss to the weights. This is because the higher the value of the input into the softmax, the higher the value of the softmax output. Therefore, the numerical order or sorting of classification values before applying the softmax function is also maintained in the softmax output. The only difference between the softmax input and output is the ratios between the classification values. This only affects backpropagation or training through the magnitude and direction of the loss and its corresponding label classification. This value is then multiplied by “dhiddenlayer/dweights”, which is simply the input image. In essence, “dloss/dweights” is produced by performing item-wise multiplication of the same input image across all the classifications’ loss direction and magnitude derived values. A factor (called a learning rate) of the “dloss/dweights” array is then subtracted from the weights to adjust and train all values of the weights. The result is to increase the values of the active pixels of the image in the correct classification, and to decrease the values of the active pixels of the image in the other, incorrect classifications. Therefore, when the same image is passed through these updated weights, it will produce a higher prediction score for the correct classification and lower prediction scores for the incorrect classifications.
The derivative of “dloss/dbiases” is also accomplished by first multiplying “dloss/dsoftmax” of shape [10,] and “dsoftmax/dhiddenlayer” of shape [10,] which produces “dloss/dhiddenlayer” of shape [10,]. Since “dhiddenlayer/dbiases” = 1 for all [10,] classes, it can be easily multiplied into “dloss/dhiddenlayer” without any reshaping; if set to a scalar value of 1 of no size or shape, python will automatically multiply “1” itemwise across all values of “dloss/dhiddenlayer”. Intuition of backpropagation or training of “dloss/dbiases” is similar to “dloss/dweights”, but is specifically adjusted by a factor (learning rate) of “dloss/dhiddenlayer”. This means that the biases of shape [10,] are each adjusted based on the corresponding direction and magnitude of the classification specific loss values[10,].
In this simple example, the weights and biases are adjusted based on a technique called stochastic gradient descent. To understand this more, we can use the core of our loss function, which is y = x**2 as written in python. As previously mentioned, if we can somehow arrive at an x-position where the slope is 0, we can be reasonably confident that we have reached a minimum y value for the equation.
Suppose we are trying to arrive at an unknown minimum y value of this known function and we begin arbitrarily at x=2. The slope of the equation is dy/dx = 2 * x, since we already know the loss function and some basic calculus. Therefore, the slope at our starting point of x=2 is 4. Stochastic gradient descent simply takes a factor (learning rate) of this “dy/dx” value and subtracts it from our current position. The learning rate is a hyperparameter. Let’s say that we arbitrarily choose a learning rate of lr = .25. The formula for stochastic gradient descent is x = x - lr * (dy/dx). Therefore, our new x-position would be: 2 - .25 * (4) = 1. The next x-position would be 1 - .25 * (2) = .5. After many iterations, we would converge around x = 0, which is the location of the minimum y-value of the equation. The learning rate prevents overshooting past the minimum y value. The smaller the learning rate, the more closely the result converges to the minimum value, given a sufficient number of iterations.
Let’s say we started at x = -2 under the same conditions. In order to arrive at the minimum y-value, we would choose our next x-position based on x - lr * (dy/dx). The slope at x = -2 is -4, so the next x-position would be -2 - .25 * (-4) = -1, and the following x-position would be -1 - .25 * (-2) = -.5. This demonstrates that the value of the slope at any given point contains information that can assist in arriving at an optimization point through the magnitude of the slope and its direction (positive or negative). If dy/dx = 4 and our goal is to eventually converge to dy/dx = 0, we know that if we take a fraction of the current slope and subtract it from our current x-value, dy/dx should get closer to 0. By continuing this process over many iterations, we will eventually converge to a position reasonably close to dy/dx = 0.
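Here is a minimal sketch of these iterations in code, using the same y = x**2 example and a learning rate of .25:
x = 2.0                       # arbitrary starting position (try -2.0 as well)
lr = 0.25                     # learning rate
for step in range(10):
    dydx = 2 * x              # slope of y = x**2 at the current position
    x = x - lr * dydx         # stochastic gradient descent update
    print(step, x)            # x converges toward 0, where the slope is 0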
Since we knew the equation was y = x**2 in the previous example, it was inferred that we are dealing with a minimization problem. Alternatively, if we decided to arbitrarily convert the starting equation to a maximization problem by multiplying the function by “-1”, our function would become: y = -x**2. The user determines whether the final loss function is structured as a minimization or maximization problem. It turns out this negation of the same function does not change the x-position of optimization. However, instead of subtracting the gradient, the update for a maximization problem becomes: x = x + lr * (dy/dx). The negative sign must still be carried through when differentiating the negated function. Due to the nature of maximization problems, if we take a fraction of the current slope and add it to our current x-value, and continue this process over many iterations, we will eventually converge to a position reasonably close to dy/dx = 0.
The derivatives of the functions in our model represent slopes or gradients. The process of multiplying derivatives of all the functions according to the chain rule, in order to adjust and train the weights and biases is backpropagation.
Once we choose a value for our learning rate hyperparameter (lr = .01), we can carry out the training process by iterating over all the training images and labels in the training dataset. Once training is complete, we can test the weights and biases on the testing dataset.
Unlike the training process, which is done serially over the training images and labels, the testing process can be done in parallel over all the testing images and labels at the same time. Since the softmax function is similar to a pass-through layer, where the max hidden layer score will also be the max score in the softmax output layer, we can remove this layer in the testing phase. Additionally, since the training process is complete, there is no need for further backpropagation or calculation of the model functions’ derivatives. We simply flatten the testing images as done in the training process, matrix multiply all testing images by the same weight values, and add in the same bias values to arrive at a classification score. The classification with the highest score is the model’s label prediction for the testing image. Finally, we divide the number of correct model predictions by the total number of model predictions, which gives the model’s accuracy.
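The full code below does this with a map and lambda one-liner; as an alternative sketch, the same testing phase can be written with numpy’s tensordot so the multiply-and-sum happens for all 10000 testing images at once (weights, biases, testimages, and testlabels are assumed to exist as defined in the full code):
import numpy as np
# multiply each (28, 28) testing image by each (28, 28) classification of weights and sum,
# producing one score per classification per image: shape (10000, 10)
scores = np.tensordot(testimages, weights, axes=([1, 2], [1, 2])) + biases
predictions = np.argmax(scores, axis=1)       # highest-scoring classification per image
accuracy = np.mean(predictions == testlabels)
print(accuracy)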
This entire training and testing process can be accomplished from scratch in less than 30 lines of python code, while still illustrating the mathematics at the core of neural networks. Although it is not common to use a single hidden layer in a neural network, this is done to gain intuition about how a neural network can learn. Mainly, that active pixels or values in an input result in those same active values being backpropagated through the model according to the loss function across all classifications. A less sophisticated way of doing this would be to start with the same weight array, add a training image to the correct classification, and subtract the same image from all other classifications. Although this can work, it is not as effective.
An additional variation to this model would be to avoid normalizing the training and testing images to values between 0 and 1 and maintain the original values from 0 to 255. In practice, however, this produces numbers that are too large for python’s exponential function to compute. Also, the weights and biases can be any random values instead of all zeros. Zeros were used in this example so that the backpropagation process can be easily tracked. As stated earlier, the softmax derivative and loss function used in traditional models can produce slightly better results than this intuitive model. This demonstrates that there is an infinite number of loss functions and derivative techniques to be researched and tested. The metric for accuracy in the training layer is called “loss” because it measures the difference between the label and the predicted outputs. This can be converted into a maximization function, which would not change the testing accuracy results. Another variation would be to change the order of the training images. Currently, the order of training image classifications is effectively random. One could attempt to first train the model on all "zero" digits, then all "one" digits, then all "two" digits, and so on. Using the same parameters, this produces unfavorable results similar to randomly guessing the correct classification. One could also attempt to cycle through the digits in a consistent order to balance the learning through the weights and biases. This produces slightly better results than training through the default image order. This introduces the idea of batching the data and iterating over the same data in different batches across epochs. Additionally, one can also adjust the learning rate value and even move to more advanced formulas and methods than stochastic gradient descent.
In testing this model on custom created handwritten digits, if the handwritten digit is not centered, the model will have trouble classifying the custom input. The originators of the mnist dataset (see http://yann.lecun.com/exdb/mnist/) centered the images by computing their center of mass. It is also possible to center the images by the bounding box of the active pixels in an image for marginal improvements. Though this basic model reaches a testing accuracy of over 90%, it is significantly flawed. For example, the model will have trouble classifying custom digits if the line thickness and size of the digit significantly differ from the training dataset.
It is also possible to add an additional hidden layer. The hidden layer we used in the example is a fully connected layer, as each of the weights has a mathematical connection to each of the inputs of the next layer. Two consecutive fully connected layers would be ineffective because they function as a single fully connected layer: both layers can essentially be combined into one layer by matrix multiplying the weights of one layer into the weights of the next layer. It may be slightly different due to backpropagation and the implementation of biases. I've tested this and found the accuracy to be equivalent to or less than a single-layer neural network.
This simple neural network is trained based on the location of active pixels in the input image rather than the presence of certain shapes at certain locations. Both of these problems can be addressed by convolutional neural networks. CNNs filter over portions of the image using various shape filters. The filtered values then feed into the next layer, where combinations of those shapes can be detected. In two consecutive fully connected layers with no convolutions, no novel information about the image is added into the inputs of the following layers.
The reason for this basic introduction into neural networks is to start with a foundation in order to build. Once the foundations are set, we can try out new ideas and test whether they create significant results. This is how new machine learning methods are developed and shared with the world.
import numpy as np
mnistlocation = "INSERT MNIST FILE LOCATION"
# For example: "/Users/enrichmentcap/Downloads/mndata.npz"
trainimages, trainlabels, testimages, testlabels = (
    np.load(mnistlocation)["trainimages"] / 255, #division by 255 on image arrays to standardize pixel values between 0-1
    np.load(mnistlocation)["trainlabels"],
    np.load(mnistlocation)["testimages"] / 255,
    np.load(mnistlocation)["testlabels"],
)
##from tensorflow.keras.datasets import mnist #uncomment if using tensorflow library to retrieve dataset
##(trainimages, trainlabels), (testimages, testlabels) = mnist.load_data()
##trainimages, testimages = trainimages/255, testimages/255
classifications = np.unique(trainlabels).size
imagewidth = trainimages.shape[2]
imageheight = trainimages.shape[1]
learningrate = .01 #meant to be fine tuned
weights = np.zeros((classifications,imageheight,imagewidth))
biases = np.zeros(classifications)
for i in range(trainlabels.size): #cleanest, most intuitive NN; trains over the entire mnist training set for simplicity; naming style of d"layer"d"nextlayer" corresponds to calculus convention of writing derivatives
    image = trainimages[i]
    label = np.zeros(classifications)
    label[trainlabels[i]] = 1 #one-hot encode the correct label
    hiddenlayer = (image.reshape(1,imageheight,imagewidth) * weights).sum(axis=(1,2)) + biases
    softmax = np.exp(hiddenlayer) / np.sum(np.exp(hiddenlayer))
    dlossdsoftmax = -1 * 2 * (label - softmax) #L = (label - y) ** 2; L doesn't have to be a scalar; the point is that each L[i] converges to zero independently from other values of L through backpropagation
    dsoftmaxdhiddenlayer = np.exp(hiddenlayer) * (np.sum(np.exp(hiddenlayer)) - np.exp(hiddenlayer)) / np.sum(np.exp(hiddenlayer))**2 #see softmax derivation discussion
    dlossdhiddenlayer = dlossdsoftmax * dsoftmaxdhiddenlayer
    dhiddenlayerdweights = image
    dhiddenlayerdbiases = 1
    dlossdweights = dhiddenlayerdweights.reshape(1, imageheight, imagewidth) * dlossdhiddenlayer.reshape(classifications, 1, 1) #for each classification, dlossdweights is the input image multiplied by that classification's derived loss value
    dlossdbiases = dlossdhiddenlayer * dhiddenlayerdbiases
    weights = weights - dlossdweights * learningrate
    biases = biases - dlossdbiases * learningrate
accuracy = np.array( list(map(lambda testimage, testlabel : 1 if testlabel == np.argmax((testimage.reshape(1,imageheight,imagewidth) * weights).sum(axis=(1,2)) + biases) else 0, testimages, testlabels))) #one-liner
print(len(np.flatnonzero(accuracy==1))/len(accuracy)) #prints accuracy of trained neural network on testing dataset