FFT from Scratch

November 10, 2023


Now that we have covered various image classification machine learning tasks, we can work with audio data. This requires understanding the general differences between audio and image data. As we have seen in previous articles, image data is expressed spatially, with a color value for every pixel location. This makes sense given how a camera sensor works: it senses light at different spatial locations and records those values. A microphone, by contrast, senses sound waves electrically and records the signal through time. Uncompressed audio data is simply a series of voltage readings over time. Interestingly, this raw audio data is not yet fit for a machine learning classification task. It must first be transformed into data that resembles an image. One way to do this is through the Fourier Transform, which converts the raw audio into frequency information: every possible frequency is measured across the raw audio data. This can be done over the whole sample or, more commonly, over subsamples or windows of the data. Each frequency measurement is a dot product between the audio data and a sine wave of that frequency. In the Fourier Transform output, the x-axis displays the frequency, and the y-axis value represents the amplitude of that frequency present in the audio sample.

The FFT is brilliant for converting data into frequency information. In fact, applying this sine-and-cosine transform to its own output brings the frequency information back into the original time domain. This is useful because it allows one to manipulate the frequencies in the transform output and then convert them back into raw audio data. Though I may be oversimplifying a complex process, I will demonstrate that my take on FFTs and inverse FFTs is intuitive and useful in practice. The code below demonstrates that the transform can be done from scratch with the sine and cosine functions. Equivalently, it can be done with a sine function and the same sine function offset by pi / 2, which is the cosine function. We can check that the code works because applying the transform a second time produces audio comparable to the raw audio. The dataset we will be using is the audio version of MNIST, called the "spoken_digit" dataset. Here is the plot of the sample raw audio data we are using in the code below.

Below is the image of an FFT performed on the raw audio data using the code below.

import librosa
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd

import tensorflow as tf
import tensorflow_datasets as tfds

# !pip install pydub  # may be a needed dependency in a Jupyter notebook such as Colab

train = tfds.load('spoken_digit', split='train[10%:]')
# val = tfds.load('spoken_digit', split='train[:10%]')  # test set according to the documentation

sr = 8000  # the spoken_digit recordings are sampled at 8 kHz

sample, = train.filter(lambda x: x['audio/filename'] == '0_theo_0.wav').take(1)
x = sample['audio'].numpy()

def fftmanual(x):
  fft = np.zeros(len(x))
  for i in range(len(fft)):
    # dot products with a sine and a cosine wave of frequency i
    sine = x * np.sin(np.linspace(0, i * 2 * np.pi, len(x)))  # loses a little information, likely due to rounding error
    cos = x * np.sin(np.linspace(np.pi / 2, i * 2 * np.pi + np.pi / 2, len(x)))  # the sine offset by pi/2, i.e. np.cos(np.linspace(0, i * 2 * np.pi, len(x)))
    fft[i] = (sine + cos).sum()
  return fft

def efftmanual(x):
  # the same transform via Euler's formula; e^(i*2*pi*f) loses less to rounding than separate sin and cos
  fft = np.zeros(len(x), dtype=complex)
  for i in range(len(fft)):
    fft[i] = (np.exp(2j * np.pi * np.linspace(0, i, len(x))) * x).sum()
  return fft

test = fftmanual(x)
plt.plot(fftmanual(test))
ipd.Audio(fftmanual(test), rate=sr)

# test = efftmanual(x)  # try this too
# plt.plot(efftmanual(test).real)
# ipd.Audio(efftmanual(test).real, rate=sr)

The above inverse FFT is slightly different from the original, but the audio playback is indistinguishable.
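Why does applying the same transform twice undo itself? Summing a sine and cosine dot product at every frequency is essentially the discrete Hartley transform, which is its own inverse up to a factor of n. A minimal sketch of that property, using the standard DFT grid via np.arange (a choice of mine; the article's np.linspace convention includes the endpoint, which makes its round trip only approximate):

```python
import numpy as np

def cas_transform(x):
    # dot products with cas(2*pi*k*n/N) = cos(...) + sin(...), the Hartley kernel
    n = np.arange(len(x))
    return np.array([(x * (np.cos(2 * np.pi * k * n / len(x)) +
                           np.sin(2 * np.pi * k * n / len(x)))).sum() for k in n])

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
roundtrip = cas_transform(cas_transform(x)) / len(x)  # transform applied twice, rescaled by n
print(np.allclose(roundtrip, x))  # True: the transform is its own inverse up to a factor of n
```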

In case you're curious why the code was written this way, there are two nuances that are important to understand in the Fourier Transform process. The first is mirrored frequencies. For example, say our raw audio data is a vector of 3600 voltage readings. We first take the dot product of the raw audio vector and a sine wave of frequency 1 across 3600 points. Then we do the same thing with a sine wave of frequency 2, then 3, and so on until we reach a frequency of 3600 across 3600 points. Yet a complete sine wave is a complete cycle: from 0 up to 1, back to 0, down to -1, and back to 0, which takes at least five points along a vector. How can a complete cycle like that occur at every single point? It can't, which leads us to mirrored frequencies and the Nyquist theorem, and it is why only half of the Fourier Transform output contains the information we need for machine learning.

To illustrate, we can create an audio sample of a sine wave of frequency 1 sampled over 20 points. When we apply our transform to this sample, the output has only two mirrored non-zero values, at a frequency of 1 and a frequency of 18. In this example, 0 is the mirrored pair of 19, 1 is the mirrored pair of 18, 2 is the mirrored pair of 17, and so on. If the number of points is "n" and we start counting at 0, each frequency "i" has a corresponding pair at "(n - 1) - i".
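The (n - 1) - i pairing can be checked directly. Under the np.linspace convention used throughout this article (endpoint included), frequency 18 over 20 points lands exactly on the negative of frequency 1:

```python
import numpy as np

n = 20
wave = np.sin(np.linspace(0, 1 * 2 * np.pi, n))     # frequency 1 over 20 points
mirror = np.sin(np.linspace(0, 18 * 2 * np.pi, n))  # its pair: (n - 1) - 1 = 18
print(np.allclose(mirror, -wave))  # True: the mirror traces the same wave flipped across the x-axis
```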

This can be demonstrated further by increasing the resolution of the sine wave to 640 points. The audio sample looks generally the same, only smoother, and at this higher resolution the mirror frequency of 18 shows its full 18 oscillations. This is just an illustration; FFTs aren't taken at resolutions higher than the number of sample points of the input. So when we bring a mirrored pair back to a resolution of 20 points, which emulates what an FFT would do in practice, we see that the mirror frequency traces the same wave as the audio sample, just flipped across the x-axis. See the code and output images below.

import numpy as np
import matplotlib.pyplot as plt

# a sine wave of frequency 1 sampled over 20 points
t = np.sin(np.linspace(0, 1 * 2 * np.pi, 20))
plt.xticks(np.arange(20))
plt.plot(t)
plt.show()

# its transform: two mirrored non-zero values, at frequencies 1 and 18
test = fftmanual(t)
plt.xticks(np.arange(20))
plt.plot(test)
plt.show()

# the same pair of frequencies at a higher resolution of 640 points
t = np.sin(np.linspace(0, 1 * 2 * np.pi, 640))
ttt = np.sin(np.linspace(0, 18 * 2 * np.pi, 640))
plt.xticks(range(0, 64 * (10 + 1), 64))
plt.plot(t, color='orange')
plt.plot(ttt, color='green')
plt.show()

# a mirrored pair (2 and 17) back at the original 20-point resolution
t = np.sin(np.linspace(0, 2 * 2 * np.pi, 20))
ttt = np.sin(np.linspace(0, 17 * 2 * np.pi, 20))
plt.xticks(np.arange(20))
plt.plot(t, color='orange')
plt.plot(ttt, color='green')
plt.show()

The second nuance is that a single sine wave function is not enough to convert audio data into frequency data. For example, say the raw audio data is simply a cosine wave of frequency 1 across 3600 points. The dot product of this audio data and a sine wave of frequency 1 across 3600 points results in an amplitude of 0, as seen in the image below. An amplitude of 0 at frequency 1 would wrongly indicate that this frequency is not present in the audio data.

The solution is to aggregate a sine and a cosine wave for every possible frequency in the raw audio data. This ensures that no frequency gets canceled out.
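A quick numeric check of that cancellation, using the 3600-point example (with frequencies placed on a 2*pi*t/n grid, a choice of mine):

```python
import numpy as np

n = 3600
t = np.arange(n)
audio = np.cos(2 * np.pi * t / n)   # raw audio: a pure cosine of frequency 1
sine = np.sin(2 * np.pi * t / n)
cosine = np.cos(2 * np.pi * t / n)

print((audio * sine).sum())             # ~0: the sine alone misses the cosine entirely
print((audio * (sine + cosine)).sum())  # ~1800 (= n / 2): adding the cosine recovers the amplitude
```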

Aggregating sine and cosine waves across every possible frequency gives a transform that works. The Fast Fourier Transform (FFT) uses a mathematical identity to accomplish the same thing. You may remember your calculus teacher glossing over the series expansions of e^x, cos(x), and sin(x). It turns out that, using the imaginary unit i, those expansions prove:

 e^(i*x) = cos(x) + i*sin(x)

Therefore, instead of using cos(x) + sin(x) in our original transform, we can use e^(i*x). Conveniently, Python tracks the real and imaginary parts of a complex number separately. When plotting, the imaginary parts are generally discarded and we are left with the useful real numbers. In conclusion, the benefit of using Euler's formula in FFTs is that it is cheaper to compute and less liable to the rounding errors that accumulate in the manual sine-and-cosine transform we did. The cost is that we have to deal with imaginary numbers, which Python makes much less complex.
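Both claims are easy to verify in NumPy: Euler's formula holds numerically, and the real and imaginary parts of a complex array sit in separate views.

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 9)
print(np.allclose(np.exp(1j * x), np.cos(x) + 1j * np.sin(x)))  # True: e^(i*x) = cos(x) + i*sin(x)

z = np.exp(1j * x)
print(z.real)  # the cosine values, which Python keeps separate from...
print(z.imag)  # ...the sine values
```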