Pretrained Object Detection

November 6, 2023

Compared to object detection, image classification is simple and straightforward, because object detection aims to classify objects and locate them within an image simultaneously. The shortcoming of image classification on datasets like MNIST is that it neither recognizes multiple objects in one image nor specifies exactly where those objects are. Object detection is where neural networks become much more practical and relevant, and in fact it can be quite easy to make a model that accomplishes object detection tasks.

The major pieces needed to create a model that can actually be trained for an object detection task are the following:

1) Encoding the label outputs for the input images

2) Choosing an effective loss function

3) Selecting a model that is pre-trained on image classification

The first question to answer is how to encode the label outputs, which contain information about where an object is located and what the object is. Thankfully, the first YOLO paper answers this question. For the classification task, the image is split evenly into grid sections, and each grid section carries one channel per class, giving a shape of (grid_size, grid_size, classes). In the encoded label, a value of 1 indicates that the center of an object of that class falls inside that grid section; a value of 0 indicates that no object center is present. Because of this encoding, only one class can be assigned to each grid section, but it is still possible to classify multiple objects in a single image.

For the object localization task, the image is also split evenly into grid sections, but tiled with 4 localization values: the bounding box coordinates. These values depend on the classification portion of the label, because they are only meaningful where the classification section contains a 1; otherwise they are ignored. In YOLOv1, the first value is the x-coordinate of the object center relative to its grid section, and the second is the y-coordinate of the object center relative to that grid section. Both are values between 0 and 1: (0,0) places the object center at the top-left corner of the grid cell, and (1,1) places it at the bottom-right corner. This follows the convention in object detection that coordinates start at the top-left corner of the image and end at the bottom-right corner. It is definitely not the Cartesian coordinate system we are used to from math class, but it is the convention our machine learning frameworks use; the same nuance applies to the PyTorch framework. The third value is the width and the fourth is the height of the bounding box, each as a ratio of the entire image width and height.
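
To make the encoding concrete, here is a minimal sketch of turning one ground-truth box into its grid-cell target (the variable names and the row/column indexing order are illustrative; the full encoding function used for training appears in the code below):

import numpy as np

grid_size, classes = 10, 20
label = np.zeros((grid_size, grid_size, classes + 4), dtype=np.float32)

# one ground-truth box, normalized to the image: class id and corner coordinates
cls, xmin, ymin, xmax, ymax = 7, 0.32, 0.48, 0.70, 0.96

x_center, y_center = (xmin + xmax) / 2, (ymin + ymax) / 2  # object center in image coordinates
col, row = int(grid_size * x_center), int(grid_size * y_center)  # grid cell that owns the center
x_cell = grid_size * x_center - col  # center offset within that cell, in [0, 1)
y_cell = grid_size * y_center - row
w, h = xmax - xmin, ymax - ymin  # width and height as ratios of the whole image

label[row, col, cls] = 1.0  # classification part of the target
label[row, col, classes:] = [x_cell, y_cell, w, h]  # localization part of the target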

In YOLOv1, constraining the xy-coordinates to a single grid cell limits how far off a predicted center can be. This acts as a safety measure for accuracy, and we implement it in the code below. It is also possible not to constrain the xy-coordinates to a grid cell and to allow the object center coordinates to fall anywhere in the image; we will do something very similar in the next article on training an object detection model from randomized weights.

The complete shape of the encoded label output is (grid_size, grid_size, classes + coordinates). This tells us enough to classify and locate multiple objects in an image. For the model to predict these values between 0 and 1, the last layer can use a sigmoid activation (in the code below, the sigmoid is applied inside the loss function and at prediction time rather than in the model itself). We will not be using a softmax activation because the outputs of a softmax can never all equal 0, which is exactly what the label looks like in a grid section with no object center.
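
A quick numeric illustration of that point (the logit values here are made up): a softmax over a cell's class scores always sums to 1, so an empty cell can never output all zeros, whereas independent sigmoids can all sit near 0.

import tensorflow as tf

logits = tf.constant([-8., -9., -7.5])  # class scores for a cell with no object center
print(tf.sigmoid(logits).numpy())  # all values close to 0: "nothing here"
print(tf.nn.softmax(logits).numpy())  # still sums to 1, so one class is forced to dominate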

Due to the complexity of object detection, loss functions applied uniformly to the entire predicted output, such as L1 (absolute difference between label and prediction), L2 (squared difference between label and prediction), and binary cross entropy, do not train fast enough to be effective. One way to address the complexity of the task while keeping the loss function effective is to split it into two parts: one for classification and one for object localization.

For the object localization loss we will use L2 loss, and for the classification loss we will use focal loss. As background, binary cross entropy can be superior to L1 and L2 loss because of the shape of its loss curve, which has relatively higher gradients as the loss increases. Focal loss is derived from binary cross entropy, improves those gradients even further, and helps with dataset class imbalance. This video explains these loss curves quite well.

Conveniently, these two loss functions can be computed completely separately, each divided by the number of positive samples (the number of object centers present in the input image), and then added together to give the total loss.
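
Putting the last two paragraphs together, a simplified sketch of the combined loss could look like the following (the gamma and alpha values here are illustrative; the full loss actually used for training, with a few extra details, appears further down in the code):

import tensorflow as tf

def detection_loss(y_true, y_pred, classes=20, gamma=2.0, alpha=5.0):
    # cells whose classification section contains a 1, i.e. positive samples
    object_mask = tf.reduce_max(y_true[..., :classes], axis=-1, keepdims=True)
    n_pos = tf.reduce_sum(object_mask)

    # focal loss for classification: binary cross entropy scaled down for easy examples
    bce = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true[..., :classes],
                                                  logits=y_pred[..., :classes])
    p = tf.sigmoid(y_pred[..., :classes])
    p_t = y_true[..., :classes] * p + (1. - y_true[..., :classes]) * (1. - p)
    class_loss = tf.reduce_sum(tf.pow(1. - p_t, gamma) * bce)

    # L2 loss for the box values, only where an object center is present
    box_loss = alpha * tf.reduce_sum(
        object_mask * tf.square(y_true[..., classes:] - tf.sigmoid(y_pred[..., classes:])))

    # each part is divided by the number of positive samples, then the two are added
    return tf.math.divide_no_nan(class_loss, n_pos) + tf.math.divide_no_nan(box_loss, n_pos)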

Though we have used the YOLO research paper for this article, we don't actually need the YOLO model structure. As long as we have effective label encoding, effective loss functions, and a model that is deep enough (contains enough layers), we can build a model and watch it learn object detection. TensorFlow makes it easy to load an existing pre-trained model (see documentation), and it provides many different models trained for image classification on the ImageNet dataset. In fact, the very first YOLO version was heavily pre-trained on ImageNet for image classification before being fine-tuned for object detection. To do the same thing with a different backbone, we load the pre-trained model in TensorFlow, remove the final image classification dense layer, and substitute the decoupled head from the YOLOX paper. The beauty of the decoupled head is that it trains the classification task on a separate branch from the object localization task. Though this requires more trainable weights, it learns object detection faster.

At this point it is quite easy to try out different models that will all demonstrate the ability to learn object detection. However, we must choose usable hyper-parameters for the model to work. For example, the larger the batch size, the better the model tends to generalize and the lower the risk of overfitting. I have found it best to use the largest batch size the GPU can handle without crashing. It may be counterintuitive just how important a hyper-parameter like batch size is, but it can be crucially important.

Larger learning rates allow the model to traverse more of the optimization topography and should, in theory, skip over local optima that could distract from the global optimum. It therefore seems intuitive to maintain the highest learning rate possible until it is no longer effective; at that point the model has presumably reached the general area of the global optimum, but the learning rate is too large to land directly in it. The learning rate should then be lowered, used until it too stops being effective, and the process repeated until the validation loss is satisfactory. I have found that the largest usable learning rate is .01; anything larger produces exploding gradients that ruin the model. When training first begins, the loss is at its highest because the decoupled head is not trained at all, and starting with too high a loss can trigger exploding gradients early on. I have written a custom learning rate schedule that keeps the learning rate high for as long as possible and lowers it only once learning slows, as seen by diminishing returns in the loss.

The number of epochs should depend on the loss (or validation loss) after each epoch, regardless of the size of the dataset. Though a larger dataset may require fewer epochs, the goal is simply to keep training until the loss values are satisfactory.

In YOLOv1 the bounding box loss is multiplied by alpha = 5 for positive samples, and YOLOv8 uses alpha = 7.5; here we use alpha = 10, because L2 loss for bounding box regression trains very slowly.

The process of using a model pretrained on one task for another task is transfer learning. When we take a pretrained model and substitute layers, it is best to freeze the pretrained layers while the new layers adjust to the layers that have already been trained on image data. After a few epochs, once the new layers have caught up with the pretrained ones, we can unfreeze the pretrained layers and let all the layers adjust together; this is called fine-tuning (see the sketch after this paragraph). It is possible for a model to learn without training all the layers, and though fine-tuning can produce better results, this model will demonstrate that training only the substituted end layers can be enough. As you can see from the output of the code, it took only 16 epochs to see the model start learning object detection, and it wasn't even necessary to drop the learning rate. Though the bounding boxes aren't great, it is remarkable how quickly a pre-trained model can learn by training just a few layers.
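
For completeness, a rough sketch of what that unfreezing step could look like after the warm-up epochs, using the names from the code below (the learning rate and epoch count here are only placeholders; this step is not part of the run shown in the output):

# unfreeze the pretrained backbone and recompile so all layers train together
base.trainable = True
model.compile(loss = yolo_loss,
              optimizer = tf.keras.optimizers.SGD(learning_rate = 1e-3, momentum = .937))
model.fit(voc, epochs = 10, validation_data = val)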

The point of this article is to strip object detection training down to its necessities while still being able to learn object detection. YOLOv1 makes multiple bounding box predictions per grid cell, but the complications that come with decoding that output are unnecessary for this stripped-down version. Many YOLO versions also use anchor boxes, IoU predictions, and/or confidence scores; I have found these can be removed while still maintaining functionality, so they are omitted from the TensorFlow model provided in the code below.
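
For reference, decoding this simplified output back into image-space boxes at prediction time is straightforward. A minimal sketch (the score threshold and the row/column indexing here are illustrative and should match whatever convention the encoding function uses):

import tensorflow as tf

def decode(pred, classes=20, gsize=10, threshold=0.5):
    # pred: raw model output for one image, shape (gsize, gsize, classes + coords)
    probs = tf.sigmoid(pred[..., :classes])
    boxes = tf.sigmoid(pred[..., classes:])
    results = []
    for row in range(gsize):
        for col in range(gsize):
            score = float(tf.reduce_max(probs[row, col]))
            if score > threshold:
                cls = int(tf.argmax(probs[row, col]))
                x_cell, y_cell, w, h = boxes[row, col].numpy()
                # convert the cell-relative center back to image-relative coordinates
                x_center = (col + x_cell) / gsize
                y_center = (row + y_cell) / gsize
                results.append((cls, score, x_center, y_center, w, h))
    return results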


import tensorflow as tf

import tensorflow_datasets as tfds


# wd = 'the filepath on your computer where your working files are'

# voc = tfds.load('voc', split='train+test', shuffle_files = True, data_dir= wd)

# val = tfds.load('voc', split='validation', shuffle_files = False, data_dir= wd)


presize = 299

gsize = 10

base = tf.keras.applications.Xception(weights = 'imagenet')

base.trainable = False #freeze the pretrained backbone while the new head trains

base_model = base.layers[-3].output #last convolutional feature map, dropping the global average pooling and dense classification layers


classes = 20

coords = 4

batch_size = 32

gamma = 5.

alpha = 10.


def imglab0(x):

    image, c, b = x['image'], x['objects']['label'], x['objects']['bbox']

    image = tf.image.resize(image, [presize,presize]) / 255.

    tfbase = tf.zeros([gsize,gsize,classes+coords])

    gsx = gsize

    gsy = gsize

    for i in range(len(c)):

        cls = c[i]

        ymin = b[i,0] #tfds bounding boxes are normalized [ymin, xmin, ymax, xmax]

        xmin = b[i,1]

        ymax = b[i,2]

        xmax = b[i,3]

        x = tf.reduce_mean([xmin,xmax]) #(xmin + xmax) / 2

        y = tf.reduce_mean([ymin,ymax]) #(ymin + ymax) / 2

        w = xmax - xmin

        h = ymax - ymin

        

        loc_y = (gsize * y) // 1 #index of the grid section containing the object center, 0 to gsize-1

        loc_x = (gsize * x) // 1

        

        xloc = gsx * x - loc_x

        yloc = gsy * y - loc_y

        

        if (w * h) > (tfbase[int(loc_x), int(loc_y), -2] * tfbase[int(loc_x), int(loc_y), -1]): #grid cell accepts largest object

            clsbox = tf.reshape(tf.concat([tf.one_hot(cls, classes), (xloc, yloc, w, h)], axis = 0), (1, classes + coords)) #[1,classes + coords]

            zeroyup = tf.zeros([gsy - (loc_y + 1), classes + coords]) #[?,classes + coords], for y stack above

            zeroydown = tf.zeros((loc_y, classes + coords)) #[?,classes + coords], for y stack below

            ystack = tf.reshape(tf.concat((zeroydown, clsbox, zeroyup), axis = 0), (1, gsy, classes + coords)) #[1,gsize,classes + coords]

            zeroxup = tf.zeros((gsx - (loc_x + 1), gsy, classes + coords)) #[?,gsize,classes + coords], for x stack above

            zeroxdown = tf.zeros((loc_x, gsy, classes + coords)) #[?,gsize,classes + coords], for x stack below

            xystack = tf.concat((zeroxdown, ystack, zeroxup), axis = 0)

            noobject_mask = 1 - tf.reduce_sum(xystack[...,:classes], axis = -1, keepdims = True) #this is mask for erasing existing tfbase[loc_x,loc_y,:] values

            tfbase = noobject_mask * tfbase + xystack #itemwise multiplication [gsize,gsize,1] * [gsize,gsize,classes+coords]

            

    return image, tfbase


def yolo_loss(y_true, y_pred):


    object_mask = tf.reduce_max(y_true[...,:classes], axis=-1, keepdims= True)

    

    sigmoid_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels = y_true[...,:classes], logits = y_pred[...,:classes]) #focal loss starts from binary cross entropy on the logits

    pred_prob = tf.sigmoid(y_pred[...,:classes])

    mod_factor = y_true[...,:classes] * tf.pow(1. - pred_prob, gamma) + (1. - y_true[...,:classes]) * tf.pow(pred_prob, gamma)

    class_loss = mod_factor * sigmoid_loss

    class_loss = tf.math.divide_no_nan(tf.reduce_sum(tf.reduce_mean(class_loss, axis = -1)), tf.reduce_sum(object_mask))

    

    label_xywh = y_true[...,classes:classes+coords] #L2 loss block

    yp2 = tf.nn.sigmoid(y_pred[...,classes:classes+coords])

    area = label_xywh[...,-2:-1] * label_xywh[...,-1:]

    box_loss = alpha * object_mask * tf.square(label_xywh - yp2) * (2 - area) #(2 - area) gives extra weight to smaller boxes

    box_loss = tf.math.divide_no_nan(box_loss , tf.reduce_sum(object_mask))

    

    return class_loss + box_loss


from tensorflow.keras.layers import Conv2D, BatchNormalization, ZeroPadding2D


def convolutional(input_layer, filters, kernel_size, downsample=False):

    if downsample:

        if (input_layer.shape[1] - kernel_size) // 2 + 1 != input_layer.shape[1] // 2: #pad so a stride-2 'valid' convolution halves the spatial size

            input_layer = ZeroPadding2D(((1, 0), (0, 1)))(input_layer)

        padding = 'valid'

        strides = 2

    else:

        strides = 1

        padding = 'same'


    conv = Conv2D(filters=filters, kernel_size=kernel_size, strides=strides, padding=padding,

                  kernel_initializer='he_normal', #see article on randomized weights

                 )(input_layer)


    conv = BatchNormalization()(conv)  

    conv = tf.keras.activations.relu(conv)


    return conv


def detect(input_layer):

    cls = convolutional(input_layer, 256, kernel_size = 3) #filters from yolox decoupled head

    cls = convolutional(cls, 256, kernel_size = 3)

    cls = Conv2D(filters = classes, kernel_size = 1, strides = 1)(cls)

    

    bbox = convolutional(input_layer, 256, kernel_size = 3) #filters from yolox decoupled head

    bbox = convolutional(bbox, 256, kernel_size = 3)

    bbox = Conv2D(filters = coords, kernel_size = 1, strides = 1)(bbox)

    

    return tf.concat([cls, bbox], axis = -1)


outputs = detect(base_model)


model = tf.keras.Model(inputs=base.input, outputs=outputs)


voc= voc.cache().shuffle(voc.cardinality(), reshuffle_each_iteration=True).map(imglab0, num_parallel_calls=tf.data.AUTOTUNE).batch(batch_size, drop_remainder=True)

voc = voc.prefetch(tf.data.AUTOTUNE)

epoch_steps = len(voc)


val = val.map(imglab0, num_parallel_calls=tf.data.AUTOTUNE).batch(batch_size, drop_remainder=True)

val = val.prefetch(tf.data.AUTOTUNE)


lr = tf.Variable(1e-2) #learning rate kept in a variable so the callback below can adjust it

decay = (1e-1)**(1/7) #each drop multiplies the learning rate by about 0.72

oldloss = tf.Variable(100.)

class LRfunc(tf.keras.callbacks.Callback):

    def on_epoch_begin(self, epoch, logs=None):

        lra = self.model.optimizer.learning_rate(epoch)

        print('lr:', lra.numpy())

        return


    def on_epoch_end(self, epoch, logs=None):

        logs = logs or {}

        newloss = logs.get("loss")

        if lr < 1e-7: #if the learning rate has decayed all the way down, reset it and keep training

            lr.assign(1e-3)

        elif oldloss - newloss < oldloss * .01: #less than 1% improvement this epoch, so lower the learning rate

            lr.assign(lr * decay)

        oldloss.assign(newloss)

        return


file_name = '/whatever you want to call the weights file.hdf5'

mcp_save = tf.keras.callbacks.ModelCheckpoint(wd+file_name, save_best_only=True, monitor='val_loss', save_weights_only=True)


burnin = 200

model.compile(loss = yolo_loss, optimizer=tf.keras.optimizers.SGD(learning_rate = tf.keras.optimizers.schedules.CosineDecay(initial_learning_rate = lr, decay_steps = 1, alpha = 1e0), momentum = .937)) #alpha = 1 makes the cosine schedule constant, so the lr variable fully controls the learning rate

model.fit(voc, epochs = burnin, validation_data = val, callbacks=[LRfunc(), mcp_save]

)

# Output:

lr: 0.01

Epoch 1/200

232/232 [==============================] - 36s 134ms/step - loss: 0.1300 - val_loss: 0.0374

lr: 0.01

Epoch 2/200

232/232 [==============================] - 31s 133ms/step - loss: 0.0336 - val_loss: 0.0287

lr: 0.01

Epoch 3/200

232/232 [==============================] - 31s 136ms/step - loss: 0.0239 - val_loss: 0.0225

lr: 0.01

Epoch 4/200

232/232 [==============================] - 31s 136ms/step - loss: 0.0191 - val_loss: 0.0199

lr: 0.01

Epoch 5/200

232/232 [==============================] - 31s 135ms/step - loss: 0.0168 - val_loss: 0.0187

lr: 0.01

Epoch 6/200

232/232 [==============================] - 31s 135ms/step - loss: 0.0154 - val_loss: 0.0180

lr: 0.01

Epoch 7/200

232/232 [==============================] - 31s 134ms/step - loss: 0.0146 - val_loss: 0.0176

lr: 0.01

Epoch 8/200

232/232 [==============================] - 31s 135ms/step - loss: 0.0139 - val_loss: 0.0174

lr: 0.01

Epoch 9/200

232/232 [==============================] - 31s 135ms/step - loss: 0.0134 - val_loss: 0.0172

lr: 0.01

Epoch 10/200

232/232 [==============================] - 31s 134ms/step - loss: 0.0129 - val_loss: 0.0170

lr: 0.01

Epoch 11/200

232/232 [==============================] - 31s 135ms/step - loss: 0.0125 - val_loss: 0.0170

lr: 0.01

Epoch 12/200

232/232 [==============================] - 31s 135ms/step - loss: 0.0122 - val_loss: 0.0169

lr: 0.01

Epoch 13/200

232/232 [==============================] - 31s 133ms/step - loss: 0.0118 - val_loss: 0.0169

lr: 0.01

Epoch 14/200

232/232 [==============================] - 31s 135ms/step - loss: 0.0116 - val_loss: 0.0169

lr: 0.01

Epoch 15/200

232/232 [==============================] - 31s 134ms/step - loss: 0.0112 - val_loss: 0.0171

lr: 0.01

Epoch 16/200

232/232 [==============================] - 31s 132ms/step - loss: 0.0109 - val_loss: 0.0171