Object Detection from Scratch

November 6, 2023


While many machine learning techniques can be safely discarded when working from a pre-trained model, that is certainly not the case when starting from scratch with randomly initialized weights. In fact, training an object detection model from scratch can be so difficult that Microsoft does not even recommend trying. I have found that training from scratch is essentially a different task from using a pre-trained model.

I cannot overemphasize how sensitive a model is when training from scratch relative to fine-tuning a pretrained model, which is why we must be careful with the data fed into it. The YOLOv3 paper suggests that the COCO dataset may have better labeling accuracy than the smaller VOC dataset. In fact, since YOLOv2, object detection research has migrated toward the COCO dataset, which has 80 classes and well over 160,000 labeled images.

Like the VOC dataset, the COCO dataset has some very large outlier classes. It also turns out that COCO is severely imbalanced, with a large majority of images containing people. Training on the imbalanced dataset produces many false positives for the classes that dominate it, so we first have to reduce the share of the largest outlier class. We start by removing every image in which a person appears; since that alone removes too many examples, we then add back a limited number of person images that also contain other over-represented classes (chairs, cars, and dining tables in the code below). This doesn't perfectly balance the dataset, but it makes the class distribution much easier to work with.

For this article, we must encode the labels in a way that is conducive to learning. The encoding used in the LAST ARTICLE included the x-y coordinates of the object center and the width and height of the object as ratios of the image width and height. A better way to facilitate learning from scratch is to encode all 4 object coordinates as distances from grid points across the image. The benefit of this is that the model learns spatial awareness and can make multiple predictions for a single object, using distance coordinates that depend on which grid point lies within the object's bounding box. Like YOLOv1, the image is divided into grid cells. But instead, for each grid point that falls inside an object's bounding box, we measure the distance from that grid point to the left edge of the box, and likewise to the top, right, and bottom edges. In this way, different grid points produce a completely different set of coordinates for the same object, which is how the model learns spatial awareness within the image.
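
As a concrete illustration with made-up numbers (the imglab0 function below does this for every grid point inside every box), the four distances for a single grid point inside one bounding box can be computed like this:

gs = 10                                        # grid size at one prediction scale
xmin, ymin, xmax, ymax = 0.2, 0.3, 0.6, 0.7    # a normalized bounding box (hypothetical values)
gx, gy = 4., 5.                                # a grid point that falls inside the box

left   = gx / gs - xmin    # distance from the grid point to the left edge
top    = gy / gs - ymin    # distance to the top edge
right  = xmax - gx / gs    # distance to the right edge
bottom = ymax - gy / gs    # distance to the bottom edge

print(left, top, right, bottom)  # all four distances are positive because the grid point lies inside the box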

We will be using the majority of the YOLOv8 structure, which implements a feature pyramid network and makes predictions across 3 scales. This means the label output for an image will contain 3 separate predictions, which form 3 semi-overlapping pathways for backpropagation. This is complicated, yet I have chosen it simply because it is effective for the difficult task of training from scratch. We have scaled the input size down to 320x320 because it is possible to train a model from scratch at this size without data augmentation like mosaic. This results in three prediction scales of 10x10, 20x20, and 40x40. In theory, the higher-resolution scales are better at predicting small objects, and the lower-resolution scales are better at predicting larger objects.
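
The three grid sizes follow directly from the input resolution and the strides of the three feature-pyramid levels (8, 16, and 32 in YOLOv8); a quick sanity check:

presize = 320
strides = [32, 16, 8]                    # downsampling factors of the three prediction heads
print([presize // s for s in strides])   # [10, 20, 40]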

Lastly, the loss function will need to be completely upgraded compared to the pre-trained model from the LAST ARTICLE. The location of an object is related to its classification, and while decoding the output labels, the location of an object depends on its classification score: an object is decoded as present only if its classification score reaches a certain threshold, and its bounding box should come from the box coordinates the model has learned for that particular class. It therefore isn't completely accurate to structure localization and classification as separate tasks, so the loss function used in this article will be cross-trained. The classification loss at a given location will depend on the localization score of the predicted object at that location, and the localization loss will depend on the classification score of the correct object class. An additional benefit of this cross-training is that it reduces the need to tune hyper-parameters that weight the box regression loss against the classification loss.
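
In compressed form, the coupling implemented in the inloss function further below looks roughly like this (a sketch of the idea, not the exact training code):

import tensorflow as tf

def cross_trained_terms(class_true, class_logits, ciou):
    # class_true: one-hot labels, class_logits: raw model outputs, ciou: localization quality (roughly in [-1, 1])
    class_prob = tf.nn.sigmoid(class_logits)
    track = tf.reduce_max(class_true * class_prob, axis=-1, keepdims=True)  # score of the correct class
    box_term = 1.0 - ciou * track                         # box loss weighted by classification confidence
    class_ciou = class_true * ciou + (1.0 - class_true)   # scale only the positive class by CIoU
    coupled_prob = class_prob * class_ciou                # classification term coupled to localization
    return box_term, coupled_prob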

For the object localization loss, we will use complete intersection over union (CIoU) loss. This loss function measures the overlap between the prediction and the ground-truth box in order to propagate the loss to the bounding box coordinates. This video (https://www.youtube.com/watch?v=4wXXNQ4Ylrk&list=LL&index=13) explains why certain types of intersection-over-union loss are superior to L1, L2, and other IoU losses. Classification loss will be based on the focal loss used in the LAST ARTICLE.
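
For reference, the bbox_ciou function below reduces to three terms; this is a minimal summary of the formula it implements:

import tensorflow as tf

def ciou_from_parts(iou, center_distance, enclose_diagonal, v):
    # CIoU = IoU - (squared center distance / squared enclosing-box diagonal) - alpha * v,
    # where v measures aspect-ratio mismatch and alpha weights it more when the IoU is poor.
    alpha = tf.math.divide_no_nan(v, 1.0 - iou + v)
    return iou - tf.math.divide_no_nan(center_distance, enclose_diagonal) - alpha * v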

Once the hyper-parameters are set as in the LAST ARTICLE, the model is ready to be trained. As a final note, the model created in this article is meant to display the bare necessities of training a model from scratch, and to give insight into which machine learning techniques are essential and which are over-emphasized and can be discarded. This model can achieve learning without regularization, anchor boxes, a distinct confidence score, typical "xywh" encoding, data augmentation, or complicated learning rate schedules. It can even perform significant learning using plain ReLU activations and the smallest number of convolutional filters specified by YOLOv8 nano. See code below.

import tensorflow as tf

import tensorflow_datasets as tfds


wd = 'the filepath on your computer where your working files are'

coco = tfds.load('coco', split='train[:40672]', shuffle_files = True, data_dir= wd) #only used portion of dataset; you can choose to use the whole thing for better accuracy

# val = tfds.load('coco', split='validation', shuffle_files = True, data_dir= wd) #can choose to check val accuracy



presize = 320

lms = [10,20,40]

classes = 80

coords = 4

batch_size = 32

gamma = 5.


d = 1/3 #yolov8 nano kernel sizes

w = 1/4

r = 2.


def imglab0(x): # encode one example into a resized image plus multi-scale grid labels

    image= x['image']

    image = tf.image.resize(image, [presize,presize]) / 255.

    b = x['objects']['bbox']

    c = x['objects']['label']


    for ii,jj in enumerate(lms):

        gsx = jj

        gsy = jj

        gsize = jj

        tfbase = tf.zeros([gsx,gsy,classes+coords], dtype = tf.float32)

        for i in range(len(b)):

            cls = int(c[i])

            ymin = b[i,0] # tfds bboxes are normalized [ymin, xmin, ymax, xmax]

            xmin = b[i,1]

            ymax = b[i,2]

            xmax = b[i,3]

            


            gsxmin = 1 + (xmin * gsx // 1) # this ensures positive ltrb outputs

            gsxmax = xmax * gsx // 1

            gsymin = 1 + ymin * gsy // 1

            gsymax = ymax * gsy // 1

            

            xgrid = tf.tile(tf.reshape(tf.range(gsx, dtype = tf.float32) , [gsx, 1, 1]), [1, gsy, 1])

            ygrid = tf.tile(tf.reshape(tf.range(gsy, dtype = tf.float32) , [1, gsy, 1]), [gsx, 1, 1])

            xmask = tf.where((xgrid < gsxmin) | (xgrid > gsxmax), x = 0., y= 1.)

            ymask = tf.where((ygrid < gsymin) | (ygrid > gsymax), x = 0., y= 1.)

            mask = xmask * ymask

            

            coh = tf.tile(tf.reshape(tf.one_hot(cls, classes, dtype = tf.float32), [1, 1, classes]), [gsx, gsy, 1])


            cmask = mask * coh


            left = mask * (xgrid/gsx - xmin)

            top = mask * (ygrid/gsy - ymin)

            right = mask * (xmax - xgrid/gsx)

            bottom = mask * (ymax - ygrid/gsy)


            mbase = tf.concat([cmask, left, top, right, bottom], axis = -1)

            tfbb = (tfbase[..., classes:classes+1] + tfbase[..., classes+2:classes+3]) * (tfbase[..., classes+1:classes+2] + tfbase[..., classes+3:classes+4])

            tfob = tf.reduce_max(tfbase[...,:classes], axis = -1, keepdims = True)

            tfnoob = (1 - tfob)

            mbb = (left + right) * (top + bottom)

            mbob = mask

            mbnoob =  (1 - mask)

            

            tferase = mbnoob * tf.where(mbb * tfob * mbob < tfbb * tfob * mbob, x = tf.cast(0, dtype = tf.float32), y = tf.cast(1, dtype = tf.float32)) #if bb is less than tfbb erase from tfbase, if bb is greater than tfbb keep in tfbase

            mberase = tf.maximum((1. - tferase), (mbob * tfnoob))

                

            tfbase = tferase * tfbase + mberase * mbase


        if jj == lms[0]:

            lbox = tfbase

        elif jj == lms[1]:

            mbox = tfbase

        else:

            sbox = tfbase


    lbox =  tf.image.pad_to_bounding_box(lbox, 0,0, lms[-1], lms[-1])

    lbox = tf.expand_dims(lbox, axis = 0)


    mbox =  tf.image.pad_to_bounding_box(mbox, 0,0, lms[-1], lms[-1])

    mbox = tf.expand_dims(mbox, axis = 0)


    sbox = tf.expand_dims(sbox, axis = 0)

    

    return image, tf.concat([lbox,mbox,sbox], axis = 0)


def numx(x): # keep examples with fewer than 2 person instances (person = label 0)

    t = x['objects']['label']

    t = tf.where(t == 0, x=1, y=0)

    return tf.reduce_sum(t) < 2


person = coco.filter(lambda x: tf.reduce_any(x['objects']['label'] == 0) & numx(x) & (tf.reduce_any(x['objects']['label'] == 56) | tf.reduce_any(x['objects']['label'] == 2) | tf.reduce_any(x['objects']['label'] == 60)) ).take(5000) # keep 5000 single-person images that also contain a chair (56), car (2), or dining table (60)


rest = coco.filter(lambda x: tf.reduce_all(x['objects']['label'] != 0) ) # images that contain no people


findata = rest.concatenate(person)

coco = findata.shuffle(8000, reshuffle_each_iteration=True).map(imglab0, num_parallel_calls=tf.data.AUTOTUNE).batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)


from tensorflow.keras.layers import Conv2D, Input, BatchNormalization, MaxPooling2D, ZeroPadding2D


def convolutional(input_layer, filters, kernel_size, downsample=False):

    if downsample:

        input_layer = ZeroPadding2D(((0, 1), (0, 1)))(input_layer)

        padding = 'valid'

        strides = 2

    else:

        strides = 1

        padding = 'same'


    conv = Conv2D(filters=filters, kernel_size=kernel_size, strides=strides,

                  padding=padding, #use_bias=False, kernel_regularizer=l2(0.0005),

                  kernel_initializer='he_normal',

                 )(input_layer)


    conv = BatchNormalization()(conv)  

    conv = tf.keras.activations.relu(conv)

    return conv


def bottle(input_layer, shortcut = True):

    res = input_layer

    filters = input_layer.shape[-1]

    conv = convolutional(input_layer, filters = filters, kernel_size = 3)

    conv = convolutional(conv, filters = filters, kernel_size = 3)

    if shortcut:

        conv = res + conv

    return conv


def c2f(input_layer, filters, shortcut = True):

    res = convolutional(input_layer, filters = filters, kernel_size = 1)

    res0, conv = tf.split(res, 2, axis = -1)

    n = int(6 * d)

    out = tf.concat([res0,conv], axis = -1)

    for i in range(n):

        conv = bottle(conv, shortcut = shortcut)

        out = tf.concat([out,conv], axis = -1)

    route = convolutional(out, filters = filters, kernel_size = 1)

    return route


def sppf(input_layer):

    route = convolutional(input_layer, filters = int(512*w*r/4), kernel_size = 1)

    mp = MaxPooling2D(pool_size = 5, strides = 1, padding='same')(route)

    route = tf.concat([route,mp], axis = -1)

    mp = MaxPooling2D(pool_size = 5, strides = 1, padding='same')(mp)

    route = tf.concat([route,mp], axis = -1)

    mp = MaxPooling2D(pool_size = 5, strides = 1, padding='same')(mp)

    route = tf.concat([route,mp], axis = -1)

    route = convolutional(route, filters = int(512*w*r), kernel_size = 1)

    return route


def detect(input_layer):

    cls = convolutional(input_layer, 256, kernel_size = 3) #filters from yolox decoupled head

    cls = convolutional(cls, 256, kernel_size = 3)

    cls = Conv2D(filters = classes, kernel_size = 1, strides = 1)(cls)

    

    bbox = convolutional(input_layer, 256, kernel_size = 3) #filters from yolox decoupled head

    bbox = convolutional(bbox, 256, kernel_size = 3)

    bbox = Conv2D(filters = coords, kernel_size = 1, strides = 1)(bbox)

    

    return tf.concat([cls, bbox], axis = -1)


def fpn(lroute,mroute,sroute):


    route = sppf(lroute)

    lroute = route

    route = tf.image.resize(route, [route.shape[1] * 2, route.shape[2] * 2], method = 'nearest')

    route = tf.concat([route, mroute], axis = -1)

    route = c2f(route, filters = int(512*w), shortcut = False)

    mroute = route


    route = tf.image.resize(route, [route.shape[1] * 2, route.shape[2] * 2], method = 'nearest')


    route = tf.concat([route, sroute], axis = -1)

    route = c2f(route, filters = int(256*w), shortcut = False)

    sroute = route

    sroute = detect(sroute)


    route = convolutional(route, filters = int(256*w), kernel_size = 3, downsample = True)

    route = tf.concat([route, mroute], axis = -1)

    route = c2f(route, filters = int(512*w), shortcut = False)

    mroute = route

    mroute = detect(mroute)


    route = convolutional(route, filters = int(512*w), kernel_size = 3, downsample = True)

    route = tf.concat([route, lroute], axis = -1)

    lroute = c2f(route, filters = int(512*w*r), shortcut= False)

    lroute = detect(lroute)


    lroute =  tf.image.pad_to_bounding_box(lroute, 0,0, lms[-1], lms[-1])

    lroute = tf.expand_dims(lroute, axis = 1)


    mroute =  tf.image.pad_to_bounding_box(mroute, 0,0, lms[-1], lms[-1])

    mroute = tf.expand_dims(mroute, axis = 1)


    sroute = tf.expand_dims(sroute, axis = 1)


    outputs = tf.concat([lroute,mroute,sroute], axis = 1) # [batch_size, 3 scales, gsize, gsize, classes+coords]


    return outputs


inputs = Input([presize,presize,3]) # YOLOv8 backbone, built from scratch

route = convolutional(inputs, filters = int(64*w), kernel_size = 3, downsample = True)

route = convolutional(route, filters = int(128*w), kernel_size = 3, downsample = True)

route = c2f(route, filters = int(128*w), shortcut = True)

route = convolutional(route, filters = int(256*w), kernel_size = 3, downsample= True)

route = c2f(route, filters = int(256*w), shortcut = True)

sroute = route


route = convolutional(route, filters = int(512*w), kernel_size = 3, downsample = True)

route = c2f(route, filters = int(512*w), shortcut = True)

mroute = route


route = convolutional(route, filters = int(512*w*r), kernel_size = 3, downsample = True)

route = c2f(route, filters = int(512*w*r), shortcut = True)

lroute = route


outputs = fpn(lroute,mroute,sroute)


model = tf.keras.Model(inputs, outputs)


def bbox_ciou(b_true, b_pred, gsize):

    

    gsx, gsy = gsize, gsize


    xgrid = tf.tile(tf.reshape(tf.range(gsx, dtype = tf.float32) , [1, gsx, 1, 1]), [batch_size, 1, gsy, 1])

    ygrid = tf.tile(tf.reshape(tf.range(gsy, dtype = tf.float32) , [1, 1, gsy, 1]), [batch_size, gsx, 1, 1])

    

    lxmin = tf.maximum(0., xgrid/gsx - b_true[..., 0:1])

    lymin = tf.maximum(0., ygrid/gsy - b_true[..., 1:2])

    lxmax = tf.minimum(1., xgrid/gsx + b_true[..., 2:3])

    lymax = tf.minimum(1., ygrid/gsy + b_true[..., 3:4])

    

    b_true_w = (lxmax - lxmin)

    b_true_h = (lymax - lymin)

    

    pxmin = tf.maximum(0., xgrid/gsx - b_pred[..., 0:1])

    pymin = tf.maximum(0., ygrid/gsy - b_pred[..., 1:2])

    pxmax = tf.minimum(1., xgrid/gsx + b_pred[..., 2:3])

    pymax = tf.minimum(1., ygrid/gsy + b_pred[..., 3:4])

    

    b_pred_w = (pxmax - pxmin)

    b_pred_h = (pymax - pymin)


    b_true_mins = tf.concat([lxmin, lymin], axis = -1)

    b_true_maxes = tf.concat([lxmax, lymax], axis = -1)

    

    b_pred_mins = tf.concat([pxmin, pymin], axis = -1)

    b_pred_maxes = tf.concat([pxmax, pymax], axis = -1)

    

    intersect_mins = tf.maximum(b_true_mins, b_pred_mins)

    intersect_maxes = tf.minimum(b_true_maxes, b_pred_maxes)

    intersect_wh = tf.maximum(intersect_maxes - intersect_mins, 0.)

    intersect_area = intersect_wh[..., 0:1] * intersect_wh[..., 1:2]

        

    b_true_area = b_true_w * b_true_h

    b_pred_area = b_pred_w * b_pred_h

    

    union_area = b_true_area + b_pred_area - intersect_area

    

    # calculate IoU (divide_no_nan avoids division by zero)

    iou = tf.math.divide_no_nan(intersect_area, union_area)


    # get enclosed area

    enclose_mins = tf.minimum(b_true_mins, b_pred_mins)

    enclose_maxes = tf.maximum(b_true_maxes, b_pred_maxes)

    enclose_wh = tf.maximum(enclose_maxes - enclose_mins, 0.)


    # box center distance

    b_true_x = tf.reduce_mean(tf.concat([lxmin, lxmax], axis = -1), axis = -1, keepdims = True)

    b_true_y = tf.reduce_mean(tf.concat([lymin, lymax], axis = -1), axis = -1, keepdims = True)

    b_true_xy = tf.concat([b_true_x, b_true_y], axis = -1)

    

    b_pred_x = tf.reduce_mean(tf.concat([pxmin, pxmax], axis = -1), axis = -1, keepdims = True)

    b_pred_y = tf.reduce_mean(tf.concat([pymin, pymax], axis = -1), axis = -1, keepdims = True)

    b_pred_xy = tf.concat([b_pred_x, b_pred_y], axis = -1)

    

    center_distance = tf.reduce_sum(tf.square(b_true_xy - b_pred_xy), axis = -1, keepdims = True)


    # get enclosed diagonal distance

    enclose_diagonal = tf.reduce_sum(tf.square(enclose_wh), axis = -1, keepdims = True)

    

    # calculate DIoU (divide_no_nan avoids division by zero)

    diou = iou - tf.math.divide_no_nan(center_distance, enclose_diagonal)


    

    pi = 3.14159265359

    

    v = (4 / pi ** 2) * tf.square(tf.math.atan2(b_true_w, b_true_h) - tf.math.atan2(b_pred_w, b_pred_h))

    alpha = tf.math.divide_no_nan(v, ((1.0) - iou + v))

    ciou = diou - alpha*v


    return ciou


def inloss(y_true, y_pred, gsize):


    object_mask = tf.where(tf.reduce_max(y_true[...,:classes], axis=-1, keepdims= True) > 0., x = 1., y = 0.)

    

    area = tf.expand_dims((y_true[...,classes] + y_true[...,classes+2]) * (y_true[...,classes+1] + y_true[...,classes+3]), axis = -1)

    pred_ltrb = tf.nn.sigmoid(y_pred[...,classes:classes+coords])

    label_ltrb = y_true[...,classes:classes+coords]

    

    track = tf.reduce_max(y_true[...,:classes] * tf.nn.sigmoid(y_pred[...,:classes]), axis = -1, keepdims = True)

    

    ciou = bbox_ciou(label_ltrb, pred_ltrb, gsize)

    classciou = y_true[...,:classes] * ciou + (1. - y_true[...,:classes]) # cross-train term, active only for positive samples

    bbox_loss_scale = (2 - area)

    

    box_loss = object_mask * bbox_loss_scale * (1. - ciou*track)

    box_loss = tf.math.divide_no_nan(tf.reduce_sum(box_loss), tf.reduce_sum(object_mask))

    

    pred_prob = tf.sigmoid(y_pred[...,:classes])

    pred_prob = pred_prob * classciou #cross train

    mod_factor = y_true[...,:classes] * tf.pow(1. - pred_prob, gamma) + (1. - y_true[...,:classes]) * tf.pow(pred_prob, gamma)

    class_loss = mod_factor * tf.keras.losses.BinaryCrossentropy(axis = -1, reduction = 'none')(y_true = tf.expand_dims(y_true[...,:classes], axis = -1), y_pred = tf.expand_dims(pred_prob, axis = -1))


    class_loss = tf.math.divide_no_nan(tf.reduce_sum(tf.reduce_mean(class_loss, axis = -1)), tf.reduce_sum(object_mask))

    

    return class_loss + box_loss


def yolo_loss(y_true, y_pred):

    llbox = y_true[:,0,:lms[0],:lms[0],:]

    lmbox = y_true[:,1,:lms[1],:lms[1],:]

    lsbox = y_true[:,2,...]

    

    plbox = y_pred[:,0,:lms[0],:lms[0],:]

    pmbox = y_pred[:,1,:lms[1],:lms[1],:]

    psbox = y_pred[:,2,...]


    lloss = inloss(llbox, plbox, lms[0])

    mloss = inloss(lmbox, pmbox, lms[1])

    sloss = inloss(lsbox, psbox, lms[2])

    

    return lloss + mloss + sloss


lr = tf.Variable(1e-2) # learning rate, adjusted manually by the callback below

decay = (1e-1)**(1/7) # seven consecutive decays reduce the learning rate by 10x

oldloss = tf.Variable(100.) # previous epoch loss, used to detect plateaus

class printLR(tf.keras.callbacks.Callback):

    def on_epoch_begin(self, epoch, logs=None):

        lra = self.model.optimizer.lr(epoch)

        print('lr:', lra.numpy())

        return


    def on_epoch_end(self, epoch, logs=None):

        logs = logs or {}

        newloss = logs.get("loss")

        if lr < 1e-6:

            lr.assign(1e-3)

        elif oldloss - newloss < oldloss * .01:

            lr.assign(lr * decay)

            

        oldloss.assign(newloss)

        return


file_name = '/whatever you want to call the weights file.hdf5'

mcp_save = tf.keras.callbacks.ModelCheckpoint(wd+file_name, save_best_only=True, monitor='loss', save_weights_only=True)


epochs = 100

model.compile(loss = yolo_loss,
              optimizer = tf.keras.optimizers.SGD(
                  # with alpha = 1.0 the cosine schedule is a constant pass-through,
                  # so the lr tf.Variable above effectively controls the learning rate
                  learning_rate = tf.keras.optimizers.schedules.CosineDecay(initial_learning_rate = lr, decay_steps = 1, alpha = 1e0)))


model.fit(coco, epochs = epochs, callbacks=[printLR(), mcp_save], 

#           validation_data = val,

)

#Output

lr: 0.01

Epoch 1/100

1271/1271 [==============================] - 202s 147ms/step - loss: 3.7085

lr: 0.01

Epoch 2/100

1271/1271 [==============================] - 195s 148ms/step - loss: 3.5942

lr: 0.01

Epoch 3/100

1271/1271 [==============================] - 194s 147ms/step - loss: 3.5219

lr: 0.01

Epoch 4/100

1271/1271 [==============================] - 197s 149ms/step - loss: 3.4529

lr: 0.01

Epoch 5/100

1271/1271 [==============================] - 202s 153ms/step - loss: 3.3877

lr: 0.01

Epoch 6/100

1271/1271 [==============================] - 201s 152ms/step - loss: 3.3458

lr: 0.01

Epoch 7/100

1271/1271 [==============================] - 196s 148ms/step - loss: 3.2876

lr: 0.01

Epoch 8/100

1271/1271 [==============================] - 194s 147ms/step - loss: 3.2399

lr: 0.01

Epoch 9/100

1271/1271 [==============================] - 194s 147ms/step - loss: 3.2049

lr: 0.01

Epoch 10/100

1271/1271 [==============================] - 196s 149ms/step - loss: 3.1658

lr: 0.01

Epoch 11/100

1271/1271 [==============================] - 194s 147ms/step - loss: 3.1256

lr: 0.01

Epoch 12/100

1271/1271 [==============================] - 195s 148ms/step - loss: 3.0926

lr: 0.01

Epoch 13/100

1271/1271 [==============================] - 195s 148ms/step - loss: 3.0502

lr: 0.01

Epoch 14/100

1271/1271 [==============================] - 195s 148ms/step - loss: 3.0237

lr: 0.0071968567

Epoch 15/100

1271/1271 [==============================] - 197s 150ms/step - loss: 2.9624

lr: 0.0071968567

Epoch 16/100

1271/1271 [==============================] - 195s 148ms/step - loss: 2.9221

lr: 0.0071968567

Epoch 17/100

1271/1271 [==============================] - 197s 147ms/step - loss: 2.8948

lr: 0.0051794746

Epoch 18/100

1271/1271 [==============================] - 198s 148ms/step - loss: 2.8256

lr: 0.0051794746

Epoch 19/100

1271/1271 [==============================] - 200s 152ms/step - loss: 2.7890

lr: 0.0051794746

Epoch 20/100

1271/1271 [==============================] - 196s 148ms/step - loss: 2.7620

lr: 0.0037275937

Epoch 21/100

1271/1271 [==============================] - 200s 152ms/step - loss: 2.6963

lr: 0.0037275937

Epoch 22/100

1271/1271 [==============================] - 197s 149ms/step - loss: 2.6645

lr: 0.0037275937

Epoch 23/100

1271/1271 [==============================] - 198s 150ms/step - loss: 2.6352

lr: 0.0037275937

Epoch 24/100

1271/1271 [==============================] - 195s 148ms/step - loss: 2.6058

lr: 0.0037275937

Epoch 25/100

1271/1271 [==============================] - 198s 151ms/step - loss: 2.5866

lr: 0.0026826959

Epoch 26/100

1271/1271 [==============================] - 195s 148ms/step - loss: 2.5188

lr: 0.0026826959

Epoch 27/100

1271/1271 [==============================] - 198s 149ms/step - loss: 2.4802

lr: 0.0026826959

Epoch 28/100

1271/1271 [==============================] - 195s 148ms/step - loss: 2.4646

lr: 0.0019306978

Epoch 29/100

1271/1271 [==============================] - 202s 152ms/step - loss: 2.4079

lr: 0.0019306978

Epoch 30/100

1271/1271 [==============================] - 198s 150ms/step - loss: 2.3688

lr: 0.0019306978

Epoch 31/100

1271/1271 [==============================] - 195s 148ms/step - loss: 2.3503

lr: 0.0013894956

Epoch 32/100

1271/1271 [==============================] - 200s 152ms/step - loss: 2.3160

lr: 0.0013894956

Epoch 33/100

1271/1271 [==============================] - 198s 151ms/step - loss: 2.2775

lr: 0.0013894956

Epoch 34/100

1271/1271 [==============================] - 205s 154ms/step - loss: 2.2632

lr: 0.001

Epoch 35/100

1271/1271 [==============================] - 205s 152ms/step - loss: 2.2374

lr: 0.001

Epoch 36/100

1271/1271 [==============================] - 196s 149ms/step - loss: 2.2057

lr: 0.001

Epoch 37/100

1271/1271 [==============================] - 200s 152ms/step - loss: 2.1920

lr: 0.0007196857

Epoch 38/100

1271/1271 [==============================] - 203s 153ms/step - loss: 2.1703

lr: 0.0005179475

Epoch 39/100

1271/1271 [==============================] - 204s 154ms/step - loss: 2.1455

lr: 0.0005179475

Epoch 40/100

1271/1271 [==============================] - 203s 152ms/step - loss: 2.1303

lr: 0.0003727594

Epoch 41/100

1271/1271 [==============================] - 207s 154ms/step - loss: 2.1197

lr: 0.0002682696

Epoch 42/100

1271/1271 [==============================] - 204s 154ms/step - loss: 2.1059

lr: 0.00019306978

Epoch 43/100

1271/1271 [==============================] - 202s 153ms/step - loss: 2.0956

lr: 0.00013894956

Epoch 44/100

1271/1271 [==============================] - 203s 153ms/step - loss: 2.0896

lr: 0.000100000005

Epoch 45/100

1271/1271 [==============================] - 204s 154ms/step - loss: 2.0829

lr: 7.196857e-05

Epoch 46/100

1271/1271 [==============================] - 203s 153ms/step - loss: 2.0824

lr: 5.179475e-05

Epoch 47/100

1271/1271 [==============================] - 200s 152ms/step - loss: 2.0817

lr: 3.727594e-05

Epoch 48/100

1271/1271 [==============================] - 196s 149ms/step - loss: 2.0751

lr: 2.682696e-05

Epoch 49/100

1271/1271 [==============================] - 197s 150ms/step - loss: 2.0736

lr: 1.9306979e-05

Epoch 50/100

1271/1271 [==============================] - 202s 150ms/step - loss: 2.0768

lr: 1.3894956e-05

Epoch 51/100

1271/1271 [==============================] - 195s 148ms/step - loss: 2.0755

lr: 1.0000001e-05

Epoch 52/100

1271/1271 [==============================] - 201s 152ms/step - loss: 2.0738

lr: 7.196857e-06

Epoch 53/100

1271/1271 [==============================] - 196s 148ms/step - loss: 2.0727

lr: 5.179475e-06

Epoch 54/100

1271/1271 [==============================] - 202s 149ms/step - loss: 2.0749

lr: 3.727594e-06

Epoch 55/100

1271/1271 [==============================] - 196s 148ms/step - loss: 2.0722

lr: 2.682696e-06

Epoch 56/100

1271/1271 [==============================] - 199s 151ms/step - loss: 2.0687

lr: 1.9306979e-06

Epoch 57/100

1271/1271 [==============================] - 202s 153ms/step - loss: 2.0731

lr: 1.3894955e-06

Epoch 58/100

1271/1271 [==============================] - 202s 153ms/step - loss: 2.0723

lr: 1e-06

Epoch 59/100

1271/1271 [==============================] - 201s 152ms/step - loss: 2.0687

lr: 7.1968566e-07

Epoch 60/100

1271/1271 [==============================] - 204s 154ms/step - loss: 2.0789

lr: 0.001

Epoch 61/100

1271/1271 [==============================] - 204s 154ms/step - loss: 2.1174

lr: 0.0007196857

Epoch 62/100

1271/1271 [==============================] - 204s 154ms/step - loss: 2.1042

lr: 0.0005179475

Epoch 63/100

1271/1271 [==============================] - 204s 153ms/step - loss: 2.0840

lr: 0.0003727594

Epoch 64/100

1271/1271 [==============================] - 205s 154ms/step - loss: 2.0630

lr: 0.0003727594

Epoch 65/100

1271/1271 [==============================] - 201s 152ms/step - loss: 2.0498

lr: 0.0002682696

Epoch 66/100

1271/1271 [==============================] - 203s 154ms/step - loss: 2.0465

lr: 0.00019306978

Epoch 67/100

1271/1271 [==============================] - 203s 154ms/step - loss: 2.0320

lr: 0.00013894956

Epoch 68/100

1271/1271 [==============================] - 204s 153ms/step - loss: 2.0263

lr: 0.000100000005

Epoch 69/100

1271/1271 [==============================] - 206s 153ms/step - loss: 2.0242

lr: 7.196857e-05

Epoch 70/100

1271/1271 [==============================] - 203s 153ms/step - loss: 2.0198

lr: 5.179475e-05

Epoch 71/100

1271/1271 [==============================] - 199s 150ms/step - loss: 2.0173

lr: 3.727594e-05

Epoch 72/100

1271/1271 [==============================] - 204s 154ms/step - loss: 2.0122

lr: 2.682696e-05

Epoch 73/100

1271/1271 [==============================] - 205s 155ms/step - loss: 2.0166

lr: 1.9306979e-05

Epoch 74/100

1271/1271 [==============================] - 201s 152ms/step - loss: 2.0131

lr: 1.3894956e-05

Epoch 75/100

1271/1271 [==============================] - 207s 155ms/step - loss: 2.0080

lr: 1.0000001e-05

Epoch 76/100

1271/1271 [==============================] - 205s 155ms/step - loss: 2.0101

lr: 7.196857e-06

Epoch 77/100

1271/1271 [==============================] - 201s 151ms/step - loss: 2.0089

lr: 5.179475e-06

Epoch 78/100

1271/1271 [==============================] - 200s 151ms/step - loss: 2.0112

lr: 3.727594e-06

Epoch 79/100

1271/1271 [==============================] - 200s 151ms/step - loss: 2.0081

lr: 2.682696e-06

Epoch 80/100

1271/1271 [==============================] - 207s 154ms/step - loss: 2.0096

The above VOC images show some highlighted predictions on a completely different dataset than the COCO dataset we trained on. These samples were definitely cherry-picked, but the model can be improved simply by training on the entire COCO dataset and by using a YOLOv8 model with many more layers, such as the XL variant. Adding regularization and using the swish activation function rather than ReLU would also be easy improvements.

The above images show the heatmap of the FPN classification outputs for the TV monitor class. Red signifies higher classification prediction scores for that class. A prediction at a given scale only produces a bounding box if the classification score of some class reaches a threshold of 0.99.

The above images show the heatmap of the FPN classification outputs for the dining table class. Red signifies higher classification prediction scores for that class. Again, a prediction at a given scale only produces a bounding box if the classification score of some class reaches the 0.99 threshold. Other predictions may not perform quite as well, but the fact that the model can make predictions with a decent level of accuracy shows that it is actually learning object detection.
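
For completeness, here is a minimal sketch of how a single prediction scale could be decoded back into boxes, assuming the ltrb grid encoding described above and the 0.99 score threshold mentioned here. This helper is hypothetical, is not part of the training script, and relies on the globals (classes, coords, lms, model) defined in the code above:

def decode_scale(pred, gs, score_thresh=0.99):
    # pred: one scale of the model output, shape [gs, gs, classes + coords]
    class_prob = tf.nn.sigmoid(pred[..., :classes])
    ltrb = tf.nn.sigmoid(pred[..., classes:classes + coords])

    # grid point coordinates, matching the construction in imglab0 and bbox_ciou
    xgrid = tf.tile(tf.reshape(tf.range(gs, dtype=tf.float32), [gs, 1, 1]), [1, gs, 1])
    ygrid = tf.tile(tf.reshape(tf.range(gs, dtype=tf.float32), [1, gs, 1]), [gs, 1, 1])

    xmin = xgrid / gs - ltrb[..., 0:1]
    ymin = ygrid / gs - ltrb[..., 1:2]
    xmax = xgrid / gs + ltrb[..., 2:3]
    ymax = ygrid / gs + ltrb[..., 3:4]

    score = tf.reduce_max(class_prob, axis=-1)
    label = tf.argmax(class_prob, axis=-1)
    keep = score > score_thresh

    boxes = tf.concat([ymin, xmin, ymax, xmax], axis=-1)  # tfds-style [ymin, xmin, ymax, xmax]
    return tf.boolean_mask(boxes, keep), tf.boolean_mask(label, keep), tf.boolean_mask(score, keep)

# example usage on the 40x40 scale of one image (index 2 of the concatenated output):
# boxes, labels, scores = decode_scale(model(image[None])[0, 2], lms[2])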