Object Detection from Scratch
November 6, 2023
While many machine learning techniques can be safely discarded when fine-tuning a pre-trained model, that is certainly not the case when starting from randomly initialized weights. In fact, training an object detection model from scratch can be so difficult that Microsoft does not even recommend trying. I have found that training from scratch is essentially a different task altogether from using a pre-trained model.
I cannot overemphasize how much more sensitive a model is when trained from scratch compared to a pre-trained one, which is why we must be careful with the data we feed it. The YOLOv3 paper suggests that the COCO dataset may have better labeling accuracy than the smaller VOC dataset, and since YOLOv2, object detection research has migrated toward COCO, which has 80 classes and well over 160,000 labeled images.
Like VOC, the COCO dataset has very large class outliers: its classes are severely imbalanced, with a large majority of images containing people. Training on the imbalanced dataset produces many false positives for the classes that dominate it, so we first have to reduce the over-represented classes. Removing every image that contains a person is a start, but it is not enough on its own, so we also thin out the person images, keeping back only a limited number that contain at most one person alongside other over-represented classes (this is what the dataset filtering code below does). This does not perfectly balance the dataset, but it makes the class distribution much easier to work with.
For this article we must encode the labels in a way that is conducive to learning. The encoding used in the LAST ARTICLE stored the x-y coordinate of the object center plus the object width and height as ratios of the image dimensions. A better way to facilitate learning from scratch is to encode all four box coordinates as distances from grid points across the image. As in YOLOv1, the image is divided into a grid, but here every grid point that falls inside an object's bounding box measures its own distance to the left, top, right, and bottom edges of that box. Different grid points therefore produce completely different sets of coordinates for the same object, which lets the model make multiple predictions per object and forces it to learn spatial awareness within the image. A toy example of the encoding is shown below.
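To make the encoding concrete, here is a minimal toy sketch (made-up numbers, independent of the training code below) of the left/top/right/bottom distances computed at a single grid point inside a box:

# Toy illustration of the grid-point ltrb encoding (made-up numbers, not part of the training code)
xmin, ymin, xmax, ymax = 0.20, 0.30, 0.60, 0.80   # a box in normalized image coordinates
gs = 10                       # the coarsest 10x10 grid
gx, gy = 4, 5                 # a grid point that falls inside the box
px, py = gx / gs, gy / gs     # grid point location in normalized coordinates: (0.4, 0.5)
left   = px - xmin            # 0.20
top    = py - ymin            # 0.20
right  = xmax - px            # 0.20
bottom = ymax - py            # 0.30
print(left, top, right, bottom)   # a different grid point inside the same box gives different distances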
We will be using most of the YOLOv8 structure, which implements a feature pyramid network and makes predictions across three scales. This means each image label contains three separate prediction maps, forming three semi-overlapping pathways for backpropagation. This adds complexity, but I have chosen it simply because it is effective for the difficult task of training from scratch. Since we are training from scratch without data augmentation such as mosaic, the input is scaled down to 320x320, which yields prediction grids of 10x10, 20x20, and 40x40. In theory, the higher resolution scales are better at predicting small objects and the lower resolution scales are better at predicting larger objects.
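The three grid sizes follow directly from the 320x320 input and the usual FPN strides of 32, 16, and 8 (a quick check, not part of the model code):

presize = 320
strides = [32, 16, 8]                    # coarse-to-fine strides of the feature pyramid
grids = [presize // s for s in strides]
print(grids)                             # [10, 20, 40] -> the lms list used in the code below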
Lastly, the loss function needs to be completely upgraded compared to the pre-trained model from the LAST ARTICLE. Localization and classification are not independent: an object is decoded as present only when its classification score reaches a threshold, and its bounding box should be the one the model has learned for that particular class. It therefore isn't entirely accurate to structure them as separate tasks, so the loss used in this article is cross trained: the classification loss at a location depends on the localization quality of the prediction there, and the localization loss at a location depends on the classification score of the correct class. An added benefit of this cross training is that it reduces the need to hand-tune weighting factors between the box regression and classification losses.
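Schematically, the coupling looks like the following toy sketch for a single positive location (made-up values; the actual implementation is the inloss function further below):

import tensorflow as tf
# Toy sketch of the cross-trained loss coupling at one positive location (made-up values)
y_true_cls = tf.constant([0., 1., 0.])      # one-hot label over 3 toy classes
cls_logits = tf.constant([-2., 1.5, -1.])   # raw class predictions at that location
ciou = tf.constant(0.7)                     # CIoU between predicted and true box there
# the box loss is gated by the predicted score of the correct class ("track" in inloss)
track = tf.reduce_max(y_true_cls * tf.nn.sigmoid(cls_logits))
box_term = 1. - ciou * track
# the class loss sees probabilities scaled by localization quality for the correct class
classciou = y_true_cls * ciou + (1. - y_true_cls)
pred_prob = tf.nn.sigmoid(cls_logits) * classciou
print(float(track), float(box_term), pred_prob.numpy())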
For object localization we will use complete intersection over union (CIoU) loss, which measures the overlap between the prediction and the ground-truth box so the loss propagates effectively to the bounding box coordinates. This video (https://www.youtube.com/watch?v=4wXXNQ4Ylrk&list=LL&index=13) explains well why certain variants of intersection over union loss are superior to L1, L2, and plain IoU losses. The classification loss is based on the focal loss used in the LAST ARTICLE.
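For reference, the CIoU computed in the bbox_ciou function below follows the standard definition, where ρ is the distance between the box centers, c is the diagonal of the smallest enclosing box, and the quantity minimized is 1 - CIoU:

$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$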
Once the hyper-parameters are set as in the LAST ARTICLE, the model is ready to be trained. As a final note, the model built in this article is meant to show the bare necessities of training from scratch, and to give insight into which machine learning techniques are essential and which are over-emphasized and can be discarded. It learns without regularization, anchor boxes, a distinct confidence score, the typical "xywh" encoding, data augmentation, or complicated learning rate schedules. It can even learn a significant amount using plain ReLU activations and the smallest number of convolutional filters, following YOLOv8-nano. See code below.
import tensorflow as tf
import tensorflow_datasets as tfds
wd = 'the filepath on your computer where your working files are'
coco = tfds.load('coco', split='train[:40672]', shuffle_files = True, data_dir= wd) #only used portion of dataset; you can choose to use the whole thing for better accuracy
# val = tfds.load('coco', split='validation', shuffle_files = True, data_dir= wd) #can choose to check val accuracy
presize = 320       # input image size
lms = [10, 20, 40]  # prediction grid sizes, coarse to fine
classes = 80
coords = 4
batch_size = 32
gamma = 5.          # focal loss focusing parameter
d = 1/3  # YOLOv8-nano depth multiplier
w = 1/4  # YOLOv8-nano width multiplier
r = 2.   # YOLOv8-nano ratio multiplier for the last stage
def imglab0(x): # encode one example into an image plus multi-scale ltrb labels
    image = x['image']
    image = tf.image.resize(image, [presize, presize]) / 255.
    b = x['objects']['bbox']
    c = x['objects']['label']
    for ii, jj in enumerate(lms):
        gsx = jj
        gsy = jj
        gsize = jj
        tfbase = tf.zeros([gsx, gsy, classes + coords], dtype = tf.float32)
        for i in range(len(b)):
            cls = int(c[i])
            ymin = b[i, 0] # tfds boxes are stored in [ymin, xmin, ymax, xmax] order
            xmin = b[i, 1]
            ymax = b[i, 2]
            xmax = b[i, 3]
            w = xmax - xmin
            h = ymax - ymin
            gsxmin = 1 + (xmin * gsx // 1) # this ensures positive ltrb outputs
            gsxmax = xmax * gsx // 1
            gsymin = 1 + ymin * gsy // 1
            gsymax = ymax * gsy // 1
            xgrid = tf.tile(tf.reshape(tf.range(gsx, dtype = tf.float32), [gsx, 1, 1]), [1, gsy, 1])
            ygrid = tf.tile(tf.reshape(tf.range(gsy, dtype = tf.float32), [1, gsy, 1]), [gsx, 1, 1])
            xmask = tf.where((xgrid < gsxmin) | (xgrid > gsxmax), x = 0., y = 1.)
            ymask = tf.where((ygrid < gsymin) | (ygrid > gsymax), x = 0., y = 1.)
            mask = xmask * ymask # 1 for grid points inside the box, 0 elsewhere
            coh = tf.tile(tf.reshape(tf.one_hot(cls, classes, dtype = tf.float32), [1, 1, classes]), [gsx, gsy, 1])
            cmask = mask * coh
            left = mask * (xgrid/gsx - xmin)
            top = mask * (ygrid/gsy - ymin)
            right = mask * (xmax - xgrid/gsx)
            bottom = mask * (ymax - ygrid/gsy)
            mbase = tf.concat([cmask, left, top, right, bottom], axis = -1)
            tfbb = (tfbase[..., classes:classes+1] + tfbase[..., classes+2:classes+3]) * (tfbase[..., classes+1:classes+2] + tfbase[..., classes+3:classes+4])
            tfob = tf.reduce_max(tfbase[..., :classes], axis = -1, keepdims = True)
            tfnoob = (1 - tfob)
            mbb = (left + right) * (top + bottom)
            mbob = mask
            mbnoob = (1 - mask)
            tferase = mbnoob * tf.where(mbb * tfob * mbob < tfbb * tfob * mbob, x = tf.cast(0, dtype = tf.float32), y = tf.cast(1, dtype = tf.float32)) # if the new box is smaller than the stored box, erase from tfbase; otherwise keep tfbase
            mberase = tf.maximum((1. - tferase), (mbob * tfnoob))
            tfbase = tferase * tfbase + mberase * mbase
        if jj == lms[0]:
            lbox = tfbase
        elif jj == lms[1]:
            mbox = tfbase
        else:
            sbox = tfbase
    lbox = tf.image.pad_to_bounding_box(lbox, 0, 0, lms[-1], lms[-1]) # pad coarse grids to 40x40 so all scales stack
    lbox = tf.expand_dims(lbox, axis = 0)
    mbox = tf.image.pad_to_bounding_box(mbox, 0, 0, lms[-1], lms[-1])
    mbox = tf.expand_dims(mbox, axis = 0)
    sbox = tf.expand_dims(sbox, axis = 0)
    return image, tf.concat([lbox, mbox, sbox], axis = 0)
def numx(x): # count person instances (label 0); keep only images with fewer than two
    t = x['objects']['label']
    t = tf.where(t == 0, x = 1, y = 0)
    return tf.reduce_sum(t) < 2
# add back a limited number of person images that also contain other over-represented classes
# (chair = 56, car = 2, dining table = 60)
person = coco.filter(lambda x: tf.logical_and(
    tf.logical_and(tf.reduce_any(x['objects']['label'] == 0), numx(x)),
    tf.reduce_any((x['objects']['label'] == 56) | (x['objects']['label'] == 2) | (x['objects']['label'] == 60)))).take(5000)
rest = coco.filter(lambda x: tf.reduce_all(x['objects']['label'] != 0)) # images with no people at all
findata = rest.concatenate(person)
coco = findata.shuffle(8000, reshuffle_each_iteration=True).map(imglab0, num_parallel_calls=tf.data.AUTOTUNE).batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
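As a quick sanity check of the pipeline above (optional), one batch should contain images of shape (32, 320, 320, 3) and labels of shape (32, 3, 40, 40, 84), with the 10x10 and 20x20 label grids zero-padded up to 40x40:

# optional: inspect one batch from the pipeline above
for images, labels in coco.take(1):
    print(images.shape)   # (32, 320, 320, 3)
    print(labels.shape)   # (32, 3, 40, 40, classes + coords) = (32, 3, 40, 40, 84)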
from tensorflow.keras.layers import Conv2D, Input, BatchNormalization, MaxPooling2D, ZeroPadding2D
def convolutional(input_layer, filters, kernel_size, downsample=False):
    if downsample:
        input_layer = ZeroPadding2D(((0, 1), (0, 1)))(input_layer)
        padding = 'valid'
        strides = 2
    else:
        strides = 1
        padding = 'same'
    conv = Conv2D(filters=filters, kernel_size=kernel_size, strides=strides,
                  padding=padding, #use_bias=False, kernel_regularizer=l2(0.0005),
                  kernel_initializer='he_normal',
                  )(input_layer)
    conv = BatchNormalization()(conv)
    conv = tf.keras.activations.relu(conv)
    return conv
def bottle(input_layer, shortcut = True):
    res = input_layer
    filters = input_layer.shape[-1]
    conv = convolutional(input_layer, filters = filters, kernel_size = 3)
    conv = convolutional(conv, filters = filters, kernel_size = 3)
    if shortcut:
        conv = res + conv
    return conv
def c2f(input_layer, filters, shortcut = True):
    res = convolutional(input_layer, filters = filters, kernel_size = 1)
    res0, conv = tf.split(res, 2, axis = -1)
    n = int(6 * d)
    out = tf.concat([res0, conv], axis = -1)
    for i in range(n):
        conv = bottle(conv, shortcut = shortcut)
        out = tf.concat([out, conv], axis = -1)
    route = convolutional(out, filters = filters, kernel_size = 1)
    return route
def sppf(input_layer):
    route = convolutional(input_layer, filters = int(512*w*r/4), kernel_size = 1)
    mp = MaxPooling2D(pool_size = 5, strides = 1, padding='same')(route)
    route = tf.concat([route, mp], axis = -1)
    mp = MaxPooling2D(pool_size = 5, strides = 1, padding='same')(mp)
    route = tf.concat([route, mp], axis = -1)
    mp = MaxPooling2D(pool_size = 5, strides = 1, padding='same')(mp)
    route = tf.concat([route, mp], axis = -1)
    route = convolutional(route, filters = int(512*w*r), kernel_size = 1)
    return route
def detect(input_layer):
    cls = convolutional(input_layer, 256, kernel_size = 3) #filters from yolox decoupled head
    cls = convolutional(cls, 256, kernel_size = 3)
    cls = Conv2D(filters = classes, kernel_size = 1, strides = 1)(cls)
    bbox = convolutional(input_layer, 256, kernel_size = 3)
    bbox = convolutional(bbox, 256, kernel_size = 3)
    bbox = Conv2D(filters = coords, kernel_size = 1, strides = 1)(bbox)
    return tf.concat([cls, bbox], axis = -1)
def fpn(lroute, mroute, sroute):
    route = sppf(lroute)
    lroute = route
    route = tf.image.resize(route, [route.shape[1] * 2, route.shape[2] * 2], method = 'nearest')
    route = tf.concat([route, mroute], axis = -1)
    route = c2f(route, filters = int(512*w), shortcut = False)
    mroute = route
    route = tf.image.resize(route, [route.shape[1] * 2, route.shape[2] * 2], method = 'nearest')
    route = tf.concat([route, sroute], axis = -1)
    route = c2f(route, filters = int(256*w), shortcut = False)
    sroute = route
    sroute = detect(sroute)
    route = convolutional(route, filters = int(256*w), kernel_size = 3, downsample = True)
    route = tf.concat([route, mroute], axis = -1)
    route = c2f(route, filters = int(512*w), shortcut = False)
    mroute = route
    mroute = detect(mroute)
    route = convolutional(route, filters = int(512*w), kernel_size = 3, downsample = True)
    route = tf.concat([route, lroute], axis = -1)
    lroute = c2f(route, filters = int(512*w*r), shortcut = False)
    lroute = detect(lroute)
    lroute = tf.image.pad_to_bounding_box(lroute, 0, 0, lms[-1], lms[-1]) # pad coarse outputs to 40x40 so the scales stack
    lroute = tf.expand_dims(lroute, axis = 1)
    mroute = tf.image.pad_to_bounding_box(mroute, 0, 0, lms[-1], lms[-1])
    mroute = tf.expand_dims(mroute, axis = 1)
    sroute = tf.expand_dims(sroute, axis = 1)
    outputs = tf.concat([lroute, mroute, sroute], axis = 1) #[batch_size, 3, gsize, gsize, classes+coords]
    return outputs
inputs = Input([presize,presize,3]) #yolov8 scratch block
route = convolutional(inputs, filters = int(64*w), kernel_size = 3, downsample = True)
route = convolutional(route, filters = int(128*w), kernel_size = 3, downsample = True)
route = c2f(route, filters = int(128*w), shortcut = True)
route = convolutional(route, filters = int(256*w), kernel_size = 3, downsample= True)
route = c2f(route, filters = int(256*w), shortcut = True)
sroute = route
route = convolutional(route, filters = int(512*w), kernel_size = 3, downsample = True)
route = c2f(route, filters = int(512*w), shortcut = True)
mroute = route
route = convolutional(route, filters = int(512*w*r), kernel_size = 3, downsample = True)
route = c2f(route, filters = int(512*w*r), shortcut = True)
lroute = route
outputs = fpn(lroute,mroute,sroute)
model = tf.keras.Model(inputs, outputs)
def bbox_ciou(b_true, b_pred, gsize):
    gsx, gsy = gsize, gsize
    xgrid = tf.tile(tf.reshape(tf.range(gsx, dtype = tf.float32), [1, gsx, 1, 1]), [batch_size, 1, gsy, 1])
    ygrid = tf.tile(tf.reshape(tf.range(gsy, dtype = tf.float32), [1, 1, gsy, 1]), [batch_size, gsx, 1, 1])
    # reconstruct box corners from the ltrb distances at each grid point
    lxmin = tf.maximum(0., xgrid/gsx - b_true[..., 0:1])
    lymin = tf.maximum(0., ygrid/gsy - b_true[..., 1:2])
    lxmax = tf.minimum(1., xgrid/gsx + b_true[..., 2:3])
    lymax = tf.minimum(1., ygrid/gsy + b_true[..., 3:4])
    b_true_w = (lxmax - lxmin)
    b_true_h = (lymax - lymin)
    pxmin = tf.maximum(0., xgrid/gsx - b_pred[..., 0:1])
    pymin = tf.maximum(0., ygrid/gsy - b_pred[..., 1:2])
    pxmax = tf.minimum(1., xgrid/gsx + b_pred[..., 2:3])
    pymax = tf.minimum(1., ygrid/gsy + b_pred[..., 3:4])
    b_pred_w = (pxmax - pxmin)
    b_pred_h = (pymax - pymin)
    b_true_mins = tf.concat([lxmin, lymin], axis = -1)
    b_true_maxes = tf.concat([lxmax, lymax], axis = -1)
    b_pred_mins = tf.concat([pxmin, pymin], axis = -1)
    b_pred_maxes = tf.concat([pxmax, pymax], axis = -1)
    intersect_mins = tf.maximum(b_true_mins, b_pred_mins)
    intersect_maxes = tf.minimum(b_true_maxes, b_pred_maxes)
    intersect_wh = tf.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_area = intersect_wh[..., 0:1] * intersect_wh[..., 1:2]
    b_true_area = b_true_w * b_true_h
    b_pred_area = b_pred_w * b_pred_h
    union_area = b_true_area + b_pred_area - intersect_area
    # calculate IoU; divide_no_nan avoids dividing by 0
    iou = tf.math.divide_no_nan(intersect_area, union_area)
    # get enclosed area
    enclose_mins = tf.minimum(b_true_mins, b_pred_mins)
    enclose_maxes = tf.maximum(b_true_maxes, b_pred_maxes)
    enclose_wh = tf.maximum(enclose_maxes - enclose_mins, 0.)
    # box center distance
    b_true_x = tf.reduce_mean(tf.concat([lxmin, lxmax], axis = -1), axis = -1, keepdims = True)
    b_true_y = tf.reduce_mean(tf.concat([lymin, lymax], axis = -1), axis = -1, keepdims = True)
    b_true_xy = tf.concat([b_true_x, b_true_y], axis = -1)
    b_pred_x = tf.reduce_mean(tf.concat([pxmin, pxmax], axis = -1), axis = -1, keepdims = True)
    b_pred_y = tf.reduce_mean(tf.concat([pymin, pymax], axis = -1), axis = -1, keepdims = True)
    b_pred_xy = tf.concat([b_pred_x, b_pred_y], axis = -1)
    center_distance = tf.reduce_sum(tf.square(b_true_xy - b_pred_xy), axis = -1, keepdims = True)
    # get enclosed diagonal distance
    enclose_diagonal = tf.reduce_sum(tf.square(enclose_wh), axis = -1, keepdims = True)
    # calculate DIoU; divide_no_nan avoids dividing by 0
    diou = iou - tf.math.divide_no_nan(center_distance, enclose_diagonal)
    b_true_area = (b_true[..., 2] - b_true[..., 0]) * (b_true[..., 3] - b_true[..., 1]) # (unused below)
    b_pred_area = (b_pred[..., 2] - b_pred[..., 0]) * (b_pred[..., 3] - b_pred[..., 1]) # (unused below)
    # aspect-ratio consistency term and its weight
    pi = 3.14159265359
    v = (4 / pi ** 2) * tf.square(tf.math.atan2(b_true_w, b_true_h) - tf.math.atan2(b_pred_w, b_pred_h))
    alpha = tf.math.divide_no_nan(v, ((1.0) - iou + v))
    ciou = diou - alpha * v
    return ciou
def inloss(y_true, y_pred, gsize):
    object_mask = tf.where(tf.reduce_max(y_true[..., :classes], axis = -1, keepdims = True) > 0., x = 1., y = 0.)
    area = tf.expand_dims((y_true[..., classes] + y_true[..., classes+2]) * (y_true[..., classes+1] + y_true[..., classes+3]), axis = -1)
    pred_ltrb = tf.nn.sigmoid(y_pred[..., classes:classes+coords])
    label_ltrb = y_true[..., classes:classes+coords]
    track = tf.reduce_max(y_true[..., :classes] * tf.nn.sigmoid(y_pred[..., :classes]), axis = -1, keepdims = True)
    ciou = bbox_ciou(label_ltrb, pred_ltrb, gsize)
    classciou = y_true[..., :classes] * ciou + (1. - y_true[..., :classes]) # cross-train term, active only for positive samples
    bbox_loss_scale = (2 - area)
    box_loss = object_mask * bbox_loss_scale * (1. - ciou * track)
    box_loss = tf.math.divide_no_nan(tf.reduce_sum(box_loss), tf.reduce_sum(object_mask))
    pred_prob = tf.sigmoid(y_pred[..., :classes])
    pred_prob = pred_prob * classciou # cross train
    mod_factor = y_true[..., :classes] * tf.pow(1. - pred_prob, gamma) + (1. - y_true[..., :classes]) * tf.pow(pred_prob, gamma)
    class_loss = mod_factor * tf.keras.losses.BinaryCrossentropy(axis = -1, reduction = 'none')(y_true = tf.expand_dims(y_true[..., :classes], axis = -1), y_pred = tf.expand_dims(pred_prob, axis = -1))
    class_loss = tf.math.divide_no_nan(tf.reduce_sum(tf.reduce_mean(class_loss, axis = -1)), tf.reduce_sum(object_mask))
    return class_loss + box_loss
def yolo_loss(y_true, y_pred):
    llbox = y_true[:, 0, :lms[0], :lms[0], :]
    lmbox = y_true[:, 1, :lms[1], :lms[1], :]
    lsbox = y_true[:, 2, ...]
    plbox = y_pred[:, 0, :lms[0], :lms[0], :]
    pmbox = y_pred[:, 1, :lms[1], :lms[1], :]
    psbox = y_pred[:, 2, ...]
    lloss = inloss(llbox, plbox, lms[0])
    mloss = inloss(lmbox, pmbox, lms[1])
    sloss = inloss(lsbox, psbox, lms[2])
    return lloss + mloss + sloss
lr = tf.Variable(1e-2)   # manually controlled learning rate
decay = (1e-1)**(1/7)    # seven decay steps reduce the lr by a factor of 10
oldloss = tf.Variable(100.)
class printLR(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        lra = self.model.optimizer.lr(epoch)
        print('lr:', lra.numpy())
        return
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        newloss = logs.get("loss")
        if lr < 1e-6:
            lr.assign(1e-3)        # restart the schedule once the lr bottoms out
        elif oldloss - newloss < oldloss * .01:
            lr.assign(lr * decay)  # decay when the epoch loss improves by less than 1%
        oldloss.assign(newloss)
        return
file_name = '/whatever you want to call the weights file.hdf5'
mcp_save = tf.keras.callbacks.ModelCheckpoint(wd+file_name, save_best_only=True, monitor='loss', save_weights_only=True)
epochs = 100
model.compile(loss = yolo_loss,
              optimizer = tf.keras.optimizers.SGD(learning_rate = tf.keras.optimizers.schedules.CosineDecay(initial_learning_rate = lr, decay_steps = 1, alpha = 1e0)))
model.fit(coco, epochs = epochs, callbacks = [printLR(), mcp_save],
          # validation_data = val,
          )
#Output
lr: 0.01
Epoch 1/100
1271/1271 [==============================] - 202s 147ms/step - loss: 3.7085
lr: 0.01
Epoch 2/100
1271/1271 [==============================] - 195s 148ms/step - loss: 3.5942
lr: 0.01
Epoch 3/100
1271/1271 [==============================] - 194s 147ms/step - loss: 3.5219
lr: 0.01
Epoch 4/100
1271/1271 [==============================] - 197s 149ms/step - loss: 3.4529
lr: 0.01
Epoch 5/100
1271/1271 [==============================] - 202s 153ms/step - loss: 3.3877
lr: 0.01
Epoch 6/100
1271/1271 [==============================] - 201s 152ms/step - loss: 3.3458
lr: 0.01
Epoch 7/100
1271/1271 [==============================] - 196s 148ms/step - loss: 3.2876
lr: 0.01
Epoch 8/100
1271/1271 [==============================] - 194s 147ms/step - loss: 3.2399
lr: 0.01
Epoch 9/100
1271/1271 [==============================] - 194s 147ms/step - loss: 3.2049
lr: 0.01
Epoch 10/100
1271/1271 [==============================] - 196s 149ms/step - loss: 3.1658
lr: 0.01
Epoch 11/100
1271/1271 [==============================] - 194s 147ms/step - loss: 3.1256
lr: 0.01
Epoch 12/100
1271/1271 [==============================] - 195s 148ms/step - loss: 3.0926
lr: 0.01
Epoch 13/100
1271/1271 [==============================] - 195s 148ms/step - loss: 3.0502
lr: 0.01
Epoch 14/100
1271/1271 [==============================] - 195s 148ms/step - loss: 3.0237
lr: 0.0071968567
Epoch 15/100
1271/1271 [==============================] - 197s 150ms/step - loss: 2.9624
lr: 0.0071968567
Epoch 16/100
1271/1271 [==============================] - 195s 148ms/step - loss: 2.9221
lr: 0.0071968567
Epoch 17/100
1271/1271 [==============================] - 197s 147ms/step - loss: 2.8948
lr: 0.0051794746
Epoch 18/100
1271/1271 [==============================] - 198s 148ms/step - loss: 2.8256
lr: 0.0051794746
Epoch 19/100
1271/1271 [==============================] - 200s 152ms/step - loss: 2.7890
lr: 0.0051794746
Epoch 20/100
1271/1271 [==============================] - 196s 148ms/step - loss: 2.7620
lr: 0.0037275937
Epoch 21/100
1271/1271 [==============================] - 200s 152ms/step - loss: 2.6963
lr: 0.0037275937
Epoch 22/100
1271/1271 [==============================] - 197s 149ms/step - loss: 2.6645
lr: 0.0037275937
Epoch 23/100
1271/1271 [==============================] - 198s 150ms/step - loss: 2.6352
lr: 0.0037275937
Epoch 24/100
1271/1271 [==============================] - 195s 148ms/step - loss: 2.6058
lr: 0.0037275937
Epoch 25/100
1271/1271 [==============================] - 198s 151ms/step - loss: 2.5866
lr: 0.0026826959
Epoch 26/100
1271/1271 [==============================] - 195s 148ms/step - loss: 2.5188
lr: 0.0026826959
Epoch 27/100
1271/1271 [==============================] - 198s 149ms/step - loss: 2.4802
lr: 0.0026826959
Epoch 28/100
1271/1271 [==============================] - 195s 148ms/step - loss: 2.4646
lr: 0.0019306978
Epoch 29/100
1271/1271 [==============================] - 202s 152ms/step - loss: 2.4079
lr: 0.0019306978
Epoch 30/100
1271/1271 [==============================] - 198s 150ms/step - loss: 2.3688
lr: 0.0019306978
Epoch 31/100
1271/1271 [==============================] - 195s 148ms/step - loss: 2.3503
lr: 0.0013894956
Epoch 32/100
1271/1271 [==============================] - 200s 152ms/step - loss: 2.3160
lr: 0.0013894956
Epoch 33/100
1271/1271 [==============================] - 198s 151ms/step - loss: 2.2775
lr: 0.0013894956
Epoch 34/100
1271/1271 [==============================] - 205s 154ms/step - loss: 2.2632
lr: 0.001
Epoch 35/100
1271/1271 [==============================] - 205s 152ms/step - loss: 2.2374
lr: 0.001
Epoch 36/100
1271/1271 [==============================] - 196s 149ms/step - loss: 2.2057
lr: 0.001
Epoch 37/100
1271/1271 [==============================] - 200s 152ms/step - loss: 2.1920
lr: 0.0007196857
Epoch 38/100
1271/1271 [==============================] - 203s 153ms/step - loss: 2.1703
lr: 0.0005179475
Epoch 39/100
1271/1271 [==============================] - 204s 154ms/step - loss: 2.1455
lr: 0.0005179475
Epoch 40/100
1271/1271 [==============================] - 203s 152ms/step - loss: 2.1303
lr: 0.0003727594
Epoch 41/100
1271/1271 [==============================] - 207s 154ms/step - loss: 2.1197
lr: 0.0002682696
Epoch 42/100
1271/1271 [==============================] - 204s 154ms/step - loss: 2.1059
lr: 0.00019306978
Epoch 43/100
1271/1271 [==============================] - 202s 153ms/step - loss: 2.0956
lr: 0.00013894956
Epoch 44/100
1271/1271 [==============================] - 203s 153ms/step - loss: 2.0896
lr: 0.000100000005
Epoch 45/100
1271/1271 [==============================] - 204s 154ms/step - loss: 2.0829
lr: 7.196857e-05
Epoch 46/100
1271/1271 [==============================] - 203s 153ms/step - loss: 2.0824
lr: 5.179475e-05
Epoch 47/100
1271/1271 [==============================] - 200s 152ms/step - loss: 2.0817
lr: 3.727594e-05
Epoch 48/100
1271/1271 [==============================] - 196s 149ms/step - loss: 2.0751
lr: 2.682696e-05
Epoch 49/100
1271/1271 [==============================] - 197s 150ms/step - loss: 2.0736
lr: 1.9306979e-05
Epoch 50/100
1271/1271 [==============================] - 202s 150ms/step - loss: 2.0768
lr: 1.3894956e-05
Epoch 51/100
1271/1271 [==============================] - 195s 148ms/step - loss: 2.0755
lr: 1.0000001e-05
Epoch 52/100
1271/1271 [==============================] - 201s 152ms/step - loss: 2.0738
lr: 7.196857e-06
Epoch 53/100
1271/1271 [==============================] - 196s 148ms/step - loss: 2.0727
lr: 5.179475e-06
Epoch 54/100
1271/1271 [==============================] - 202s 149ms/step - loss: 2.0749
lr: 3.727594e-06
Epoch 55/100
1271/1271 [==============================] - 196s 148ms/step - loss: 2.0722
lr: 2.682696e-06
Epoch 56/100
1271/1271 [==============================] - 199s 151ms/step - loss: 2.0687
lr: 1.9306979e-06
Epoch 57/100
1271/1271 [==============================] - 202s 153ms/step - loss: 2.0731
lr: 1.3894955e-06
Epoch 58/100
1271/1271 [==============================] - 202s 153ms/step - loss: 2.0723
lr: 1e-06
Epoch 59/100
1271/1271 [==============================] - 201s 152ms/step - loss: 2.0687
lr: 7.1968566e-07
Epoch 60/100
1271/1271 [==============================] - 204s 154ms/step - loss: 2.0789
lr: 0.001
Epoch 61/100
1271/1271 [==============================] - 204s 154ms/step - loss: 2.1174
lr: 0.0007196857
Epoch 62/100
1271/1271 [==============================] - 204s 154ms/step - loss: 2.1042
lr: 0.0005179475
Epoch 63/100
1271/1271 [==============================] - 204s 153ms/step - loss: 2.0840
lr: 0.0003727594
Epoch 64/100
1271/1271 [==============================] - 205s 154ms/step - loss: 2.0630
lr: 0.0003727594
Epoch 65/100
1271/1271 [==============================] - 201s 152ms/step - loss: 2.0498
lr: 0.0002682696
Epoch 66/100
1271/1271 [==============================] - 203s 154ms/step - loss: 2.0465
lr: 0.00019306978
Epoch 67/100
1271/1271 [==============================] - 203s 154ms/step - loss: 2.0320
lr: 0.00013894956
Epoch 68/100
1271/1271 [==============================] - 204s 153ms/step - loss: 2.0263
lr: 0.000100000005
Epoch 69/100
1271/1271 [==============================] - 206s 153ms/step - loss: 2.0242
lr: 7.196857e-05
Epoch 70/100
1271/1271 [==============================] - 203s 153ms/step - loss: 2.0198
lr: 5.179475e-05
Epoch 71/100
1271/1271 [==============================] - 199s 150ms/step - loss: 2.0173
lr: 3.727594e-05
Epoch 72/100
1271/1271 [==============================] - 204s 154ms/step - loss: 2.0122
lr: 2.682696e-05
Epoch 73/100
1271/1271 [==============================] - 205s 155ms/step - loss: 2.0166
lr: 1.9306979e-05
Epoch 74/100
1271/1271 [==============================] - 201s 152ms/step - loss: 2.0131
lr: 1.3894956e-05
Epoch 75/100
1271/1271 [==============================] - 207s 155ms/step - loss: 2.0080
lr: 1.0000001e-05
Epoch 76/100
1271/1271 [==============================] - 205s 155ms/step - loss: 2.0101
lr: 7.196857e-06
Epoch 77/100
1271/1271 [==============================] - 201s 151ms/step - loss: 2.0089
lr: 5.179475e-06
Epoch 78/100
1271/1271 [==============================] - 200s 151ms/step - loss: 2.0112
lr: 3.727594e-06
Epoch 79/100
1271/1271 [==============================] - 200s 151ms/step - loss: 2.0081
lr: 2.682696e-06
Epoch 80/100
1271/1271 [==============================] - 207s 154ms/step - loss: 2.0096
The above VOC images show some highlighted predictions on a completely different dataset than the COCO data we trained on. These samples were definitely cherry-picked, but the model can be improved simply by training on the entire COCO dataset and using a YOLOv8 variant with many more layers, such as YOLOv8x. Adding regularization and swapping the ReLU activation for swish would also be easy ways to improve it.
The above images show the heatmap of the FPN classification outputs for the TV monitor class. Red signifies a higher classification score for that class. At a given scale, a bounding box is produced only if the classification score for some class reaches a threshold of 0.99.
The above images show the same heatmap for the dining table class. Other predictions may not perform quite as well, but the fact that the model can predict on these images with a decent level of accuracy shows that it really is learning object detection.
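For completeness, here is one possible way to decode a single scale of the model output into boxes at that 0.99 threshold. This is a hypothetical helper of my own, following the ltrb label layout built in imglab0; it is not part of the original code:

import numpy as np
def decode_scale(pred, gsize, thresh=0.99):
    # pred: one [gsize, gsize, classes + coords] slice of the model output (raw logits),
    # e.g. model.predict(images)[0, 2, :lms[2], :lms[2], :] for the 40x40 scale
    cls_prob = 1. / (1. + np.exp(-pred[..., :classes]))              # sigmoid class scores
    ltrb = 1. / (1. + np.exp(-pred[..., classes:classes + coords]))  # sigmoid ltrb distances
    boxes, scores, labels = [], [], []
    for gx in range(gsize):
        for gy in range(gsize):
            c = int(np.argmax(cls_prob[gx, gy]))
            score = float(cls_prob[gx, gy, c])
            if score < thresh:
                continue
            l, t, r, b = ltrb[gx, gy]
            px, py = gx / gsize, gy / gsize   # grid point location, following the label layout in imglab0
            boxes.append([px - l, py - t, px + r, py + b])   # xmin, ymin, xmax, ymax (normalized)
            scores.append(score)
            labels.append(c)
    return boxes, scores, labels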