How does YOLO find the correct bounding box when it looks at an image broken up by a grid? The trick is that it assigns the ground-truth bounding box for each object in a training image to only one grid cell. So only one grid cell is responsible for locating the object. Now, how is this grid cell chosen? For each training image, we locate the midpoint of each object in the image, and then we assign the true bounding box to the grid cell that contains that midpoint.
Let’s see an example. In this image of a person, we locate the midpoint of the person, indicated by the yellow dot. This dot is contained by one grid cell, so we assign the ground-truth bounding box to this grid cell alone, and the ground-truth vector for this grid cell, which will be used for training, will look like this. Even though this other grid cell also contains part of the person, its ground-truth vector will look like this. We treat it as if it does not contain an object, and its p_c value equals zero.
Now let’s see how we determine the numerical values of X, Y, W, and H. In the YOLO algorithm, X and Y determine the coordinates of the center of the bounding box relative to the grid cell, and W and H determine the width and the height of the box relative to the whole image. The convention is that the upper left corner of a grid cell has coordinates (0, 0) and the bottom right-hand corner has coordinates (1, 1). So in this example, the center point relative to the grid cell coordinate system is X equal to about 0.5 and Y equal to about 0.3. The width of the bounding box, W, is 0.1 because its width is about 10 percent of the width of the entire image, and the height, H, is 0.4 because its height is about 40 percent of the height of the entire image.
Notice that in this system, all bounding box coordinate values fall between zero and one, and the width and height of the bounding box can be bigger than the size of the grid cell. This technique is very similar to normalization.
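
To make the midpoint assignment and the coordinate convention concrete, here is a minimal Python sketch. It is only an illustration, not the reference YOLO implementation: the function names `encode_box` and `objectness_grid`, the 7 by 7 grid, the 700 by 700 pixel image, and the pixel values for the box are all assumptions chosen so that the numbers match the example above.

```python
def encode_box(cx, cy, box_w, box_h, img_w, img_h, grid_size):
    """Pick the grid cell that contains the box midpoint and encode the box.

    cx, cy, box_w, box_h are in pixels (an assumption for this sketch).
    Returns (row, col, (x, y, w, h)) where x and y are relative to the
    chosen cell and w and h are relative to the whole image.
    """
    cell_w = img_w / grid_size
    cell_h = img_h / grid_size

    # The cell containing the midpoint is the only one assigned the box.
    # min() keeps a midpoint on the far image edge inside the last cell.
    col = min(int(cx // cell_w), grid_size - 1)
    row = min(int(cy // cell_h), grid_size - 1)

    # Center offsets inside the cell: (0, 0) is the cell's upper left
    # corner and (1, 1) is its bottom right corner.
    x = cx / cell_w - col
    y = cy / cell_h - row

    # Width and height as fractions of the whole image.
    w = box_w / img_w
    h = box_h / img_h
    return row, col, (x, y, w, h)


def objectness_grid(row, col, grid_size):
    """p_c per cell: 1 for the assigned cell, 0 everywhere else."""
    return [[1 if (r, c) == (row, col) else 0 for c in range(grid_size)]
            for r in range(grid_size)]


# Hypothetical person box: midpoint at (350, 230), 70 px wide, 280 px tall.
row, col, (x, y, w, h) = encode_box(350, 230, 70, 280, 700, 700, 7)
print(row, col)                      # assigned cell
print(round(x, 2), round(y, 2), w, h)  # approximately 0.5, 0.3, 0.1, 0.4

p_c = objectness_grid(row, col, 7)
print(sum(map(sum, p_c)))            # exactly one cell has p_c = 1
```

Note that the box here is 280 pixels tall while a cell is only 100 pixels tall, so other cells overlap the person, yet their p_c stays zero because they do not contain the midpoint.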
Standardizing the range of the bounding box values makes the algorithm easier to train and helps it converge to a smaller error. But there’s one problem with this method. Imagine that a network has been trained on a fine grid, and only one small grid cell has a true bounding box for an object in an image. What do you think will happen once a network like this sees a new test image with an object in it?