Training on grid cells requires a very specific kind of training data. To train a network to output a predicted vector of class scores and box coordinates for each cell, we need a true vector to compare it to. So, for each training image, we have to break it into a grid and manually assign a ground truth vector to each grid cell.

Once we have this grid-cell-labeled training data, the second step is to design a CNN that can be trained using these vectors. Since we have a 7 by 10 grid in our example, and each grid cell has an associated eight-dimensional ground truth vector, we have to design our CNN so that its output layer has a size of 7 by 10 by 8. We can think of this as a 7 by 10 image with a depth of eight. So each pixel value, instead of being a vector of length three as in RGB images, is an eight-dimensional output vector. This way, for each input grid cell, there's an eight-dimensional output vector in the output layer of the CNN. For example, when the network sees the first grid cell, it will produce an output vector in the upper left corner of the output layer. Having defined this output shape, we can train the CNN on images paired with their ground truth grid vectors. Once the CNN has been trained, we can use it to detect and localize objects in test images. Next, let's see how this method produces accurate bounding boxes.
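
To make the labeling step described above more concrete, here is a minimal NumPy sketch of how a 7 by 10 by 8 ground truth array might be built for one image. Several things here are assumptions rather than something stated in this section: the function name `build_target` is hypothetical, the eight values are assumed to be laid out as one object-presence flag, three class scores, and four box values (x, y, w, h), each object is assigned to the cell containing its box center, and at most one object is assumed per cell.

```python
import numpy as np

GRID_ROWS, GRID_COLS = 7, 10    # grid size from the example
NUM_CLASSES = 3                 # assumption: three object classes
VEC_LEN = 1 + NUM_CLASSES + 4   # assumed layout: [p_c, c1, c2, c3, x, y, w, h]

def build_target(boxes, img_w, img_h):
    """Build a (7, 10, 8) ground truth array for one image.

    boxes: list of (class_id, x_center, y_center, width, height) in pixels.
    Hypothetical helper; assumes at most one object per grid cell.
    """
    target = np.zeros((GRID_ROWS, GRID_COLS, VEC_LEN), dtype=np.float32)
    cell_w = img_w / GRID_COLS
    cell_h = img_h / GRID_ROWS
    for class_id, xc, yc, w, h in boxes:
        # The cell that contains the box center is responsible for the object.
        col = min(int(xc // cell_w), GRID_COLS - 1)
        row = min(int(yc // cell_h), GRID_ROWS - 1)
        target[row, col] = 0.0                 # clear any earlier label
        target[row, col, 0] = 1.0              # object is present
        target[row, col, 1 + class_id] = 1.0   # one-hot class score
        target[row, col, 1 + NUM_CLASSES:] = [
            xc / cell_w - col,   # center x, relative to the cell
            yc / cell_h - row,   # center y, relative to the cell
            w / img_w,           # width, relative to the image
            h / img_h,           # height, relative to the image
        ]
    return target

# One object of class 1, centered in the upper left cell of a 1000x700 image.
target = build_target([(1, 50.0, 40.0, 60.0, 30.0)], img_w=1000, img_h=700)
print(target.shape)  # (7, 10, 8)
```

The resulting array has the same shape as the CNN's output layer, so the two can be compared cell by cell during training. Cells with no object keep an all-zero vector in this sketch.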