In the classification tasks we’ve seen so far, we give an image to a CNN and it outputs a label for that entire image. But sometimes you want a little more information, say, where the object is actually located in the image. This is called localization. For example, say you have an image of basketball players and you want to identify the player who has possession of the basketball at this moment. To complete this task, you need to locate and identify both the ball and the person holding it. Localization has been used in a variety of safety applications too. For example, baby monitors that check to see if a baby is safely in a crib, and even safe-driving applications in which a camera checks whether a driver is distracted, by looking at the location of the driver in the vehicle and the locations of their eyes, cell phone, and other items around them. Localization is important for any application that relies on the proximity of two or more objects. So let’s look at a simple localization example: this image of a cat. In addition to labeling this image as a cat, we also want to locate the cat in the image. The typical way of doing this is by drawing a bounding box around the cat. This box can be thought of as a collection of coordinates that define it: x and y, which could be the center of the box, and w and h, the width and height of the box. To find these values, we can use a lot of the same structure as in a typical classification CNN. One way to perform localization is to first put a given image through a series of convolutional and pooling layers to create a feature vector for that image. You keep the same fully-connected layers to perform classification, and you add another fully-connected layer, attached to the feature vector, whose job is to predict the location and size of a bounding box. I’ll call these the bounding box coordinates.
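The two-headed structure just described can be sketched in a few lines. This is an illustrative NumPy mock-up, not a trained network: it assumes the convolutional and pooling layers have already produced a 256-dimensional feature vector, and all layer sizes and weights here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend output of the convolutional/pooling layers for one image.
feature_vector = rng.standard_normal(256)

# Classification head: a fully-connected layer producing one score per class
# (10 classes is an arbitrary choice for this sketch).
W_class = rng.standard_normal((10, 256)) * 0.01
b_class = np.zeros(10)
class_scores = W_class @ feature_vector + b_class

# Bounding-box head: a second fully-connected layer attached to the SAME
# feature vector, predicting four numbers: x, y (box center), w, h.
W_box = rng.standard_normal((4, 256)) * 0.01
b_box = np.zeros(4)
box_coords = W_box @ feature_vector + b_box
x, y, w, h = box_coords

print(class_scores.shape)  # one score per class
print(box_coords.shape)    # the four bounding box coordinates
```

The key design point is that both heads share the same feature vector, so the convolutional layers only run once per image.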
In this one CNN, we have one output path whose job is to produce a class for the object pictured in an image, and another whose job is to produce the bounding box coordinates for that object. In this case, we’re assuming that the input image not only has an associated true label, but that it also has a true bounding box. This way, we can train our network by comparing the predicted and true values for both the classes and the bounding boxes. Now, we know how to use something like cross-entropy loss to measure the performance of a classification model; cross-entropy operates on probabilities with values between zero and one. But for the bounding box we need something different: a function that measures the error between our predicted bounding box and the true bounding box. Next, you’ll see what kinds of loss functions are appropriate for a regression problem like this, one that compares quantities instead of class scores.
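To make the contrast concrete, here is a sketch of how the two predictions could each be compared against their ground truth during training. Cross-entropy handles the class scores; for the box, mean squared error is used purely as a placeholder regression loss, since the appropriate choices are the subject of the next section. The example values and the 0.5 weight are illustrative assumptions.

```python
import numpy as np

def cross_entropy(scores, true_class):
    # Softmax turns raw class scores into probabilities between 0 and 1,
    # then cross-entropy penalizes a low probability on the true class.
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    return -np.log(probs[true_class])

def box_regression_loss(pred_box, true_box):
    # Compares quantities (x, y, w, h) directly, rather than probabilities.
    return np.mean((pred_box - true_box) ** 2)

pred_scores = np.array([2.0, 0.5, -1.0])          # raw class scores
pred_box    = np.array([48.0, 52.0, 30.0, 22.0])  # predicted x, y, w, h
true_class  = 0
true_box    = np.array([50.0, 50.0, 32.0, 20.0])

# One common way to train both heads at once: a weighted sum of the two
# losses, where the weight is a hyperparameter you would tune.
total_loss = (cross_entropy(pred_scores, true_class)
              + 0.5 * box_regression_loss(pred_box, true_box))
print(total_loss)
```

Note that a perfect box prediction drives the regression term to zero, while the classification term only approaches zero as the true class’s probability approaches one.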