# 3 – 03 A Convolutional Approach To Sliding Windows V3

Since objects can be anywhere in a given image, you can make sure to detect all of them by sliding a small window over the entire image and checking for objects within each of the resulting windows. This is the sliding windows approach. Let’s see how this works in detail.

Suppose I’ve trained my CNN to detect three classes: a person, a cat, and a dog. Now I want to use this trained CNN to detect the person in this image. The first step in sliding windows is to choose the size of your window. We want it small enough to capture any small objects in an image. Then we place our window at the beginning of the image and feed the region inside the window to the trained CNN. For each region, the CNN will output a prediction, which is this output vector y.

Notice that the first element in this vector, PC, is different from what we’ve seen before. PC is a probability between zero and one that an object exists within the window at all. If no object is detected, we don’t have to proceed with trying to classify that particular region of the image. The next values in the vector are as usual: C1, C2, and C3, which correspond to the class of the object detected, followed by the bounding box coordinates.

In this example, we see that this first window region doesn’t contain any of the classes we’re looking for: no person, no cat, and no dog. Therefore the CNN will output a vector with PC equal to zero, because no objects were detected inside the window. Then we move along, sliding the window to the right using some small stride, and repeat this process. Small strides are used to make sure that we catch every object and to locate each object within a few pixels of its true position. Now, since this region also doesn’t contain any objects, it will return PC equal to zero again. We repeat this process until we cover the entire image. You might also analyze the whole image again using windows of a different size.
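The scanning loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_classifier` is a hypothetical stand-in for the trained CNN (it simply reports an object when the region is mostly bright), the names `sliding_windows`, `crop`, and `detect` are made up for this sketch, and the output vector is shortened to `[p_c, c1, c2, c3]`, with the window itself serving as the bounding box.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Yield (top, left, bottom, right) for each square window position."""
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield top, left, top + window, left + window

def crop(image, top, left, bottom, right):
    """Cut the window region out of an image stored as a list of rows."""
    return [row[left:right] for row in image[top:bottom]]

def toy_classifier(region):
    """Hypothetical stand-in for the trained CNN (an assumption, not a real model).

    Returns [p_c, c1, c2, c3]; it reports a "person" (c1) whenever the
    region is mostly bright.
    """
    pixels = [p for row in region for p in row]
    p_c = 1.0 if sum(pixels) / len(pixels) > 0.5 else 0.0
    return [p_c, 1.0, 0.0, 0.0]

def detect(image, classify, window, stride, threshold=0.5):
    """Classify every window and keep those whose p_c reaches the threshold."""
    h, w = len(image), len(image[0])
    detections = []
    for box in sliding_windows(h, w, window, stride):
        y = classify(crop(image, *box))
        p_c, class_scores = y[0], y[1:4]
        if p_c < threshold:
            continue  # no object in this window, so skip classification
        class_id = class_scores.index(max(class_scores))
        detections.append((box, class_id, p_c))
    return detections

# An 8x8 toy image whose bottom-right 4x4 quadrant is bright.
image = [[1.0 if r >= 4 and c >= 4 else 0.0 for c in range(8)] for r in range(8)]
found = detect(image, toy_classifier, window=4, stride=2)
print(found)  # [((4, 4, 8, 8), 0, 1.0)]
```

With a 4-pixel window and a stride of 2, the loop visits nine windows, and only the one that exactly covers the bright quadrant passes the PC check.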
One of the detection windows might nicely capture a small object in an image, and another might better capture large objects. Here we see that when we feed the region inside this particular window to our CNN, it produces a y vector with PC equal to one, indicating that an object has been found, and with C1 equal to one, indicating that the object is a person. In reality, these probability values will typically be very close to one, but not quite equal to it, because of some uncertainty. Our model also gives us the predicted bounding box coordinates, which are determined by the window.

The sliding windows approach works well, but it’s very computationally expensive, because we have to scan the entire image with windows of different sizes and each window has to be fed into the CNN. You’ve seen that one way to get around this problem is to project regions of interest in the input image onto a set of feature maps at a layer deeper in the CNN. That way you can process an image through several convolutional and pooling layers just once and use the resulting feature maps to analyze different regions of the input image.

But YOLO takes a different approach, and again looks at each part of an image only once, without overlapping windows. How do you think you might break up an entire image so that you could analyze it without looking at one region more than once?
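To put a rough number on the cost of sliding windows mentioned above, here is a small counting sketch. The image size, window sizes, and stride are assumed values chosen only for illustration, and `count_windows` is a hypothetical helper; the point is that every counted window means one more pass through the CNN.

```python
def count_windows(image_size, window, stride):
    """Number of square windows that fit in a square image at a given stride."""
    per_axis = (image_size - window) // stride + 1
    return per_axis * per_axis

image_size = 416                         # assumed input resolution
configs = [(32, 8), (64, 8), (128, 8)]   # assumed (window, stride) pairs

for window, stride in configs:
    print(window, count_windows(image_size, window, stride))

total = sum(count_windows(image_size, w, s) for w, s in configs)
print("CNN passes needed:", total)  # CNN passes needed: 5795
```

Even with only three window sizes, a single image needs several thousand separate forward passes under these assumptions.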