So far, you’ve seen a variety of image processing techniques that play a foundational role in pattern recognition tasks, such as image classification. You’ve seen how convolutional neural networks follow a series of steps to classify an image. Just to recap: a CNN first takes in an input image, then passes it through several convolutional and pooling layers. The result is a set of feature maps, smaller than the original image, that have learned through training to distill information about the content of that image. We then flatten these feature maps into a vector, which we pass to a series of fully-connected linear layers to produce a probability distribution over class scores. From this, we can extract the predicted class for the input image. In short, an image comes in and a predicted class label comes out; a minimal code sketch of this pipeline appears at the end of this section.

In classification tasks like these, there’s usually a single object per image that a network is expected to classify. But in the real world, we’re often faced with much more complex visual scenes, scenes with many overlapping objects. We can see and classify many objects at a time, and even estimate things like the distance between objects in a scene.

In this lesson, we’ll look at different kinds of CNN architectures and see how they’ve evolved over time. Specifically, we’ll look at models that detect multiple objects in a scene, like Faster R-CNN and YOLO: two kinds of networks that can look at an image, break it up into smaller regions, and label each region with a class, so that a variable number of objects in a given image can be localized and labeled. A short example of running a pre-trained detector also appears at the end of this section.

Later on in the course, you’ll also learn about recurrent neural networks, which allow us to process and generate sequences of data, such as a sequence of image frames or a sequence of words. That’s useful if you want to describe visual scenes, as in automatic image captioning.

So, let’s start by looking at some complex tasks that CNNs can be applied to.
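To make the recap concrete, here is a minimal sketch of that classification pipeline. This code is not from the lesson itself; it assumes PyTorch is available, and the input size (32×32 RGB), layer widths, and number of classes (10) are illustrative choices rather than values given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """A small CNN: conv/pool layers -> flatten -> fully-connected class scores."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional layers learn feature maps from the input image.
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        # Pooling shrinks the spatial size of the feature maps.
        self.pool = nn.MaxPool2d(2, 2)
        # A fully-connected layer maps the flattened features to class scores.
        self.fc = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 32x32 -> 16x16
        x = self.pool(F.relu(self.conv2(x)))  # 16x16 -> 8x8
        x = torch.flatten(x, 1)               # flatten feature maps into a vector
        return self.fc(x)                     # raw class scores (logits)

# One fake 32x32 RGB image stands in for real input.
image = torch.randn(1, 3, 32, 32)
model = SimpleCNN()
probs = F.softmax(model(image), dim=1)  # probability distribution over classes
predicted = probs.argmax(dim=1)         # the predicted class label
print(predicted.item())
```

An image goes in and a single predicted class label comes out, which is exactly the limitation that the detection models in this lesson are meant to address.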
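And here is a hedged sketch of running a pre-trained multi-object detector, to show what “a variable number of localized, labeled objects” looks like in practice. It assumes the torchvision library; `fasterrcnn_resnet50_fpn` with the `weights="DEFAULT"` argument is how recent torchvision releases load a COCO-pretrained Faster R-CNN (older versions use `pretrained=True`), and the 0.5 score threshold is an arbitrary choice for illustration.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Load a Faster R-CNN model pre-trained on the COCO dataset.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# One fake RGB image with values in [0, 1]; real code would load an actual photo.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    outputs = model([image])  # the model accepts a list of images

# Unlike a classifier, the detector returns a variable number of objects,
# each with a bounding box, a class label, and a confidence score.
for box, label, score in zip(outputs[0]["boxes"],
                             outputs[0]["labels"],
                             outputs[0]["scores"]):
    if score > 0.5:  # keep only confident detections
        print(label.item(), round(score.item(), 2), box.tolist())
```

The key difference from classification is the output: a list of localized, labeled objects rather than a single class label per image.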