Consider this image of a dog. A single region in this image may have many different patterns that we want to detect. Consider this region, for instance. This region has teeth, some whiskers, and a tongue. In that case, to understand this image, we need filters for detecting all three of these characteristics, one each for the teeth, the whiskers, and the tongue.

Recall the case of a single convolutional filter. Adding another filter is probably exactly what you'd expect: we just populate an additional collection of nodes in the convolutional layer. This collection has its own shared set of weights that differ from the weights for the blue nodes above them. In fact, it's common to have tens to hundreds of these collections in a convolutional layer, each corresponding to its own filter.

Let's now execute some code to see what these collections look like. After all, each is formatted in the same way as an image, namely as a matrix of values. We'll be visualizing the output of a Jupyter notebook. If you like, you can follow along with the link below. So, say we're working with an image of Udacity's self-driving car as input. Let's use four filters, each four pixels high and four pixels wide. Recall that each filter will be convolved across the height and width of the image to produce an entire collection of nodes in the convolutional layer. In this case, since we have four filters, we'll have four collections of nodes. In practice, we refer to each of these four collections as either a feature map or an activation map.

When we visualize these feature maps, we see that they look like filtered images. That is, we've taken all of the complicated, dense information in the original image and, in each of these four cases, output a much simpler image with less information. By peeking at the structure of the filters, you can see that the first two filters discover vertical edges, while the last two detect horizontal edges in the image. Remember that lighter values in the feature map mean that the pattern in the filter was detected in the image. So, can you match the lighter regions in each feature map with their corresponding areas in the original image? In this activation map, for instance, we can see a clear white line defining the right edge of the car. This is because all of the corresponding regions in the car image closely resemble the filter, where we have a vertical line of dark pixels to the left of a vertical line of lighter pixels. If you think about it, you'll notice that edges in images appear as a line of lighter pixels next to a line of darker pixels. This image, for instance, contains many regions that would be discovered or detected by one of the four filters we defined before. Filters that function as edge detectors are very important in CNNs, and we'll revisit them later.

So, now we know how to understand convolutional layers that have a grayscale image as input. But what about color images? Well, we've seen that grayscale images are interpreted by the computer as a 2D array with a width and a height. Color images are interpreted by the computer as a 3D array with a width, a height, and a depth. In the case of RGB images, the depth is three. This 3D array is best conceptualized as a stack of three two-dimensional matrices, where we have matrices corresponding to the red, green, and blue channels of the image. So, how do we perform a convolution on a color image? As was the case with grayscale images, we still move a filter horizontally and vertically across the image.
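The notebook itself isn't reproduced here, but a minimal sketch of this kind of visualization, assuming NumPy, SciPy, and Matplotlib and a placeholder grayscale image file (not the actual notebook asset), could look like the following. The first two filters respond to vertical edges and the last two to horizontal edges, matching the description above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import correlate2d  # sliding weighted sum, as in a conv layer

# Placeholder input: load an image and collapse it to a grayscale 2D array.
img = plt.imread('car.png')            # hypothetical filename
if img.ndim == 3:
    img = img[..., :3].mean(axis=2)

# Four 4x4 filters: two vertical-edge detectors, two horizontal-edge detectors.
filter_1 = np.array([[-1, -1, 1, 1]] * 4)   # dark-to-light, left to right
filter_2 = -filter_1                         # light-to-dark, left to right
filter_3 = filter_1.T                        # dark-to-light, top to bottom
filter_4 = -filter_3                         # light-to-dark, top to bottom
filters = [filter_1, filter_2, filter_3, filter_4]

# Slide each filter across the height and width of the image to produce
# one feature map (collection of nodes) per filter.
feature_maps = [correlate2d(img, f, mode='valid') for f in filters]

# Visualize: lighter regions mark where the filter's pattern was detected.
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for i, (ax, fmap) in enumerate(zip(axes, feature_maps), start=1):
    ax.imshow(fmap, cmap='gray')
    ax.set_title(f'feature map {i}')
    ax.axis('off')
plt.show()
```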
Only now the filter is itself three-dimensional, so that it has a value for each color channel at each horizontal and vertical location in the image array. Just as we think of the color image as a stack of three two-dimensional matrices, you can also think of the filter as a stack of three two-dimensional matrices. Both the color image and the filter have red, green, and blue channels. Now, to obtain the values of the nodes in the feature map corresponding to this filter, we do pretty much the same thing we did before. Only now our sum is over three times as many terms. We emphasize that here we've depicted the calculation of the value of a single node in a convolutional layer, for one filter on a color image. If we wanted to picture the case of a color image with multiple filters, instead of having a single 3D array corresponding to one filter, we would define multiple 3D arrays, each defining a filter. Here we've depicted three filters. Each is a 3D array that you can think of as a stack of three 2D arrays.

Here's where it starts to get really cool. You can think about each of the feature maps in a convolutional layer along the same lines as an image channel and stack them to get a 3D array. Then, we can use this 3D array as input to still another convolutional layer to discover patterns within the patterns that we discovered in the first convolutional layer. We can then do this again to discover patterns within patterns within patterns.

Remember that in some sense, convolutional layers aren't too different from the dense layers that you saw in the previous section. Dense layers are fully connected, meaning that their nodes are connected to every node in the previous layer. Convolutional layers are locally connected, where their nodes are connected to only a small subset of the previous layer's nodes. Convolutional layers also have this added parameter sharing, but in both cases, with convolutional and dense layers, inference works the same way. Both have weights and biases that are initially randomly generated. So, in the case of CNNs, where the weights take the form of convolutional filters, those filters are randomly generated, and so are the patterns that they're initially designed to detect.

As with MLPs, when we construct a CNN, we will always specify a loss function. In the case of multiclass classification, this will be categorical cross-entropy loss. Then, as we train the model through backpropagation, the filters are updated at each epoch to take on values that minimize the loss function. In other words, the CNN determines what kind of patterns it needs to detect based on the loss function. We'll visualize these patterns later and see, for instance, that if our dataset contains dogs, the CNN is able to learn, on its own, filters that look like dogs. So, to emphasize: with CNNs, we won't specify the values of the filters or tell the CNN what kind of patterns it needs to detect. These will be learned from the data.
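As a small worked illustration of the node calculation described above (not the course's code, with toy shapes and random values), the value of a single node for one 3D filter on a color image is the same weighted sum as in the grayscale case, just running over all three channels:

```python
import numpy as np

np.random.seed(0)

image = np.random.rand(32, 32, 3)        # toy RGB image: height x width x depth
conv_filter = np.random.randn(4, 4, 3)   # one 4x4 filter with a weight per color channel
bias = 0.1                               # a single bias term for this filter

# The 4x4x3 region of the image that the filter currently covers
# (here, the top-left corner).
region = image[0:4, 0:4, :]

# Same weighted sum as before, only now over 4 * 4 * 3 = 48 terms
# instead of 16.
node_value = np.sum(region * conv_filter) + bias
print(node_value)
```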
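To see how these pieces fit together in code, here is a rough sketch of a small CNN, assuming Keras, 32x32 RGB inputs, and a 10-class problem (none of which are specified in this lesson). The filter values start out random and are learned by minimizing the loss:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

# Hypothetical model, for illustration only.
model = Sequential([
    # 16 filters, each 4x4x3; their weights are initialized randomly.
    Conv2D(16, kernel_size=4, activation='relu', input_shape=(32, 32, 3)),
    # The 16 feature maps above are stacked into a 3D array and fed to a
    # second convolutional layer, which finds patterns within those patterns.
    Conv2D(32, kernel_size=4, activation='relu'),
    Flatten(),
    Dense(10, activation='softmax')      # assumed 10 output classes
])

# Categorical cross-entropy is the loss for multiclass classification;
# training by backpropagation updates the filters each epoch to minimize it.
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()
```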