So, I’ve just described a computer vision pipeline that takes in a sequence of images and, through a series of steps, recognizes different facial expressions and emotions. But it still seems kind of mysterious. Can you talk a bit about how exactly a model like this is trained to recognize different facial expressions?

Sure. The process is similar to the pipeline you just described. We have 45 facial muscles that drive thousands of different expressions on our faces. But let’s take a specific example. Let’s say we are training an algorithm to discriminate between a smile and a smirk. We collect tens of thousands of examples of people smiling – the more diverse, the better – and then tens of thousands of examples of people smirking. We feed those prerecorded images, along with their labels, to the system. The algorithm then looks for visual differences between the two expressions. For instance, when you smile, your teeth might show, but that’s not the case with a smirk. So you give the model lots of examples of smiles, smirks, and other facial expressions until it learns to recognize them.

It sounds like how a baby learns: from lots of examples.

Exactly. And similar to how humans learn, the model typically performs very badly at the beginning of the training phase, but it monitors the errors it makes and uses them to improve its performance each time it sees more images. After many iterations, the model converges on the right set of parameters; once the error rate becomes acceptable, that’s when we consider the model fully trained. Now, this is a very high-level view of how to train any machine learning model, and the details will vary based on the type of model you use and the training algorithm you choose. For instance, you could use a convolutional neural network trained with gradient descent, as sketched below. Next, let’s see how this computer vision pipeline works in a real-time application.
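To make the training loop above a bit more concrete, here is a minimal sketch of the smile-versus-smirk example: a small convolutional neural network trained with gradient descent in Keras. This is not the speakers’ actual system; the image size, layer sizes, and the `images`/`labels` arrays are illustrative assumptions.

```python
# Minimal sketch of a smile-vs-smirk classifier (illustrative, not the
# system discussed in the conversation). Assumes labeled 64x64 grayscale
# face crops are available as NumPy arrays `images` and `labels`.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # outputs P(smile); smirk otherwise
])

# Gradient descent: after each batch, the optimizer nudges the parameters in
# the direction that reduces the classification error, mirroring the
# "monitor errors, then improve" loop described above.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# images: float array of shape (N, 64, 64, 1), face crops scaled to [0, 1]
# labels: array of N values, 0 = smirk, 1 = smile
# model.fit(images, labels, epochs=20, validation_split=0.2)
```

Calling `model.fit` on the labeled face crops would run the iterative training just described: each pass over the data adjusts the parameters to lower the error, and the validation error indicates when the model is accurate enough to be considered trained.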