13 – NLPND POS 12 Viterbi Algorithm V2

In this video, will develop the Viterbi algorithm in detail for this example. And if you’ve seen dynamic programming, that’s exactly what we’re doing. So here’s our trellised diagram with the two probability tables, mission and transition. Let’s draw this a bit larger with all the probabilities. Recall that on each node we have the emission probabilities, so that’s the probability that the particular part of speech generated the word above. And on the edges, we have the transition probabilities which are the probability that we get from one state to the next one. For clarity, we’ve removed the edges that have probability zero but we still consider them. Now, the idea is that for each state or dot in the diagram, we will record the maximum probability of all the paths that end in that dot and generate the words that are above. So let’s start from the left. The probability that the word Jane is generated by the path that ends at the top end node is the product of the probability of getting to that node from the start node, which is three quarters, times the probability that the node generated the word Jane which is two over nine. This product is one sixth that was recorded. Likewise, the probability of the M node is one-quarter times zero which is zero and the one on the V node is zero times zero which is zero. Now, let’s go to the next word will. There are three paths that end in the top right N node. One coming from the previous N, one from the previous M and one from the previous V. We need to look at the three and pick the one with the largest probability. The probability for the path coming from the N node is one-sixth, which is a probability so far until the word Jane, times the transition probability one over nine to get to the new N node, times the emission probably one over nine for a emitting the word will. We’ll record this product as one-sixth times one-ninth times one-ninth. Likewise, for the path that comes from the M, the probability is zero times one-quarter times one-ninth. And for the path that comes from the V, it’s zero times one times one-ninth. Since two of these are zero, the largest one is the first one which is one over 486. So we will record that number as the probability of the highest probability path that ends up that point. And this is important, of the three coming edges, we will keep the one that gave the highest value. So that is blue edge over here and will delete the other ones. If two edges give us the same value, that’s completely okay. We can break ties by picking one of them randomly. Now, we’ll continue this algorithm and feel free to sing along with it or pause the video if you’d like to do the calculations in detail. For the M node, the competing probabilities are one-sixth times one-third times three-quarters zero and zero. So the winner is the first one, which has one over 24. Again, we keep the blue edge since they gave us the highest value. For the next one, it is zero since the three of them are zero. Here we can keep pretty much any edge since they all give us the same value. Since it’s zero, in order to prevent cluttering, let’s actually not keep any of the edges. We’ll continue doing this in the future when we get all zeroes. And we continue over each word always calculating the three probabilities coming from before and taking the largest one and remembering which edge gave us the largest value. Keeping that one and removing the other ones. Now, we have a probabilities and also our edges that get us to those probabilities. So how do we get our optimal path? Well, we’ll start from the end and trace backwards. Since each state only has one incoming edge, then this should give us a path that comes all the way from the beginning. And that path is the winning path. This gives us the chain of states that generates the observations with the highest probability and that’s how we conclude that the most likely tags for the sentence are noun, modal, verb, and noun. And these are correct so I hit a mark of our modal did a great job. And as you notice, we didn’t need to check three to the four paths. We only needed that label around the three times four nodes which is much faster.