14 – Sample a Word

Now, we’ll do the same thing for topics and words. Let’s say for the sake of visualization that we only have four words: space, climate, vote, and rule. Now, we have a different Dirichlet distribution, beta. This one is similar to the previous one, but it is three-dimensional: instead of sitting on a triangle, it sits on a tetrahedron, which is the three-dimensional simplex with one corner per word. Again, the red parts are high probability and the blue parts are low probability. If we had more words, we would still have a Dirichlet distribution, except it would live on a much higher-dimensional simplex. This is why we’ve picked four words: we can visualize that simplex in 3D.

So, in this distribution, beta, we pick a random point, and it will very likely be close to a corner or an edge. Let’s say it’s here. This point generates the following multinomial distribution: 0.4 for space, 0.4 for climate, 0.1 for vote, and 0.1 for rule. This multinomial distribution will be called phi, and it represents the connections between the words and the topic. Now, from this distribution, we’ll sample random words, which are 40 percent likely to be space, 40 percent climate, 10 percent vote, and 10 percent rule. The words could look like this.

Now, we do this for every topic. So topic one is, say, around here, close to space and climate; topic two is here, close to vote; and topic three is here, close to rule. Notice that we don’t know what the topics are. We just know them as topic one, two, and three. After some inspection, we can infer that topic one, being close to space and climate, must be science. Similarly, topic two, being close to vote, could be politics, and topic three, being close to rule, could be sports. But this is something we’d do at the end, once the model is built. As a final step, we join these three phi distributions together, one per topic, to obtain our other matrix in the LDA model, the one that relates topics to words.
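To make the sampling above concrete, here is a minimal sketch in Python using only the standard library. It draws one phi per topic from a symmetric Dirichlet and then samples words from the example phi given above. The concentration value of 0.5, the seed, and the helper names `sample_dirichlet` and `sample_topic_word_matrix` are assumptions made for illustration, not something stated in the lecture. The Dirichlet draw uses the standard trick of normalizing independent Gamma draws.

```python
import random

WORDS = ["space", "climate", "vote", "rule"]


def sample_dirichlet(concentration, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_topic_word_matrix(num_topics, concentration, rng):
    """Return one phi (a distribution over WORDS) per topic, stacked as rows."""
    return [sample_dirichlet(concentration, rng) for _ in range(num_topics)]


rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable

# Assumed concentration: values below 1 favor points near corners and edges.
beta = [0.5] * len(WORDS)

# Three topics, as in the example; each row sums to 1.
phi_matrix = sample_topic_word_matrix(3, beta, rng)

# The phi from the text: 0.4 space, 0.4 climate, 0.1 vote, 0.1 rule.
phi_example = [0.4, 0.4, 0.1, 0.1]
sampled_words = rng.choices(WORDS, weights=phi_example, k=10)

for topic_index, phi in enumerate(phi_matrix, start=1):
    print(f"topic {topic_index}:", [round(p, 2) for p in phi])
print("sampled words:", sampled_words)
```

Each row of `phi_matrix` plays the role of one topic’s phi, and the stacked rows correspond to the topic-to-word matrix obtained in the final step. The rows come out unlabeled, just as in the text: deciding that a row is "science" or "politics" is an inspection step done afterward.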
