9 – Dirichlet Distributions

Now, a Multinomial Distribution is simply a generalization of the binomial distribution to more than two values. For example, let’s say we have newspaper articles and three topics: science, politics, and sports. Let’s say each topic is assigned randomly to the articles, and when we look, we have three science articles, six politics articles, and … Read more
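
As a quick sketch of that example in code (using numpy, which the video itself doesn't show), drawing topic counts for ten articles from a multinomial might look like this:

```python
import numpy as np

rng = np.random.default_rng(42)

# Probabilities for the three topics: science, politics, and sports.
topic_probs = [0.3, 0.6, 0.1]

# Randomly assign a topic to each of ten articles and count the outcomes;
# one possible result is 3 science, 6 politics, 1 sports, as in the video.
counts = rng.multinomial(10, topic_probs)
print(dict(zip(["science", "politics", "sports"], counts)))
```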

8 – Beta Distributions

So, let’s go to probability distributions. Let’s say we have a coin, and we toss it twice. Let’s say we get one heads and one tails. What would we think about this coin? Well, it could be a fair coin, right? It could also be slightly biased towards either heads or tails. We don’t have … Read more
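
One way to make this concrete (a minimal sketch with scipy, not the lesson's own code) is to look at the Beta posterior after one heads and one tails:

```python
from scipy.stats import beta

# After one heads and one tails, with a uniform Beta(1, 1) prior, the
# posterior over the coin's heads-probability is Beta(2, 2).
heads, tails = 1, 1
posterior = beta(1 + heads, 1 + tails)

# The density peaks at 0.5 (a fair coin) but stays wide, so slightly
# biased coins remain quite plausible.
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"density at p={p}: {posterior.pdf(p):.3f}")
```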

7 – Solution Picking Topics

So let’s think. In the distribution on the left, we’re very likely to pick a point, say, here close to a corner or the edges. Let’s say, for example, close to politics. That means our article is 80 percent about politics, 10 percent about sports, and 10 percent about science. On the distribution in the … Read more
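
A rough numerical sketch of this (assuming numpy; the video only shows pictures) compares samples from a spiky and a flat Dirichlet:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small parameters: mass piles up near the corners and edges of the
# triangle, so a sample tends to be dominated by one topic, like the
# 80/10/10 split for politics, sports, and science.
print(rng.dirichlet([0.1, 0.1, 0.1]))

# Large parameters: mass concentrates in the middle, so samples look
# like even mixtures, roughly 33/33/33.
print(rng.dirichlet([10.0, 10.0, 10.0]))
```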

6 – Quiz Picking Topics

Let’s sidetrack a bit. Let’s say we’re at a party, and this party is in a triangular room, and these black dots are people roaming around the party. Now let’s say we place some food in one corner, some dessert in another corner, and some music in the third. So people … Read more

5 – Matrices

So, the idea for building our LDA model will be to factor our Bag of Words matrix on the left into two matrices, one indexing documents by topic and the other indexing topics by word. In this video, I’ll be more specific about what these matrices mean. Here’s how we calculate our Bag of Words … Read more
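
As an illustration of these shapes (a scikit-learn sketch on a tiny made-up corpus, not the lesson's code), the factorization looks like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "space launch orbit rocket",
    "vote election senate vote",
    "game score team win",
]

# Bag of Words matrix: documents x words.
vec = CountVectorizer()
bow = vec.fit_transform(docs)

# Factor it into (documents x topics) and (topics x words).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(bow)  # documents x topics
topic_word = lda.components_        # topics x words (unnormalized)

print(doc_topic.shape, topic_word.shape)  # (3, 2) (2, 11)
```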

4 – Matrix Multiplication

Well, let’s see. The first part has 500 times 10 parameters, which is 5,000. The second part has 10 times 1,000, which is 10,000. So, together, they have 15,000 parameters. This is much better than 500,000. This is called Latent Dirichlet Allocation, or LDA for short. LDA is an example of matrix factorization. We’ll … Read more
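
The arithmetic, spelled out as a tiny sketch:

```python
n_docs, n_topics, n_words = 500, 10, 1000

# Direct model: one parameter per document-word pair.
direct = n_docs * n_words                          # 500,000

# Factored LDA model: documents-by-topics plus topics-by-words.
factored = n_docs * n_topics + n_topics * n_words  # 5,000 + 10,000 = 15,000

print(direct, factored)  # 500000 15000
```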

3 – Latent Variables

Well, the answer is, we need one arrow for each document-word pair. Since we have 500 documents and 1,000 words, the number of parameters is the number of documents times the number of words. This is 500 times 1,000, which is 500,000. This is too many parameters to figure out. Is there any way … Read more

2 – Bag Of Words

Let’s start with a regular bag of words model. If you think about the Bag of Words model graphically, it represents the relationship between a set of document objects and a set of word objects. It’s very simple. Let’s say we’ve got an article like this one and we look at what words are in … Read more
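
A minimal sketch of the counting step (plain Python, not the lesson's code):

```python
from collections import Counter

article = "the senate held a vote and the vote passed"

# A bag of words ignores order and grammar; it only records how many
# times each word appears in the document.
bow = Counter(article.split())
print(bow)  # Counter({'the': 2, 'vote': 2, 'senate': 1, ...})
```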

16 – Outro

That’s it. Great job. In this lesson, we’ve learned topic modeling and document categorization using Latent Dirichlet Allocation. This will give us the mixture model of topics in a new document and the probabilities of these topics generating all the words. Now, in the following lab, you’ll be able to put all this into practice … Read more
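
For instance, reusing the hypothetical scikit-learn sketch from the Matrices entry above (my assumption, not the lab's actual code), the topic mixture of a new document would come out of the transform step:

```python
# Reusing the fitted `vec` and `lda` from the Matrices sketch above.
new_bow = vec.transform(["rocket launch reaches orbit"])
print(lda.transform(new_bow))  # e.g. [[0.9, 0.1]]: the topic mixture
```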

15 – Combining the Models

So, now let’s put this all together and study how to get these two matrices in the LDA model based on their respective Dirichlet distributions. The rough idea is, as we just saw, that the entries of the first matrix come from picking points in the distribution alpha. The entries of the second matrix come from … Read more
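
A minimal numpy sketch of this construction (my assumption of the setup, with made-up sizes and parameter values):

```python
import numpy as np

rng = np.random.default_rng(1)
n_docs, n_topics, n_words = 3, 3, 4

# Each row of the document-topic matrix is a point picked from the
# Dirichlet distribution with parameters alpha.
alpha = [0.7] * n_topics
doc_topic = rng.dirichlet(alpha, size=n_docs)    # shape (3, 3)

# Each row of the topic-word matrix is a point picked from the
# Dirichlet distribution with parameters beta.
beta = [0.7] * n_words
topic_word = rng.dirichlet(beta, size=n_topics)  # shape (3, 4)

# Multiplying them back recovers document-word probabilities.
print(doc_topic @ topic_word)
```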

14 – Sample A Word

Now, we’ll do the same thing for topics and words. Let’s say for the sake of visualization that we only have four words: space, climate, vote, and rule. Now, we have a different Dirichlet distribution, beta. This one is similar to the previous one, but it is three-dimensional: it’s not around a triangle but it’s … Read more
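
Here's a hedged numpy sketch of that step, with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
words = ["space", "climate", "vote", "rule"]

# With four words, the Dirichlet sits on a tetrahedron rather than a
# triangle. Small parameters keep samples near the corners, so each
# topic concentrates on a few words.
word_probs = rng.dirichlet([0.1] * 4)
print(dict(zip(words, word_probs.round(2))))

# Sampling actual words for this topic from that distribution.
print(rng.choice(words, size=5, p=word_probs))
```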

12 – Sample A Topic

Let’s start by picking some topics for our documents. We start with a Dirichlet distribution with parameters alpha, and these parameters should be small for the distribution to be spiky towards the sides, which means, if we pick a point somewhere in the distribution, it will most likely be close to a corner or at … Read more
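
A small numpy sketch of this step (the alpha values are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
topics = ["science", "politics", "sports"]

# Small alpha makes the distribution spiky towards the sides, so the
# sampled mixture for a document is dominated by one topic.
theta = rng.dirichlet([0.1, 0.1, 0.1])
print(dict(zip(topics, theta.round(2))))

# Picking a topic for one word of the document from that mixture.
print(rng.choice(topics, p=theta))
```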

10 – Latent Dirichlet Allocation

So, now let’s build our LDA model. The idea is the following: we’ll have our documents here, let’s say these three documents, and then we’ll generate some fake documents, like these three over here. The way we generate them is with the topic model, and then what we do is we compare the generated documents … Read more
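
A minimal sketch of that generative process in numpy (the vocabulary, sizes, and parameters are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
words = ["space", "climate", "vote", "rule"]
n_topics, doc_len = 3, 8

# Topic-word distributions and a per-document topic mixture, both
# drawn from Dirichlet distributions.
topic_word = rng.dirichlet([0.1] * len(words), size=n_topics)
theta = rng.dirichlet([0.1] * n_topics)

# A fake document: for each word slot, pick a topic from the mixture,
# then pick a word from that topic's word distribution.
fake_doc = [rng.choice(words, p=topic_word[rng.choice(n_topics, p=theta)])
            for _ in range(doc_len)]
print(" ".join(fake_doc))
```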

1 – Introduction

Hello, this is Luis. Welcome to the Topic Modeling section. While classification is an interesting supervised learning problem and a lot of tasks fall under that category, there’s a whole world of further unsupervised problems that I find fascinating. One of these is Topic Modeling. In this section, we’ll study a model which, given a … Read more