2 – Sentiment Analysis 1

This is the example we’ll use in this section: the IMDb Movie Reviews. We will split them into two kinds: reviews such as “what a great movie,” which we’ll classify as positive, and reviews such as “that was terrible,” which we’ll classify as negative.

From a machine learning perspective, you can think of sentiment analysis as a classification or a regression problem, depending on whether you want to predict specific emotional categories or labels, such as positive versus negative, or a real number that captures a more fine-grained sentiment value. In either case, you start with a given corpus, say of movie reviews. You process each review as an individual document and extract a set of features that represents it. This representation can be a direct document representation, such as bag of words or TF-IDF, or a sequence of word vectors combined together. Then you pick a classifier, which can be anything: a decision tree, a neural network, or whatever you prefer. The representation depends on what model you choose. For example, if you want to use an SVM to predict sentiment labels, you can use bag of words; but if you want to apply an RNN, you’ll need word vectors. Pick appropriate loss functions, such as categorical cross-entropy for classification or mean squared error for regression, to train your model.

In simpler terms, here’s what we’ll do. We’ll take a review, for example “what a great movie,” and turn the words into a one-hot encoded vector. The way we’ll create the vector is by taking all the words from “a” to “zygote” and locating the ones that appear in the review. We add a one for those entries and a zero for all the others. These are the vectors that we feed into our model. That’s it. Pretty simple, isn’t it? So the question is, what do we do with repeated words? For example, in this review, “great movie, great cast, great experience,” the word “great” appears three times.
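To make the encoding step concrete, here is a minimal sketch of one-hot encoding a review. The tiny vocabulary and the helper name `one_hot_encode` are illustrative choices, not part of any real dataset or library; a real vocabulary would cover every word in the corpus, from “a” to “zygote.”

```python
# Illustrative vocabulary; a real one would be built from the whole corpus.
vocab = sorted({"a", "cast", "experience", "great", "movie", "terrible", "what"})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_encode(review: str) -> list[int]:
    """Return a 0/1 vector: 1 if the vocabulary word appears in the review."""
    words = set(review.lower().split())
    return [1 if word in words else 0 for word in word_to_index]

vector = one_hot_encode("what a great movie")
# Entries for "what", "a", "great", "movie" are 1; all others are 0.
```

Note that because we build a `set` of the review’s words, the entry stays 1 no matter how many times a word repeats.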
If we one-hot encode, then we’ll just write a one in that entry, even if the word appears three times. We really only care whether the word “great” appears to see that the review is great. We can also take a bag-of-words approach and record the number of appearances of each word. This could make sense: if a review has the word “great” three times, it could just be greater than a review that has the word “great” only once, right? And finally, the classifier, as we said, can be anything. In the lab, we’ve picked one for you, but you can change it to any other classifier you want and explore to see which ones give you greater accuracy on this dataset. And that’s all you need to know to start a [inaudible]. Good luck.
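The bag-of-words variant is the same idea, except the vector stores counts instead of 0/1 flags. A minimal sketch, again with an illustrative vocabulary and a hypothetical `bag_of_words` helper:

```python
from collections import Counter

# Illustrative vocabulary; a real one would be built from the whole corpus.
vocab = sorted({"a", "cast", "experience", "great", "movie", "terrible", "what"})

def bag_of_words(review: str) -> list[int]:
    """Return a count vector: how often each vocabulary word appears."""
    counts = Counter(review.lower().replace(",", "").split())
    return [counts[word] for word in vocab]

vec = bag_of_words("great movie, great cast, great experience")
# The entry for "great" is 3, while one-hot encoding would record only 1.
```

Either vector can then be fed to whatever classifier you choose, e.g. a decision tree or logistic regression from a library such as scikit-learn.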
