6 – Stop Word Removal

Stop words are uninformative words such as is, our, the, in, and at that add little meaning to a sentence. They typically occur very frequently, and we may want to remove them to reduce the vocabulary we have to deal with, and hence the complexity of later procedures. Notice that even without our and the in the sentence above, we can still infer its positive sentiment toward dogs.

You can see for yourself which words NLTK considers to be stop words in English. Note that this list is based on a specific corpus, or collection of text, and different corpora may have different stop words. Also, a word may be a stop word in one application but a useful word in another.

To remove stop words from a piece of text, you can use a Python list comprehension with a filtering condition. Here, we apply stop word removal to the movie review after normalizing and tokenizing it. The result is a little hard to read, but notice how it has reduced the size of the input while retaining the important words.
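As a rough illustration of the list-comprehension approach, here is a minimal sketch. It assumes a small hand-picked stop word set (`STOP_WORDS`), a hypothetical helper `remove_stop_words`, and a made-up sample sentence, none of which come from the original lesson. In practice you would likely replace the hand-picked set with NLTK's list, `stopwords.words("english")` from `nltk.corpus`, which requires running `nltk.download("stopwords")` once beforehand.

```python
import re

# Assumption: a tiny hand-picked stop word set, for illustration only.
# NLTK's English list (nltk.corpus.stopwords.words("english")) is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "our", "in", "at", "of", "to", "and", "it"}


def remove_stop_words(text):
    """Normalize, tokenize, and drop stop words (hypothetical helper)."""
    # Normalize: lowercase, then replace anything that is not a letter or digit with a space.
    normalized = re.sub(r"[^a-z0-9]", " ", text.lower())
    # Tokenize: split on whitespace.
    tokens = normalized.split()
    # Filter: keep only the tokens that are not in the stop word set.
    return [word for word in tokens if word not in STOP_WORDS]


# Made-up sample sentence, not the movie review from the lesson.
sample = "The dogs are our best friends in the park."
filtered = remove_stop_words(sample)
print(filtered)
```

Using a set rather than a list for the stop words keeps each membership check fast, which matters once the text grows beyond a few sentences.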
