9 – Stemming And Lemmatization

In order to further simplify text data, let’s look at some ways to normalize different variations and modifications of words. Stemming is the process of reducing a word to its stem or root form. For instance, branching, branched, branches, et cetera, can all be reduced to branch. After all, they all convey the idea of something separating into multiple paths or branches. Again, this helps reduce complexity while retaining the essence of meaning that is carried by words.
Stemming is meant to be a fast and crude operation carried out by applying very simple search-and-replace style rules. For example, the suffixes ‘ing’ and ‘ed’ can be dropped off, ‘ies’ can be replaced by ‘y’, et cetera. This may result in stems that are not complete words, but that’s okay, as long as all forms of a word are reduced to the same stem, thus capturing the common underlying idea.
NLTK has a few different stemmers for you to choose from, including PorterStemmer that we use here, SnowballStemmer, and other language-specific stemmers. You simply need to pass in one word at a time. Note that here, we have already removed stop words. Some of the conversions are actually pretty good, like started, reduced to start. Others, like people losing the ‘e’ at the end, are a result of applying very simplistic rules.
Lemmatization is another technique used to reduce words to a normalized form, but in this case, the transformation actually uses a dictionary to map different variants of a word back to its root. With this approach, we are able to reduce non-trivial inflections such as is, was, were, back to the root ‘be’. The default lemmatizer in NLTK uses the WordNet database to reduce words to the root form. Let’s try it out. Just like in stemming, you initialize an instance of WordNetLemmatizer and pass in individual words to its lemmatize method. What happened here? It seems that only the word ones got reduced to one; all the others are unchanged.
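Going back to the search-and-replace style rules described for stemming earlier in this section, a minimal sketch may help. This is not the Porter algorithm that NLTK's PorterStemmer implements. The rule table, the `toy_stem` function and the `MIN_STEM_LENGTH` cutoff are all hypothetical names invented for illustration, and the rules are just the handful of suffixes mentioned above plus plain plurals.

```python
# Toy suffix-stripping stemmer: a hypothetical illustration, NOT the Porter algorithm.
# Rules are tried in order; the first suffix that matches is rewritten.
SUFFIX_RULES = [
    ("ies", "y"),  # "stories"   -> "story"
    ("ing", ""),   # "branching" -> "branch"
    ("ed", ""),    # "branched"  -> "branch"
    ("es", ""),    # "branches"  -> "branch"
    ("s", ""),     # "paths"     -> "path"
]

# Assumption: leave a word alone if stripping would leave fewer than 3 letters
# (so that a short word such as "red" is not cut down to "r").
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Apply the first matching suffix rule and return the resulting stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)] + replacement
    return word


words = ["branching", "branched", "branches", "reduced", "reducing", "people"]
stems = [toy_stem(w) for w in words]
print(stems)
# ['branch', 'branch', 'branch', 'reduc', 'reduc', 'people']
```

Note that ‘reduc’ is not a complete word, but both reduced and reducing map to it, which is all a stemmer needs. A real stemmer applies many more rules and conditions than this sketch does.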
If you read the words carefully, you’ll see that ones is the only plural noun here. In fact, that’s exactly why it got transformed. A lemmatizer needs to know, or make an assumption about, the part of speech for each word it’s trying to transform. In this case, WordNetLemmatizer defaults to nouns, but we can override that by specifying the pos parameter. Let’s pass in ‘v’ for verbs. This time, the two verb forms ‘boring’ and ‘started’ got converted. Great. Note that there are other verbs, but they are already in the root form. Also, note how we passed in the output from the previous noun lemmatization step. This way of chaining procedures is very common.
Let’s recap. As we saw in the previous examples, stemming sometimes results in stems that are not complete words in English. Lemmatization is similar to stemming, with one difference: the final form is also a meaningful word. That said, stemming does not need a dictionary like lemmatization does. So, depending on the constraints you have, stemming may be a less memory-intensive option for you to consider.
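To make the part-of-speech default and the chaining of the noun and verb passes concrete, here is a small dictionary-lookup sketch. It does not use WordNet. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins, and the word list is made up for illustration. In NLTK itself the corresponding call would be `WordNetLemmatizer().lemmatize(word, pos='v')`, which requires the WordNet corpus to be downloaded.

```python
# Hypothetical miniature lemma dictionary, keyed by (word, part of speech).
# A real lemmatizer such as NLTK's WordNetLemmatizer consults WordNet instead.
TOY_LEMMAS = {
    ("ones", "n"): "one",
    ("branches", "n"): "branch",
    ("started", "v"): "start",
    ("boring", "v"): "bore",
    ("is", "v"): "be",
    ("was", "v"): "be",
    ("were", "v"): "be",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged if there is no entry.

    The part of speech defaults to noun, mirroring the default described above.
    """
    return TOY_LEMMAS.get((word.lower(), pos), word)


words = ["ones", "boring", "started", "people", "was"]

# First pass: treat every word as a noun (the default).
as_nouns = [toy_lemmatize(w) for w in words]
print(as_nouns)
# ['one', 'boring', 'started', 'people', 'was']

# Second pass: feed the noun output back in, this time treating words as verbs.
as_verbs = [toy_lemmatize(w, pos="v") for w in as_nouns]
print(as_verbs)
# ['one', 'bore', 'start', 'people', 'be']
```

Unlike a stemmer, every output here is a complete word, but only because the dictionary contains an entry for it. That dictionary is the memory cost mentioned in the recap.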
