10 – Summary

We have covered a number of text processing steps. Let’s summarize what a typical workflow looks like. Starting with a plain text sentence, you first normalize it by converting to lowercase and removing punctuation, and then you split it up into words using a tokenizer. Next, you can remove stop words to reduce the vocabulary you have to deal with. Depending on your application, you may then choose to apply a combination of stemming and lemmatization to reduce words to the root or stem form. It is common to apply both, lemmatization first, and then stemming. This procedure converts a natural language sentence into a sequence of normalized tokens which you can use for further analysis.

%d 블로거가 이것을 좋아합니다: