4 – Normalization

Plain text is great, but it’s still human language with all its variations and bells and whistles. Next, we’ll try to reduce some of that complexity.

In the English language, the starting letter of the first word in any sentence is usually capitalized. All caps are sometimes used for emphasis and for stylistic reasons. While this is convenient for a human reader, from the standpoint of a machine learning algorithm it does not make sense to differentiate between Car, car, and CAR; they all mean the same thing. Therefore, we usually convert every letter in our text to a common case, usually lowercase, so that each word is represented by a unique token.

Here’s some sample text: a review for the movie The Second Renaissance, a story about intelligent robots that get into a fight with humans over their rights. Yup, the way we treat robots these days. Anyway, if we have the review stored in a variable called text, converting it to lowercase is a simple call to the lower() method in Python. Here’s what it looks like after the conversion. Note all the letters that were changed. Other languages may or may not have a case equivalent, but similar principles may apply.

Depending on your NLP task, you may want to remove special characters like periods, question marks, and exclamation points from the text and only keep letters of the alphabet and maybe numbers. This is especially useful when we are looking at text documents as a whole, in applications like document classification and clustering, where the low-level details do not matter a lot. Here, we can use a regular expression that matches everything that is not a lowercase a to z, an uppercase A to Z, or a digit 0 to 9, and replaces each match with a space. This approach avoids having to specify all punctuation characters, but you can use other regular expressions as well.

Lowercase conversion and punctuation removal are the two most common text normalization steps. Whether you need to apply them, and at what stage, depends on your end goal and the way you design your pipeline.
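
The two normalization steps described above can be sketched in Python roughly as follows. This is only a minimal illustration: the sample sentence is made up for the example and is not the actual review from the lesson, and the variable names lowered and normalized are assumptions. The regular expression is the one described in the text, matching anything that is not a letter or a digit.

```python
import re

# Hypothetical sample text (not the review shown in the lesson).
text = "The Second Renaissance is AMAZING! Robots vs. humans... who wins?"

# Step 1: convert every letter to lowercase, so that Car, car, and CAR
# all end up as the same token.
lowered = text.lower()

# Step 2: replace every character that is not a-z, A-Z, or 0-9 with a space.
# Existing spaces also match the pattern and are simply replaced by spaces.
normalized = re.sub(r"[^a-zA-Z0-9]", " ", lowered)

print(lowered)
print(normalized)
```

Replacing punctuation with a space, rather than deleting it, keeps words from running together (for example, "humans...who" does not become "humanswho"). It can leave runs of several spaces, which a later tokenization step such as normalized.split() would typically collapse.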
