5 – Tokenization

A token is a fancy term for a symbol, usually one that holds some meaning and is not typically split up any further. In natural language processing, our tokens are usually individual words, so tokenization is simply splitting each sentence into a sequence of words. The simplest way to do this is with the split method, which returns a list of words. Note that it splits on whitespace characters by default, which includes regular spaces but also tabs, new lines, et cetera. It’s also smart about ignoring two or more whitespace characters in a sequence, so it doesn’t return blank strings. You can control all this using the optional sep and maxsplit parameters.

So far, we’ve only been using Python’s built-in functionality, but some of these operations are much easier to perform using a library like NLTK, which stands for Natural Language Toolkit. The most common approach for splitting up text in NLTK is to use the word_tokenize function from nltk.tokenize. This performs the same task as split but is a little smarter. Try passing in some raw text that has not been normalized. You’ll notice that punctuation marks are treated differently based on their position. Here, the period after the title Dr has been retained, so Dr. is kept as a single token. As you can imagine, NLTK is using some rules or patterns to decide what to do with each punctuation mark.

Sometimes you may need to split text into sentences, for instance if you want to translate it. You can achieve this with NLTK using sent_tokenize. Then you can split each sentence into words if needed. NLTK provides several other tokenizers, including a regular-expression-based tokenizer that you can use to remove punctuation and perform tokenization in a single step, and a tweet tokenizer that is aware of Twitter handles, hashtags, and emoticons. Check out the nltk.tokenize package for more details.
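To make the whitespace behaviour of split concrete, here is a minimal sketch that uses only Python’s standard library. The sample sentence is made up for illustration, and the re.findall call is a stand-in for the idea of a regular-expression-based tokenizer, not NLTK’s own implementation.

```python
import re

# Made-up sample text with a double space, a tab, and a newline in it.
text = "Dr. Smith  graduated\tfrom the University of Washington.\nHe later started an analytics firm."

# With no arguments, split() breaks on any run of whitespace
# and never returns empty strings.
words = text.split()
print(words)

# With an explicit separator, consecutive separators are NOT merged,
# so an empty string shows up between the two spaces.
explicit = "a  b".split(" ")
print(explicit)

# maxsplit limits how many splits are performed.
limited = "a b c d".split(maxsplit=1)
print(limited)

# Standard-library stand-in for regex-based tokenization:
# keep only runs of word characters, which drops punctuation in the same step.
tokens = re.findall(r"\w+", text)
print(tokens)
```

Notice that plain split leaves punctuation attached to the neighbouring word, as in "Washington.", while the regex version strips it everywhere, including the period in "Dr.". Assuming NLTK and its tokenizer data are installed, word_tokenize from nltk.tokenize could be called on the same text to compare how it handles these cases.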
