9 – Text Processing

Let’s take a closer look at text processing. The first question that comes to mind is, why do we need to process text? Why can we not feed it in directly? To understand that, think about where we get this text to begin with.

Websites are a common source of textual information. Here’s a portion of a sample web page from Wikipedia and the corresponding HTML markup, which serves as our raw input. For the purpose of natural language processing, you would typically want to get rid of all or most of the HTML tags, and retain only plain text. You can also remove or set aside any URLs or other items not relevant to your task.

The Web is probably the most common and fastest-growing source of textual content. But you may also need to consume PDFs, Word documents or other file formats. Or your raw input may even come from a speech recognition system or from a book scan using OCR. Some knowledge of the source medium can help you properly handle the input. In the end, your goal is to extract plain text that is free of any source-specific markers or constructs that are not relevant to your task.

Once you have obtained plain text, some further processing may be necessary. For instance, capitalization doesn’t usually change the meaning of a word. We can convert all the words to the same case so that they’re not treated differently. Punctuation marks that we use to indicate pauses, etc. can also be removed. Some common words in a language often help provide structure, but don’t add much meaning. For example, a, and, the, of, are, and so on. Sometimes it’s best to remove them if that helps reduce the complexity of the procedures you want to apply later.
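
The steps described above (strip the HTML markup, drop URLs, convert to a single case, remove punctuation, remove common words) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a definitive pipeline: the names `TextExtractor`, `html_to_text`, `preprocess` and the tiny `STOP_WORDS` set are assumptions made up for this example, and the punctuation step assumes plain English ASCII text.

```python
import re
from html.parser import HTMLParser

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "are", "is", "in", "to"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment, with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def preprocess(html):
    """Turn raw HTML into a list of lowercase tokens without stop words."""
    text = html_to_text(html)
    text = re.sub(r"https?://\S+", " ", text)   # set aside URLs
    text = text.lower()                         # same case for all words
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # punctuation -> space (ASCII only)
    tokens = text.split()                       # split on whitespace
    return [t for t in tokens if t not in STOP_WORDS]


sample = (
    '<p>The <b>Web</b> is a source of text. '
    'See <a href="https://example.org">https://example.org</a>!</p>'
)
print(preprocess(sample))  # ['web', 'source', 'text', 'see']
```

Whether to remove stop words or punctuation at all depends on the task, so in practice each of these steps would likely be made optional rather than hard-coded as it is here.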
