9 – Stemming and Lemmatization

In order to further simplify text data, let’s look at some ways to normalize different variations and modifications of words. Stemming is the process of reducing a word to its stem or root form. For instance, branching, branched, branches, et cetera can all be reduced to branch. After all, they all convey the idea of something …
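
Below is a minimal sketch of both operations using NLTK; the sample words are illustrative, and the download call is only needed on first use.

```python
# A minimal sketch of stemming vs. lemmatization with NLTK.
# nltk.download('wordnet') may be required before lemmatizing.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["branching", "branched", "branches"]
print([stemmer.stem(w) for w in words])                   # ['branch', 'branch', 'branch']
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # lemmatize as verbs
```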

8 – Named Entity Recognition

Named entities are typically noun phrases that refer to some specific object, person, or place. You can use the ne_chunk function to label named entities in text. Note that you first have to tokenize the text and tag parts of speech. This is a very simple example, but notice how the different entity types are also recognized: …
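
A minimal sketch of that call chain follows; the sentence is illustrative, and the listed downloads may be required on first use.

```python
# A minimal sketch of named entity recognition with NLTK's ne_chunk.
# May require: nltk.download(['punkt', 'averaged_perceptron_tagger',
#                             'maxent_ne_chunker', 'words'])
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Antonio joined Udacity Inc. in California."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)  # entities appear as labeled subtrees, e.g. (PERSON Antonio/NNP)
```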

7 – Part-of-Speech Tagging

Remember parts of speech from school? Nouns, pronouns, verbs, adverbs, et cetera. Identifying how words are being used in a sentence can help us better understand what is being said. It can also point out relationships between words and recognize cross-references. NLTK, again, makes things pretty easy for us. You can pass in tokens …
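
A minimal sketch with an illustrative sentence (the tagger model may need a one-time download):

```python
# A minimal sketch of part-of-speech tagging with NLTK.
# May require: nltk.download(['punkt', 'averaged_perceptron_tagger'])
from nltk import word_tokenize, pos_tag

tokens = word_tokenize("I always lie down to tell a lie.")
print(pos_tag(tokens))
# e.g. [('I', 'PRP'), ('always', 'RB'), ('lie', 'VBP'), ('down', 'RP'), ...]
```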

6 – Stop Word Removal

Stop words are uninformative words, like is, our, the, in, and at, that do not add much meaning to a sentence. They are typically very common words, and we may want to remove them to reduce the vocabulary we have to deal with, and hence the complexity of later procedures. Notice …
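
A minimal sketch using NLTK’s English stop word list; the sentence is illustrative.

```python
# A minimal sketch of stop word removal with NLTK's stop word corpus.
# May require: nltk.download(['punkt', 'stopwords'])
from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
words = word_tokenize("the first time you see the second renaissance it may look boring")
print([w for w in words if w not in stop_words])  # common words filtered out
```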

5 – Tokenization

A token is a fancy term for a symbol, usually one that holds some meaning and is not typically split up any further. In the case of natural language processing, our tokens are usually individual words, so tokenization is simply splitting each sentence into a sequence of words. The simplest way to do this is using the …
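
A minimal sketch contrasting a plain string split with NLTK’s word_tokenize, which also separates punctuation; the sentence is illustrative.

```python
# A minimal sketch: naive whitespace splitting vs. NLTK's word_tokenize.
# May require: nltk.download('punkt')
from nltk import word_tokenize

text = "Dr. Smith arrived late. However, the talk went well."
print(text.split())         # splits on whitespace only; punctuation sticks to words
print(word_tokenize(text))  # separates punctuation such as '.' and ',' into tokens
```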

4 – Normalization

Plain text is great, but it’s still human language, with all its variations and bells and whistles. Next, we’ll try to reduce some of that complexity. In English, the first letter of the first word in a sentence is usually capitalized, and all caps are sometimes used for emphasis or for stylistic reasons. While …
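
A minimal sketch of one common approach, assuming case folding plus a regex that replaces everything other than letters and digits with spaces:

```python
# A minimal sketch of normalization: lowercase everything, then strip punctuation.
import re

text = "The first time you see The Second Renaissance, it may look BORING!"
text = text.lower()                        # normalize case
text = re.sub(r"[^a-zA-Z0-9]", " ", text)  # replace punctuation with spaces
print(text)
```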

3 – Cleaning

Text data, especially from online sources, is almost never clean. Let’s look at the Udacity course catalog as an example. Say you want to extract the title and description of each course or Nanodegree. Sounds simple, right? Let’s jump into Python and give it a shot. You can follow along by downloading and launching the …
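
A hedged sketch of the general approach with requests and BeautifulSoup; the URL and page structure here are assumptions, so the real catalog page would need inspecting first.

```python
# A hedged sketch of fetching a page and stripping its HTML.
# The URL is an assumption, not necessarily the page used in the lesson.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.udacity.com/courses/all")  # assumed URL
soup = BeautifulSoup(response.text, "html.parser")
print(soup.get_text())  # visible text with the HTML tags removed
```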

2 – Capturing Text Data

The processing stage begins with reading text data. Depending on your application, that can come from one of several sources. The simplest source is a plain text file on your local machine. We can read it in using Python’s built-in file input mechanism. Text data may also be included as part of a larger …
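
A minimal sketch of the plain-text case; the filename is hypothetical.

```python
# A minimal sketch of reading a local plain text file with built-in open().
with open("sample_text.txt", "r") as f:  # hypothetical filename
    text = f.read()
print(text)
```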

10 – Summary

We have covered a number of text processing steps. Let’s summarize what a typical workflow looks like. Starting with a plain text sentence, you first normalize it by converting it to lowercase and removing punctuation, and then you split it up into words using a tokenizer. Next, you can remove stop words to reduce the vocabulary …
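
A hedged end-to-end sketch of that workflow, combining the steps from the earlier sections; the sentence is illustrative.

```python
# A hedged sketch of the full workflow: normalize -> tokenize ->
# remove stop words -> stem.
# May require: nltk.download(['punkt', 'stopwords'])
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The first time you see The Second Renaissance it may look boring."
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # normalize
tokens = word_tokenize(text)                       # tokenize
tokens = [t for t in tokens if t not in stopwords.words("english")]  # stop words
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])           # stem
```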

1 – Text Processing

In this lesson, you’ll learn how to read text data from different sources and prepare it for feature extraction. You’ll begin by cleaning it to remove irrelevant items, such as HTML tags. You will then normalize the text by converting it to all lowercase and removing punctuation and extra spaces. Next, you will split the text into …