3 – Cleaning

Text data, especially from online sources, is almost never clean. Let’s look at the Udacity course catalog as an example. Say you want to extract the title and description of each course or Nanodegree. Sounds simple, right? Let’s jump into Python and give it a shot. You can follow along by downloading and launching the text processing notebook. We can fetch the web page like any other online resource using the requests library. It looks like we got back the entire HTML source. This is what the browser needs to render the web page, but most of it is useless for our purposes. We need a way to extract only the plain text that is visible on the website.

How about using regular expressions? Let’s define a pattern to match all HTML tags and remove them by replacing each match with a blank string. Okay, that did something. We can see that the page title has been extracted successfully, but there is a lot of JavaScript and a number of other items that we don’t need. In fact, this regular expression somehow didn’t match some tags. Maybe they were nested inside other tags, or maybe we need to account for tags spread across lines. Either way, this doesn’t seem like the best approach for the job. What we really need is a way to parse the HTML, just like a web browser does, and pull out the relevant elements.

Introducing BeautifulSoup, a nice Python library meant to do exactly that. You pass in the raw web page text, which in this case contains HTML, to create a soup object, and then you can extract the plain text, leaving behind any HTML tags, with a call to the get_text method. This takes care of nested tags, tags that are broken across lines, and a multitude of other edge cases that make HTML parsing a pain. It also forgives small errors in HTML, just like browsers do, making it more robust. Let’s see. That’s better. I don’t see any HTML tags, but there is still a bunch of JavaScript and a lot of extra whitespace. What else can we do?

Let’s take a look at how the HTML source is structured. The easiest way to do this is to right-click on an element of your choice, here, this course title, and choose Inspect or View Page Source. Now look at where the title is placed and what is the most distinct way of finding it in the HTML document. Here, we have a parent div with a class of course-summary-card. That sounds promising. Let’s use it. BeautifulSoup is actually very powerful. It enables you to walk the tree, or DOM, in many different ways. Here we are asking the library to find all divs with a class of course-summary-card. The result returned is a list of all such divs in the document. Let’s store this in a variable and look at one of the divs.

Okay. Scrolling through this, I see that the title is stored in this a tag, which is contained in this h3 tag. How do we extract the title? One way to get to it is using a CSS selector, and then we can fetch the plain text content just like we did before. Great. One last thing. Let’s strip the extra whitespace from both ends. There you go.

Now, let’s look back at the HTML to see how we can grab the description text. There it is. It’s a div with an attribute called data-course-short-summary, but no value or any other attributes. Again, there is a way to select such tags using CSS: specify the tag name, here div, followed by the attribute name in square brackets. Looks good. Let’s extract the text and clean it up. All right. We can now repeat this over all course summaries using a simple for loop. Looks spot on to me. Code sketches for each of these steps follow below.
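First, a minimal sketch of the fetch-and-regex attempt. The catalog URL and the exact pattern are my assumptions; the course notebook may differ in detail.

    import re
    import requests

    # Fetch the course catalog page (URL assumed for illustration)
    response = requests.get("https://www.udacity.com/courses/all")
    html = response.text
    print(html[:500])  # the raw HTML source, as the browser would receive it

    # Naive attempt: strip anything that looks like an HTML tag
    pattern = re.compile(r"<.*?>")  # note: "." does not cross newlines by default,
                                    # so tags spread across lines survive this pass
    text = pattern.sub("", html)
    print(text[:500])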
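Next, a sketch of the BeautifulSoup version. I’m using the standard-library html.parser here; the lesson may well use a different parser such as lxml or html5lib.

    from bs4 import BeautifulSoup

    # Parse the raw HTML; the parser handles nested tags, tags broken
    # across lines, and small HTML errors for us
    soup = BeautifulSoup(html, "html.parser")

    # get_text() drops all tags and returns the remaining text
    # (script contents still come along, which is why JavaScript shows up)
    text = soup.get_text()
    print(text[:500])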
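Finally, a sketch of pulling out the course summaries, titles, and descriptions. The class name course-summary-card and the selectors below follow the page structure as described in this walkthrough, which may have changed since it was recorded.

    # Find all divs with a class of "course-summary-card"
    summaries = soup.find_all("div", class_="course-summary-card")
    print(len(summaries), "course summaries found")

    courses = []
    for summary in summaries:
        # Title: the <a> tag contained in the <h3> tag of each card
        title = summary.select_one("h3 a").get_text().strip()
        # Description: the div carrying the bare data-course-short-summary attribute
        description = summary.select_one("div[data-course-short-summary]").get_text().strip()
        courses.append((title, description))

    print(courses[0])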
Let’s store this data so we can use it later. Here, we are simply keeping the data in a list called courses.

What we did just now is called scraping a web page. Although it sounds a little violent, trust me, it’s not. In fact, scraping is very common. Google News is a prime example. It pulls out the title and the first sentence or two from news articles and displays them. Google probably uses a combination of rules and machine learning to identify which portion of the HTML contains the title and the beginning of the article text that it can use as a preview. It works great most of the time, but sometimes it does fail. Here, for this article on quantum entanglement, the preview doesn’t seem to match the title at all. It looks more like a caption for this image. What likely happened is that the caption was the first piece of text on the web page, and Google’s algorithm picked it up as if it were part of the main article. This just goes to show that seemingly routine tasks in text processing are still not solved all the way.

Okay, let’s look back at what we just achieved. We started by fetching a single web page, the Udacity course catalog. Then we tried a couple of methods to remove HTML tags. We finally settled on using BeautifulSoup to parse the entire HTML source, find all course summaries, and extract the title and description for each course. Then we saved them all in a list. Depending on what you’re planning to do next, you may continue to treat these chunks as part of a single document or consider each to be a separate document. The latter is useful, for instance, if you want to group related courses. The problem then reduces to document clustering.
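To make that single-document versus separate-documents choice concrete, here is one way to build either representation from the courses list above; the simple space-joining scheme is just one reasonable option, not something prescribed by the lesson.

    # Option 1: treat everything as one big document
    single_doc = " ".join(title + " " + description for title, description in courses)

    # Option 2: one document per course, e.g. as input to document clustering
    docs = [title + " " + description for title, description in courses]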
