14 – AIT M5L4B 06 Introduction To Beautifulsoup V3

In the previous lessons, you learned how to create regular expressions and use them to find specific patterns of text in documents. In some cases, however, the text you want to analyze may already be formatted as a website rather than as a plain text document. In principle, you could save the HTML contents of the website into a document first and then use regexes to parse out the HTML data. But in general, this is a very difficult thing to do, especially if you're starting from scratch.

Luckily, Python already has a library that allows you to pull data directly from websites. This library is called Beautiful Soup, and it can be used to extract the information you need from a particular section or sections of a website. Beautiful Soup is particularly useful when the original document is formatted as HTML or XML. For example, suppose we had a website that looks like this. Let's assume the information we wanted was in this section. With the aid of the requests library, we can connect to this website and get the HTML data. Once we have the HTML data, we can use Beautiful Soup to extract only the text in this section and save it to a file.

While Beautiful Soup is very powerful and can be used to extract data directly from 10-Ks that are in HTML or XML format, there are a couple of problems you might face in doing so. The first problem is that Beautiful Soup works best when you have perfectly formatted HTML. If your HTML document has missing information or mistakes, Beautiful Soup may return the wrong text. The second problem is that you might not find the 10-Ks in HTML or XML format. A lot of older 10-Ks can only be found in text format, so you won't be able to use Beautiful Soup on these older 10-Ks.

In the next lesson, we will learn how to use Beautiful Soup to extract data from different sections of a website. Let's get started.
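The workflow described above (get the HTML with requests, then use Beautiful Soup to pull the text out of one section and save it) might look roughly like the sketch below. The function names, the sample HTML, and the `risk-factors` id are all hypothetical and chosen only for illustration; a real page would need its own selector, found by inspecting its HTML.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML as text (not called in this demo)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_section_text(html: str, section_id: str) -> str:
    """Return the text of the element whose id attribute equals section_id."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find(id=section_id)
    if section is None:
        raise ValueError(f"no element with id={section_id!r}")
    # get_text() collects the text of all nested tags; strip=True trims each piece
    return section.get_text(separator=" ", strip=True)


def save_text(text: str, path: str) -> None:
    """Write the extracted text to a file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Hypothetical page content, standing in for what fetch_html(url) would return.
SAMPLE_HTML = """
<html><body>
  <div id="intro"><p>Welcome to the site.</p></div>
  <div id="risk-factors"><h2>Risk Factors</h2><p>Markets may decline.</p></div>
</body></html>
"""

section_text = extract_section_text(SAMPLE_HTML, "risk-factors")
print(section_text)
```

With a live site, you would pass the result of `fetch_html(url)` in place of `SAMPLE_HTML`, then call `save_text(section_text, "section.txt")` to write the result to disk. Looking the section up by `id` is only one option; which tag or attribute identifies the section depends entirely on how the page is written.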
