4 – M5 SC 11 Navigating The Parse Tree V1

Hello and welcome back. In this notebook, we will learn how to navigate the parse tree created by BeautifulSoup. The most straightforward way of navigating the tree is by accessing the HTML or XML tags. We can access the tags as if they were attributes of the BeautifulSoup object, as shown here.
Let’s see an example. Suppose we wanted to access the head tag of our sample HTML file. First, we create a BeautifulSoup object called page_content. Then we access the head tag as if it were an attribute of this BeautifulSoup object by using page_content.head. Whenever we access a tag in this manner, we get a Tag object. So let’s print this Tag object to see what it looks like. We can see that it has only the contents of the head tag, including all of its sub-tags. We call these sub-tags the children of the head tag.
We can access these child tags as if they were attributes of the page_head Tag object. For example, if we wanted to access this title tag, we would use page_head.title. If we run this code, we can see that we get only the contents of the title tag. Notice that this statement is equivalent to page_content.head.title, where page_content is the BeautifulSoup object and not the Tag object.
Now notice that Tag objects contain the HTML tags themselves. For example, here we see the opening and closing tags of this Tag object. In most cases, however, we do not want the tags; we only want the text contained within the tags. For example, suppose we only wanted to get the text “AI For Trading” that sits between these title tags. In these cases, we can use the get_text method that is available to Tag objects. If we run this code, we can see that now we get only the text “AI For Trading”, without any HTML tags, as we wanted.
Now, both HTML and XML tags can have attributes. For example, here we have an h1 tag that has the attribute id equal to intro.
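Before moving on to attributes, the navigation steps described so far can be sketched roughly as follows. This is only an illustration under some assumptions: the SAMPLE_HTML string is a hypothetical stand-in for the lesson's sample HTML file (its contents are not shown in this transcript), and the "html.parser" parser is chosen here simply because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lesson's sample HTML file (an assumption).
SAMPLE_HTML = """
<html>
<head>
<title>AI For Trading</title>
</head>
<body>
<h1 id="intro">Get Help From Peers and Mentors</h1>
<h2>Student Hub</h2>
<h2>Knowledge</h2>
</body>
</html>
"""

# Build the parse tree; "html.parser" is Python's built-in parser.
page_content = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Accessing a tag as an attribute returns a Tag object.
page_head = page_content.head
print(page_head)  # the head tag together with its children

# Child tags are reached the same way, starting from the Tag object.
print(page_head.title)  # <title>AI For Trading</title>

# Chaining from the BeautifulSoup object reaches the same tag.
print(page_content.head.title == page_head.title)  # True

# get_text() returns only the text, without the surrounding markup.
title_text = page_head.title.get_text()
print(title_text)  # AI For Trading
```

Keeping page_head in its own variable mirrors the narration; chaining page_content.head.title in one expression works just as well.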
BeautifulSoup allows us to get the value of a tag’s attribute by treating the tag like a dictionary. Let’s see an example. Suppose we wanted to get the value of the id attribute of this h1 tag, which in this case is equal to intro. To do this, we first access the h1 tag as we have done previously. We then get the value of the id attribute as if the Tag object were a dictionary, using square brackets. If we run this code, we can see that we get the value intro, which is indeed the value of the id attribute of the h1 tag.
Now, if we look at our sample HTML file, we can see that it has two h2 tags: this one, and this one. If we try to access these h2 tags as we did before, we can see that we get only the first tag and not the second one. This is because when we access a tag as an attribute, we get only the first tag with that name that appears in the file. In order to get all the h2 tags, we need to use the find_all method, which is the topic of our next lesson.
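The attribute lookup and the first-match behavior described above might look something like the sketch below. As before, SAMPLE_HTML is a hypothetical stand-in for the lesson's sample file, and "html.parser" is an assumed parser choice. The find_all call at the end is there only to confirm that the document really has two h2 tags; the method itself is left for the next lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the lesson's sample HTML file (an assumption).
SAMPLE_HTML = """
<html>
<head>
<title>AI For Trading</title>
</head>
<body>
<h1 id="intro">Get Help From Peers and Mentors</h1>
<h2>Student Hub</h2>
<h2>Knowledge</h2>
</body>
</html>
"""

page_content = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Access the h1 tag as an attribute, then read its id like a dictionary key.
h1_tag = page_content.h1
h1_id = h1_tag["id"]
print(h1_id)  # intro

# Attribute-style access returns only the first matching tag in the document.
first_h2 = page_content.h2
print(first_h2)  # <h2>Student Hub</h2>

# The document still contains two h2 tags; find_all returns every match.
all_h2 = page_content.find_all("h2")
print(len(all_h2))  # 2
```

Note that square-bracket access raises a KeyError if the tag does not have the requested attribute, just as a dictionary would.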
