8 – M5 SC 8 Metacharacters Part 3 V1

Hello, and welcome back. In this notebook, we will learn how to make more complicated regular expressions using groups, the question mark, and the asterisk metacharacters. Let’s see how they work.

Here, we have a multiline string with the names and the heights of the four highest mountains in the world according to Wikipedia. Our goal in this lesson will be to create a regular expression that can find the names of all these mountains.

The first thing to notice is that the word mountain has been abbreviated in two different ways, as Mt without a period and as Mt. with a period. Therefore, if we want to find all the names of the mountains, we need a regular expression that lets us indicate that the period in the abbreviation is optional. We can do this by using the question mark metacharacter. The question mark will match zero or one repetitions of the preceding regular expression. For example, the question mark here will match Mt without a period, which corresponds to the case where there are zero repetitions of the period, and will also match Mt. with a period, which corresponds to the case where there is one repetition of the period. In other words, the question mark here indicates that the period after the Mt is optional. So, if we run this code, we can see that we match both abbreviations, both Mt without a period and Mt. with a period. So far, so good.

Let’s continue building our regular expression. Now, we notice that the next character after the mountain abbreviation is a whitespace character. We will therefore use a backslash s in our regular expression to match this whitespace. After the whitespace character, we have the name of the mountain. We can see that the first letter in all the names is an uppercase letter, so we’ll use a character set of A through Z to match any possible uppercase letter.

Now comes the tricky part. We can see that all the mountain names have different lengths.
However, they only contain alphanumeric characters. If you remember from our previous lesson, to match any alphanumeric character (or an underscore), we can use the sequence backslash w. Now, to be able to match names of any length, we will use the asterisk metacharacter. The asterisk metacharacter matches zero or more repetitions of the preceding regular expression. For example, ab asterisk will match a, or a followed by any number of b’s, such as ab or abbbbb. Therefore, backslash w asterisk will match zero or more alphanumeric characters. So, if we run this code, we can see that we managed to match all the mountain names regardless of their length or abbreviation.

Now, let’s take a look at groups because they’re very useful. Here, we have added a new mountain to our list, but the name of this mountain has two differences from the other ones. The first is that mountain has been abbreviated as Mnt instead of Mt, and the second is that the first letter of the name is lowercase instead of uppercase. To be able to match this new abbreviation, we will use the parenthesis metacharacters to define a group. As their name suggests, groups group together the expressions contained inside them. For example, we saw before that ab asterisk will match a, or a followed by any number of b’s, such as ab or abbbbb. But if we put ab inside parentheses to define a group, then the group ab followed by an asterisk will match zero or more repetitions of ab, such as ab or abababab. We can repeat the contents of any group with any repeating qualifier such as the asterisk, the question mark, or curly brackets as we have seen before.

We can also use the or metacharacter within a group to select between two expressions. For example, we can use the group Mt or Mnt in our regular expression to be able to match either the Mt or the Mnt abbreviation. We also added lowercase letters to our previous character set to be able to match the lowercase first letter in the name of our new mountain.
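The notebook code itself is not part of this transcript. As a rough sketch of the steps described so far, assuming Python’s re module and a made-up stand-in for the mountain string (the variable names, the exact text, and the fifth mountain are assumptions, not the notebook’s own), the patterns might look like this:

```python
import re

# Hypothetical stand-in for the notebook's multiline string.
mountains = """Mt Everest: 8,848 m
Mt. K2: 8,611 m
Mt. Kangchenjunga: 8,586 m
Mt Lhotse: 8,516 m"""

# Step 1: \.? makes the period after Mt optional (zero or one period).
abbreviations = re.findall(r"Mt\.?", mountains)
print(abbreviations)  # ['Mt', 'Mt.', 'Mt.', 'Mt']

# Step 2: add a whitespace character (\s), one uppercase letter ([A-Z]),
# and zero or more word characters (\w*) for the rest of the name.
names = re.findall(r"Mt\.?\s[A-Z]\w*", mountains)
print(names)  # ['Mt Everest', 'Mt. K2', 'Mt. Kangchenjunga', 'Mt Lhotse']

# The asterisk with and without a group:
# ab* repeats only the b, while (ab)* repeats the whole pair.
star_on_b = re.fullmatch(r"ab*", "abbbbb") is not None        # True
star_on_b_rejects = re.fullmatch(r"ab*", "abab") is None      # True
star_on_group = re.fullmatch(r"(ab)*", "abab") is not None    # True

# Step 3: a hypothetical fifth entry with the Mnt abbreviation and a
# lowercase first letter.
more_mountains = mountains + "\nMnt. makalu: 8,485 m"

# (Mt|Mnt) selects either abbreviation; [A-Za-z] allows a lowercase first letter.
pattern = r"(Mt|Mnt)\.?\s[A-Za-z]\w*"
all_names = [m.group(0) for m in re.finditer(pattern, more_mountains)]
print(all_names)
# ['Mt Everest', 'Mt. K2', 'Mt. Kangchenjunga', 'Mt Lhotse', 'Mnt. makalu']
```

One design note on this sketch: once a pattern contains a capturing group, re.findall returns the group contents rather than the whole match, so the last step uses re.finditer and m.group(0) to get the full text of each match.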
If we run this code, we can see that we can match all mountain names, including our new mountain. I also wanted to mention that since the first letter in both abbreviations is an uppercase M, we could alternatively have put the M outside the group, and we would get the same result, as we can see here.
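To illustrate that last point, here is a small sketch comparing the two ways of writing the group. It again assumes Python’s re module and a made-up mountain string; the helper full_matches and the sample text are hypothetical, not taken from the notebook.

```python
import re

# Hypothetical stand-in for the notebook's string, including the new mountain.
text = """Mt Everest: 8,848 m
Mt. K2: 8,611 m
Mt. Kangchenjunga: 8,586 m
Mt Lhotse: 8,516 m
Mnt. makalu: 8,485 m"""

# Both abbreviations inside the group.
m_inside = r"(Mt|Mnt)\.?\s[A-Za-z]\w*"
# The shared M moved outside the group; only the differing part is alternated.
m_outside = r"M(t|nt)\.?\s[A-Za-z]\w*"


def full_matches(pattern, string):
    """Return the full text of every match (group 0), ignoring subgroups."""
    return [m.group(0) for m in re.finditer(pattern, string)]


matches_inside = full_matches(m_inside, text)
matches_outside = full_matches(m_outside, text)

print(matches_inside)
print(matches_inside == matches_outside)  # True
```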
