7 – M5 SC 7 Metacharacters Part 2 V1

In this notebook, we will see a practical application of regular expressions. In particular, we will use the special sequence backslash d to create regular expressions that will allow us to look for phone numbers. We will also learn about character sets and how they can be used to search for more complicated patterns of text.
Here, we have a multi-line string that mimics a phone book. Let’s suppose I wanted to find all the phone numbers in this multi-line string. Notice that all the phone numbers have different digits and that they have different characters separating the digits. For example, here we have dashes, here we have whitespace, and here we have a parenthesis, and then a dash. Even though all these phone numbers have different digits, they all have the same pattern, namely: three digits, followed by a single character, followed by three more digits, followed by a single character, followed by four more digits. We’ll take advantage of this pattern to create a regular expression that can match all these phone numbers.
In this regular expression, the first three backslash d’s can match any sequence of three digits. The dot can match any single character other than a newline. These next three backslash d’s can match, again, any sequence of three digits. This dot, again, can match any single character. Finally, the last four backslash d’s can match any sequence of four digits. So, if we run this code, we can see that we’ve managed to match all the phone numbers in our multi-line string, even though they all have different digits and different characters in between the groups of numbers. Notice that by using the dot, we avoid the trouble of creating three different regular expressions to match the three possible characters separating the groups of numbers.
Now, we can write this regular expression in a more compact form, as shown here, by using the curly bracket meta-characters.
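The notebook's code is not reproduced in this transcript, so here is a minimal sketch of what the long and compact patterns might look like, assuming the notebook uses Python's `re` module. The phone book `sample_text`, including every name and number in it, is hypothetical and only mimics the layout described above.

```python
import re

# Hypothetical phone book; names and numbers are made up for illustration.
sample_text = """
Mr. Brown: 455-123-7896
Mrs. Smith: 655 234 5678
Ms. Jones: 321-555 8899
Mr. Green: 777)888-1234
"""

# Long form: three digits, any character, three digits, any character, four digits.
long_form = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")

# Compact form: the number in curly brackets says how many copies to match.
compact_form = re.compile(r"\d{3}.\d{3}.\d{4}")

long_matches = long_form.findall(sample_text)
compact_matches = compact_form.findall(sample_text)

for number in long_matches:
    print(number)
```

Both patterns should find the same four numbers, regardless of which separator each one uses.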
For example, this three in curly brackets specifies that exactly three copies of the previous regular expression should be matched. Therefore, this sequence backslash d, curly bracket three will match exactly three decimal digits. Similarly, this sequence will match exactly four decimal digits.
Now, let’s suppose I only wanted to find phone numbers in which the groups of digits were separated by either a dash or a whitespace. So, in this example, I only want to find the first three phone numbers, since these are the only ones in which all the groups of numbers are separated by either a dash or a whitespace. To do this, we can use what is known as a character set. A character set is the set of characters that you wish to match. Character sets are specified by using the square bracket meta-characters. For example, this character set will match either a dash or a whitespace. Notice that there’s a whitespace after the dash. So, this regular expression will match any three digits followed by a dash or a whitespace, followed by three more digits, followed by either a dash or a whitespace, and followed by four more digits. So, if we run this code, we can see that we only match the first three phone numbers, as we wanted. We didn’t match the fourth phone number because even though the last group of numbers is separated by a dash, the first group of numbers is separated by a parenthesis, which is not in our character set.
It is important to note that even though a character set can have many characters, it can only match one of those characters at a time. For example, this character set can only match either a dash or a whitespace, but not both. So, if I add a whitespace after each dash in Mr. Brown’s phone number and run this code, I can see that now I get no matches, even though the characters separating each group of numbers belong to our character set. Let’s see another example.
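The character set described above can be sketched as follows, again assuming Python's `re` module and a hypothetical phone book. The second part illustrates that a character set consumes exactly one character, so a dash followed by a space defeats the pattern.

```python
import re

# Hypothetical phone book; names and numbers are made up for illustration.
sample_text = """
Mr. Brown: 455-123-7896
Mrs. Smith: 655 234 5678
Ms. Jones: 321-555 8899
Mr. Green: 777)888-1234
"""

# [- ] matches exactly one character: either a dash or a space.
dash_or_space = re.compile(r"\d{3}[- ]\d{3}[- ]\d{4}")
set_matches = dash_or_space.findall(sample_text)
print(set_matches)

# Adding a space after each dash puts two separator characters between
# the groups, and a single character set cannot match both of them.
spaced_number = "Mr. Brown: 455- 123- 7896"
spaced_matches = dash_or_space.findall(spaced_number)
print(spaced_matches)
```

The fourth number is skipped because the parenthesis is not in the set, and the spaced-out number produces an empty list.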
Now, let’s suppose I only wanted to find phone numbers in which the groups of digits were separated by either a dash or a whitespace and whose area code is either 455 or 655. So, in this example, we want to match this phone number and this phone number. To do this, we can start a regular expression with a character set that has the numbers four and six. The next two numbers are going to be 55, and the rest of the regular expression is the same as we saw before. If we run this code, we can see that we can actually match the two phone numbers that we wanted, with area codes 455 and 655 and whose digits are separated either by a dash or a whitespace.
Now, let’s suppose I wanted to look for phone numbers that end in the numbers six, seven, eight or nine. To do this, we could use a character set like this that contains the numbers six, seven, eight and nine. However, there is a more compact way to do this. Within a character set, when a dash is placed between digits or letters, it is actually used to specify a range. Therefore, this character set can match either a six, a seven, an eight or a nine and is equivalent to this character set. So, if we run this code, we can see that we can actually match all the phone numbers that have the last digit in the range six to nine. Notice that we didn’t match the last phone number because the last digit is a four. The four is not in the range six through nine. The dash can also be used to specify a range between letters. For example, the character set a-f is the same as the character set a, b, c, d, e, f. It’s also important to note that when a dash is placed at the beginning of a character set, like we did in this example, the dash is taken literally and is not used to specify any range.
As our last example, let’s suppose I wanted to find the phone numbers that do not end in the numbers six, seven, eight or nine. In this case, we could use the character set zero through five. However, we can also use the caret.
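A sketch of the area-code set and the range syntax, assuming Python's `re` module and a hypothetical phone book. The transcript does not show which separators the notebook uses in the last-digit example, so the dot separators below are an assumption.

```python
import re

# Hypothetical phone book; names and numbers are made up for illustration.
sample_text = """
Mr. Brown: 455-123-7896
Mrs. Smith: 655 234 5678
Ms. Jones: 321-555 8899
Mr. Green: 777)888-1234
"""

# [46] matches a single 4 or 6, so [46]55 matches area codes 455 and 655.
area_code = re.compile(r"[46]55[- ]\d{3}[- ]\d{4}")
area_matches = area_code.findall(sample_text)

# Explicit set versus range: both require the last digit to be 6, 7, 8 or 9.
explicit_matches = re.findall(r"\d{3}.\d{3}.\d{3}[6789]", sample_text)
range_matches = re.findall(r"\d{3}.\d{3}.\d{3}[6-9]", sample_text)

# Ranges work for letters too: [a-f] behaves like [abcdef].
letters_range = re.findall(r"[a-f]", "regex cafe")
letters_explicit = re.findall(r"[abcdef]", "regex cafe")

print(area_matches)
print(range_matches)
```

In `[- ]` the dash comes first, so it is read as a literal dash, while in `[6-9]` it sits between two digits and defines a range.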
We already learned that outside of a character set, the caret matches a sequence of characters when they are located at the beginning of a string. However, when the caret appears at the beginning of a character set, like here, it negates the set. This means it matches any character that is not in that character set. For example, this character set with a caret at the beginning will match any character that is not a six, a seven, an eight or a nine. So, if we run this code, we can see that we only get the last phone number, because its last digit is a four and the four is not in the range six through nine. The caret also works for letters. Therefore, this character set will match any character that is not a lowercase or uppercase letter.
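The negated sets can be sketched like this, assuming Python's `re` module, a hypothetical phone book, and dot separators between the groups of digits (the transcript does not show the exact pattern used in the notebook).

```python
import re

# Hypothetical phone book; names and numbers are made up for illustration.
sample_text = """
Mr. Brown: 455-123-7896
Mrs. Smith: 655 234 5678
Ms. Jones: 321-555 8899
Mr. Green: 777)888-1234
"""

# [^6-9] matches any single character that is NOT a 6, 7, 8 or 9.
not_six_to_nine = re.compile(r"\d{3}.\d{3}.\d{3}[^6-9]")
negated_matches = not_six_to_nine.findall(sample_text)
print(negated_matches)

# [^a-zA-Z] matches any single character that is not a letter.
non_letters = re.findall(r"[^a-zA-Z]", "Mr. Brown: 455")
print(non_letters)
```

A negated set matches any character outside the set, not just digits, so `[^6-9]` would also accept a letter or a punctuation mark in the last position. A set such as `[0-5]` is stricter if only digits should be allowed there.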