
Patterns

slide 01 Hello, and welcome to the fourth episode of the Software Carpentry lecture on regular expressions. In this episode, we’ll have a look at a few more patterns you can use to build up regular expressions.
slide 02 If you recall, we’re trying to parse data from notebooks recording background evil levels in millivaders at several sites in the Shire a couple of years after the explosion of the Death Star.
slide 03 These records are in different formats, and a couple of episodes ago…
slide 04 …we managed to build this function to extract the dates from those records. Inside this function, we’re applying a regular expression to the record. If it matches, we’re returning the matched groups, reordering them as necessary so that we always get back year, month, and day.
slide 05 This version of the function does a better job of pulling data out of our records. First, it gets the site and reading, as well as the year, month, and day. Second, and more importantly, this function is more declarative: the variable patterns stores one entry for each format of record we think we have to parse.
slide 06 The first element of each entry is a regular expression to match data in that format.
slide 07 The remaining fields in the entry are a permutation of the indices of the groups in that pattern.
slide 08 In our loop, we pull the pattern and the indices for the year, month, day, site, and reading out of each entry in the table in turn.
slide 09 If the pattern matches…
slide 10 we then return the matched groups, permuting them according to the indices so that the data always comes back in the same order: year, month, day, site, and reading.
slide 11 Why is this better? Well, every time we have another data format to match, all we have to do is add one more entry. This makes this function very easy to extend, and very easy to test.
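The slides themselves are not reproduced in this transcript, so here is a minimal sketch of what such a table-driven function might look like. The name get_fields, the two record formats, and the sample records are assumptions for illustration, not necessarily the lecture's exact code.

```python
import re

# One entry per record format: (pattern, year, month, day, site, reading),
# where the five integers are group indices within that pattern.
# Both formats below are assumed for illustration.
PATTERNS = [
    # Assumed format: "Baker 1<TAB>2009-11-17<TAB>1223.0"
    ('(.+)\t([0-9]{4})-([0-9]{2})-([0-9]{2})\t(.+)', 2, 3, 4, 1, 5),
    # Assumed format: "Davison/May 22, 2010/1721.3"
    ('(.+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.+)', 4, 2, 3, 1, 5),
]

def get_fields(record):
    """Return (year, month, day, site, reading), or None if no pattern matches."""
    for pattern, year, month, day, site, reading in PATTERNS:
        match = re.search(pattern, record)
        if match:
            # Permute the groups so the order is the same for every format.
            return (match.group(year), match.group(month), match.group(day),
                    match.group(site), match.group(reading))
    return None

print(get_fields('Baker 1\t2009-11-17\t1223.0'))   # ('2009', '11', '17', 'Baker 1', '1223.0')
print(get_fields('Davison/May 22, 2010/1721.3'))   # ('2010', 'May', '22', 'Davison', '1721.3')
```

Supporting a new format then means adding one more tuple to PATTERNS; the loop itself does not change.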
slide 12 So let’s take a look at Notebook #3. It has the date as three fields, the site name in parentheses, and then the reading. We know how to parse dates in this format…
slide 13 …and the fields are separated by spaces…
slide 14 …but how do we match against those parentheses?
slide 15 So far, when we’ve seen parentheses in regular expressions, they haven’t matched characters: they’ve created groups.
slide 16 To match a literal open parenthesis ‘(’ or close parenthesis ‘)’ using a regular expression, we put backslash-open parenthesis ‘\(’ or backslash-close parenthesis ‘\)’ in the RE.
slide 17 This is another example of an escape sequence. Just as we use the two-character sequence ‘\t’ in a string to represent a literal tab character, we use the two-character sequence ‘\(’ or ‘\)’ in a regular expression to match the literal character ‘(’ or ‘)’.
slide 18 However, in order to get that backslash ‘\’ into the string, we have to escape it by doubling it up.
slide 19 So the string representation of the regular expression that matches an opening parenthesis is actually ‘\\(’. This might be confusing, so let’s take a look at how the various layers work.
slide 20 Our program text—i.e., what’s stored in our .py file—looks like this. Here, we have two backslashes, an open parenthesis, two backslashes, and a close parenthesis inside quotes.
slide 21 When Python reads that file in, it turns the two-character sequence ‘\\’ into a single literal ‘\’ character in the string in memory. That’s the first level of escaping.
slide 22 When we hand the string ‘\(\)’ to the regular expression library, it takes the two-character sequence ‘\(’ and turns it into an arc in the finite state machine that matches a literal parenthesis. Working backward: if we want a literal parenthesis to be matched, we have to give the regular expression library ‘\(’, and to put ‘\(’ in a string, we have to write it in our .py file as ‘\\(’.
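A short sketch can make the two layers of escaping visible. The variable name and the sample text 'f()' are made up for illustration.

```python
import re

# In the .py file we type six characters between the quotes: \\(\\)
pattern = '\\(\\)'

# First layer: Python turns each \\ into a single backslash,
# so the string in memory has four characters: \ ( \ )
print(len(pattern))   # 4
print(pattern)        # \(\)

# Second layer: the regular expression library reads \( and \)
# as "match a literal parenthesis".
match = re.search(pattern, 'f()')
print(match.group(0))  # ()
```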
slide 23 With that out of the way, let’s go back to Notebook #3. The regular expression that will extract the five fields from each record…
slide 24 …looks like this: '([A-Z][a-z]+) ([0-9]{1,2}) ([0-9]{4}) \\((.+)\\) (.+)' A word beginning with an upper-case character followed by one or more lower-case characters, a space, one or two digits, another space, four digits, another space, some stuff involving backslashes and parentheses, another space, and then one or more characters, which is the reading.
slide 25 If we take a closer look at that “stuff”, ‘\\(’ and ‘\\)’ are how we write the regular expressions that match a literal open parenthesis ‘(’ or close parenthesis ‘)’ character in our data.
slide 26 The two inner parentheses that don’t have backslashes in front of them create a group, but don’t match any literal characters.
slide 27 We create that group so that we can save the results of the match—in this case, the name of the site.
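As a quick check, the Notebook #3 pattern can be applied to a record in that layout. The sample record below is hypothetical; it only follows the format described above (date as three fields, site in parentheses, then the reading).

```python
import re

# The Notebook #3 pattern from the lecture.
NOTEBOOK_3 = '([A-Z][a-z]+) ([0-9]{1,2}) ([0-9]{4}) \\((.+)\\) (.+)'

record = 'May 3 2010 (Hartnell) 1029.3'   # hypothetical sample record

match = re.search(NOTEBOOK_3, record)
# The groups come back in pattern order: month, day, year, site, reading.
month, day, year, site, reading = match.groups()
print(year, month, day, site, reading)    # 2010 May 3 Hartnell 1029.3
```

The escaped parentheses are consumed by the match but are not captured; only the unescaped pair around '.+' produces the site group.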
slide 28 Now that we know how to work with backslashes in regular expressions, we can take a look at character sets that come up frequently enough to deserve their own abbreviations.
slide 29 If you use ‘\d’ in a regular expression, it matches the digits 0 through 9.
slide 30 If you use ‘\s’, it matches the whitespace characters (space, tab, carriage return, and newline, plus a few rarer ones such as form feed).
slide 31 And ‘\w’ matches word characters: it’s equivalent to the set '[A-Za-z0-9_]', i.e., upper-case letters, lower-case letters, digits, and the underscore.
slide 32 This might seem a funny definition of “word”; it’s actually the set of characters that can appear in a variable name in a programming language like C or Python.
slide 33 Again, in order to write one of these regular expressions as a string in Python, you have to double up the backslashes.
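Here is a small sketch of the three abbreviations applied to a made-up piece of ASCII text, with the backslashes doubled as described.

```python
import re

text = 'site_7\t42 mV'   # made-up sample; \t here is a real tab character

digits = re.findall('\\d', text)    # individual digits
spaces = re.findall('\\s', text)    # individual whitespace characters
words = re.findall('\\w+', text)    # runs of word characters

print(digits)   # ['7', '4', '2']
print(spaces)   # ['\t', ' ']
print(words)    # ['site_7', '42', 'mV']
```

Note that the underscore counts as a word character, which is why 'site_7' comes back as a single word.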
slide 34 Now that we’ve seen these character sets, we can take a look at an example of really bad design.
slide 35 ‘\S’ means “non-space characters”, i.e., everything that isn’t a space, tab, carriage return, or newline. That might seem to contradict what I said a few seconds ago…
slide 36 but that’s an upper-case ‘S’, not a lower-case ‘s’.
slide 37 Similarly, and unfortunately, ‘\W’ means “non-word characters”…
slide 38 …provided it’s an upper-case ‘W’. Upper- and lower-case ‘S’ and ‘W’ look very similar, particularly when there aren’t other characters right next to them to give context.
slide 39 This means that these sequences are very easy to mis-type…
slide 40 …and what’s worse, even easier to mis-read. Everyone eventually uses an upper-case ‘S’ when they meant to use a lower-case ‘s’, or vice versa, and then wastes a few hours trying to track it down. So please, if you’re ever designing a library that’s likely to be widely used, try to choose a notation that doesn’t make mistakes this easy.
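To see how much difference the case of one letter makes, this sketch runs the lower-case and upper-case forms over the same made-up text.

```python
import re

text = 'evil: 42 mV'   # made-up sample text

lower_s = re.findall('\\s', text)    # whitespace characters
upper_s = re.findall('\\S+', text)   # runs of NON-whitespace characters
upper_w = re.findall('\\W', text)    # NON-word characters

print(lower_s)   # [' ', ' ']
print(upper_s)   # ['evil:', '42', 'mV']
print(upper_w)   # [':', ' ', ' ']
```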
slide 41 Along with the abbreviations for character sets, the regular expression library recognizes a few shortcuts that match things that aren’t actual characters.
slide 42 For example, if you put a circumflex ‘^’ at the start of a pattern, it matches the beginning of the input text.
slide 43 So the pattern '^mask' will match the text 'mask size' because the letters ‘mask’ come at the start of the string.
slide 44 But that same pattern will not match the word 'unmask'.
slide 45 Going to the other end, if a dollar sign ‘$’ is the last character in the pattern, it matches the end of the input text rather than a literal ‘$’.
slide 46 So ‘temp$’ will match the string ‘high-temp’…
slide 47 …but it won’t match the string ‘temperature’.
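The four anchor examples above can be checked directly. This sketch uses re.search and a hypothetical helper, matches, that only reports whether a match was found.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

print(matches('^mask', 'mask size'))     # True: 'mask' is at the start
print(matches('^mask', 'unmask'))        # False: 'mask' is not at the start
print(matches('temp$', 'high-temp'))     # True: 'temp' is at the end
print(matches('temp$', 'temperature'))   # False: 'temp' is not at the end
```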
slide 48 A third shortcut that’s often useful is ‘\b’, often called “break”. It matches the boundary between word and non-word characters: it doesn’t actually match any characters—it doesn’t consume any input—but it matches the transition between non-word characters and letters, digits, and the underscore.
slide 49 If we have '\\bage\\b', it will match the string 'the age of', because there’s a non-word character right before the ‘a’, and another non-word character right after the ‘e’.
slide 50 That same pattern will not match the word 'phage' because there isn’t a transition from non-word to word characters, or vice versa, right before the ‘a’.
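The same two cases can be tried in Python. Doubling the backslash matters here in particular, because in an ordinary Python string a single-backslash ‘\b’ is the escape sequence for a backspace character, not a word boundary.

```python
import re

pattern = '\\bage\\b'   # in memory: \bage\b

in_phrase = re.search(pattern, 'the age of')
in_word = re.search(pattern, 'phage')

print(in_phrase.group(0))   # age
print(in_word)              # None

# The boundary itself consumes no input: the match is just the three letters.
print(in_phrase.span())     # (4, 7)
```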
slide 51 We’ve now seen about a dozen of the atoms that are used to build regular expressions. There are many more, and every language or library adds a few of its own. In the next episode, we’ll take a closer look at the functions in the regular expression library that are used to apply these to problems.

  1. Terri Yu
    February 24th, 2011 at 18:39 | #1

    In Slides 25-27, the pattern ‘([A-Z][a-z]+ [0-9]{1,2} ([0-9]{4}) \\((.+)\\) (.+)’ at the bottom of the slide has an unmatched parenthesis (the first character).