Operators

slide 01 Hello, and welcome to the second episode of the Software Carpentry lecture on regular expressions. In this episode, we’ll have a look at some operators you can use in your regular expressions.
slide 02 If you recall, we have several notebooks full of data measuring background evil levels in millivaders. Notebook #1 has these as site, date, and background evil level…
slide 03 … with single tabs as separators.
slide 04 Some of the site names have spaces…
slide 05 …and the dates are in the international standard format, with four digits for the year, two for the month, and two for the day.
slide 06 The data in Notebook #2 also has three fields…
slide 07 …but these are separated by slashes.
slide 08 Months are reported using their names, and are of varying length. The days are also of varying length.
slide 09 We saw in the previous episode that regular expressions are patterns that can be used to match text. Letters and digits match themselves; vertical bar ‘|’ means OR; the dot ‘.’ matches any single character; you can use parentheses ‘()’ to enforce grouping; the re.search function returns a match object if a match is found or None if one is not; and if a match is found, match.group(k) is the text that matched parenthesized group k.
slide 10 Before we look at how to use regular expressions to extract data from Notebook #2, let’s see how we would do it with simple strings. If our record is the string shown in the first line of code ('Davison/May 22, 2010/1721.3'), we could split on slashes to get the site, the date, and the reading, then split the middle field on spaces to get month, day, and year. Finally, we would remove the comma from the day if it is present, because, if you recall, some of our readings don’t have a comma after the day.
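The procedural steps just described might look like the sketch below. It is an illustration rather than the code on the slide, and the variable names are assumptions.

```python
record = 'Davison/May 22, 2010/1721.3'

# Split on slashes to get the site, the date, and the reading.
site, date, reading = record.split('/')

# Split the middle field on spaces to get month, day, and year.
month, day, year = date.split(' ')

# Remove the comma from the day if it is present.
if day.endswith(','):
    day = day[:-1]

print(site, month, day, year, reading)
```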
slide 11 This is a procedural way to solve the problem: we are telling the computer how to do something.
slide 12 Regular expressions, by contrast, are declarative: we tell the computer what we want, and it figures out how to do it.
slide 13 Our first attempt to parse this data relies on the star ‘*’ operator.
slide 14 ‘*’ means “zero or more repetitions of the pattern that comes before it”.
slide 15 It is a postfix operator, just like the exponent 2 in x².
slide 16 So ‘.*’ means “zero or more characters”, because ‘.’ matches any character, and ‘*’ forces the preceding pattern—the ‘.’—to match zero or more times.
slide 17 In order for the entire pattern to match, the slashes ‘/’ have to line up exactly, because ‘/’ matches against itself. That’s why this seems to grab the site name, the date, and the reading correctly.
slide 18 Unfortunately, we’ve been over-generous. Our pattern matches the string ‘//’, and here we’re printing out a ‘*’ as well as the group so that you can see there actually are three lines of output.
slide 19 ‘.*’ can match the empty string, because that’s zero or more occurrences of a character.
slide 20 That means our pattern will accept badly-formatted data, which is likely to cause us headaches down the road.
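A short sketch of the over-generous match, assuming the pattern under discussion is '(.*)/(.*)/(.*)'. Each group is printed after a ‘*’ so that empty groups still produce a visible line.

```python
import re

pattern = '(.*)/(.*)/(.*)'

# A well-formed record: the two slashes line up with the two in the pattern.
m = re.search(pattern, 'Davison/May 22, 2010/1721.3')
well_formed = [m.group(k) for k in (1, 2, 3)]

# A string containing only slashes also matches: every group is empty.
m = re.search(pattern, '//')
only_slashes = [m.group(k) for k in (1, 2, 3)]

for text in well_formed + only_slashes:
    print('*' + text)
```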
slide 21 Let’s try a variation that uses ‘+’ (plus) instead of ‘*’ (star).
slide 22 In a regular expression, ‘+’ is a postfix operator meaning “one or more”, i.e., it has to match at least one occurrence of the pattern that comes before it. As you can see, the pattern ‘(.+)/(.+)/(.+)’ doesn’t match a string containing only slashes because there aren’t characters before, between, or after the slashes for the .+’s to match.
slide 23 If we go back and check it against real data, it seems to be doing the right thing.
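The same check with ‘+’ in place of ‘*’ can be sketched like this; the variable names are illustrative.

```python
import re

pattern = '(.+)/(.+)/(.+)'

# Nothing before, between, or after the slashes, so there is no match.
only_slashes = re.search(pattern, '//')
print(only_slashes)

# Real data still matches, and the groups hold the three fields.
real = re.search(pattern, 'Davison/May 22, 2010/1721.3')
fields = [real.group(k) for k in (1, 2, 3)]
print(fields)
```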
slide 24 We’re actually going to be matching a lot of patterns against a lot of strings, so let’s write a function that will apply a pattern to a piece of text, report if there is no match, and if there is a match, print out all of the groups in order. Here, we’re testing our little function against the record we were just using.
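One way such a helper might be written is sketched below. The name show_groups and the output format are assumptions, not necessarily what appears on the slide.

```python
import re

def show_groups(pattern, text):
    """Apply pattern to text; report no match, or print each group in order."""
    m = re.search(pattern, text)
    if m is None:
        print('NO MATCH')
        return
    # m.groups() returns the text of every parenthesized group, in order.
    for i, group in enumerate(m.groups(), start=1):
        print(i, group)

show_groups('(.+)/(.+)/(.+)', 'Davison/May 22, 2010/1721.3')
show_groups('(.+)/(.+)/(.+)', '//')
```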
slide 25 If we’re using regular expressions to extract the site, the date, and the reading, why not break up the date while we’re at it? This pattern pulls out the month, the day, and the year at the same time as it pulls out the site and the reading.
slide 26 But wait a second: why doesn’t this work?
slide 27 You probably didn’t notice that this record does not have a comma after the day. The pattern does have one, so this pattern doesn’t match this string.
slide 28 Let’s fix that by putting a question mark ‘?’ after the comma. In a regular expression, ‘?’ is a postfix operator meaning “0 or 1 of whatever comes before it”.
slide 29 I.e., the pattern that comes before the question mark is optional. Now, this pattern successfully matches data without a comma…
slide 30 …and when we test on data with a comma, it still works.
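The effect of the optional comma can be checked with a sketch like the following. The exact patterns are assumptions based on the description above: one with a required comma after the day, and one with ‘,?’.

```python
import re

with_comma = 'Davison/May 22, 2010/1721.3'
without_comma = 'Davison/May 22 2010/1721.3'

strict = '(.+)/(.+) (.+), (.+)/(.+)'     # comma is required
relaxed = '(.+)/(.+) (.+),? (.+)/(.+)'   # comma is optional

# The strict pattern fails on the record that has no comma.
strict_results = [re.search(strict, r) is not None
                  for r in (with_comma, without_comma)]

# The relaxed pattern matches both records.
relaxed_results = [re.search(relaxed, r) is not None
                   for r in (with_comma, without_comma)]

print(strict_results, relaxed_results)
```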
slide 31 Let’s tighten up our pattern a little bit more. We don’t want to match this record.
slide 32 Somebody has mis-typed the year, and given us three digits instead of four—either that, or whoever took this reading was taking advantage of the physics department’s time machine.
slide 33 We could use four dots in a row to force the pattern to match exactly four characters…
slide 34 …but this won’t win any awards for readability.
slide 35 Instead, let’s put the digit ‘4’ in curly braces ‘{}’ after the dot.
slide 36 Curly braces with a number between them in a regular expression is a postfix operator meaning “match the pattern exactly this many times”. Here, we mean “match ‘.’ four times against the string”.
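A minimal sketch of the ‘{4}’ repetition follows. The record with the three-character year is a made-up example, and the pattern is assumed to be the previous one with ‘(.{4})’ for the year.

```python
import re

pattern = '(.+)/(.+) (.+),? (.{4})/(.+)'

good = re.search(pattern, 'Davison/May 22, 2010/1721.3')
bad = re.search(pattern, 'Davison/May 22, 201/1721.3')   # mistyped year

print(good.group(4))   # the four characters captured for the year
print(bad)             # no match: only three characters before the slash
```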
slide 37 Let’s do a few more tests. Here are some records in which the dates are either correct or mangled. And here’s a pattern that should match all the records that are correct, but should fail to match all the records that have been mangled. We are expecting four digits for the year…
slide 38 …and we are allowing 1 or 2 digits for the day: the expression ‘{M,N}’ matches a pattern from M to N times. Here, we’re allowing from 1 to 2 characters for the day.
slide 39 When we run this pattern against our test data, we see that three records match. The second and third make sense: ‘May 2’ is valid, and ‘May 22’ is valid.
slide 40 But why does ‘May’ with no day at all match this pattern? Let’s look at that test case more closely.
slide 41 The groups are ‘Davison’ (that looks right), ‘May’ (looks good so far), a ‘,’ on its own (which is clearly wrong), and then the right year and the right reading.
slide 42 Here’s what’s happened. The space ‘ ’ after ‘May’ matches the space ‘ ’ in the pattern.
slide 43 The expression “1 or 2 occurrences of any character” matches the comma ‘,’ because ‘,’ is a character and it’s occurring once.
slide 44 The expression ‘,?’ is then not matched against anything, because it’s allowed to match zero characters. ‘?’ means “optional”, and in this case, the regular expression pattern matcher is deciding not to match it against anything, because that’s the only way to get the whole pattern to match the whole string.
slide 45 And then of course the second space matches the second space in our data. This is obviously not what we want, so let’s modify our pattern again.
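The surprising match can be reproduced with the sketch below. The record string is an assumption reconstructed from the groups described above, and the pattern is assumed to use ‘(.{1,2})’ for the day.

```python
import re

pattern = '(.+)/(.+) (.{1,2}),? (.{4})/(.+)'

# A mangled record with no day between the month and the comma.
m = re.search(pattern, 'Davison/May , 2010/1721.3')
groups = m.groups()

# The day group has captured the comma, and ',?' has matched nothing.
print(groups)
```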
slide 46 The pattern here ('(.+)/(.+) ([0-9]{1,2}),? (.{4})/(.+)') does the right thing for the case where there’s no day, and also does the right thing for the case where there are characters for the day.
slide 47 What’s going on? Well, instead of using ‘.’, we’re using ‘[0-9]’. In a regular expression, square brackets ‘[]’ are used to create a set of characters.
slide 48 For example, the expression ‘[aeiou]’ will match exactly one vowel: it matches one instance of any character in the set. You can either write these sets out character by character, as we’ve done with vowels, or if the characters are in a contiguous range, write them as “first character ‘-’ last character”, as we’ve done with the digits.
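Character sets can be tried out with a short sketch like this one. The test strings are illustrative; the longer pattern is the one quoted above with ‘[0-9]{1,2}’ for the day.

```python
import re

# A set written out character by character matches one vowel.
vowel = re.search('[aeiou]', 'Davison').group(0)
no_vowel = re.search('[aeiou]', 'rhythm')

# A range of digits for the day rejects the record with no day...
pattern = '(.+)/(.+) ([0-9]{1,2}),? (.{4})/(.+)'
missing_day = re.search(pattern, 'Davison/May , 2010/1721.3')

# ...and still accepts a record whose day is made of digits.
day = re.search(pattern, 'Davison/May 22, 2010/1721.3').group(3)

print(vowel, no_vowel, missing_day, day)
```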
slide 49 Here’s our completed pattern: '(.+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.+)'
slide 50 We’ve added one more feature to it: the name of the month has to begin with an upper-case letter, i.e., a character in the set ‘[A-Z]’…
slide 51 …followed by one or more lower-case characters in the set ‘[a-z]’.
slide 52 The day is one or two occurrences of the digits 0 through 9.
slide 53 This will allow “days” like ‘0’, ‘00’, ‘99’, and so on.
slide 54 We’re going to check for that after we convert the day to an integer…
slide 55 …since the valid range depends on which month we’re in, and that can’t be done declaratively—think, for example, about how we would have to handle leap years.
slide 56 Finally, the year is exactly four digits, so it’s the set of characters ‘[0-9]’ repeated four times.
slide 57 Again, we’ll check for invalid values like ‘0000’ after we convert to integer.
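The follow-up checks on the day and the year might be sketched as below. The function name is_valid_date, the MONTHS list, and the choice to treat year 0 as invalid are all assumptions made for illustration; the lecture does not show this code. The standard library function calendar.monthrange supplies the number of days in a month, including leap years.

```python
import calendar

# Hypothetical lookup table: full English month names, as in Notebook #2.
MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

def is_valid_date(year, month, day):
    """Check strings captured by the pattern after converting to integers."""
    if month not in MONTHS:
        return False
    y, d = int(year), int(day)
    if y < 1:                       # rejects '0000'
        return False
    # monthrange returns (weekday of first day, number of days in month).
    days_in_month = calendar.monthrange(y, MONTHS.index(month) + 1)[1]
    return 1 <= d <= days_in_month

print(is_valid_date('2010', 'May', '22'))
print(is_valid_date('2010', 'May', '99'))
```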
slide 58 With the tools we’ve seen so far, we can write a simple function that will extract the date from either of the notebooks we’re looking at, and return the year, the month, and the day as strings. First, we test to see if the record has a match for an ISO-formatted date: four digits for the year, dash, two for the month, dash, two for the day. If it does, then we’re done: we return those three fields. Otherwise, we test the record to see if we can find the name of a month, one or two digits for the day, and then four digits for the year, within slashes. If so, we return those, permuting the order so that it’s year, month, day. If neither pattern matched then we return None to signal that we can’t do anything. This is a very common way to use regular expressions: rather than trying to combine everything into one enormous pattern, we have one pattern for each valid format of data. We test, and if the test succeeds, we return what we found. If it doesn’t, we move on to the next pattern. Working this way is more readable; it’s also easier to extend if we have to handle other data formats.
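A function along these lines might look like the following sketch. The name get_date, the exact patterns, and the sample records are assumptions based on the description above, not a copy of the slide.

```python
import re

def get_date(record):
    """Return (year, month, day) as strings, or None if no date is found."""
    # ISO format: four digits, dash, two digits, dash, two digits.
    m = re.search('([0-9]{4})-([0-9]{2})-([0-9]{2})', record)
    if m:
        return m.group(1), m.group(2), m.group(3)

    # Month name, one or two digits for the day, optional comma,
    # then four digits for the year, all within slashes.
    m = re.search('/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/', record)
    if m:
        # Permute the groups so the order is year, month, day.
        return m.group(3), m.group(1), m.group(2)

    return None

print(get_date('Davison\t2010-05-22\t1721.3'))
print(get_date('Davison/May 22, 2010/1721.3'))
print(get_date('no date here'))
```

Supporting another data format would mean adding one more pattern and one more test before the final return.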