
Introduction

slide 01 Hello, and welcome to the first episode of the Software Carpentry lecture on regular expressions. In order to understand what regular expressions are for, let’s have a look at the sort of cleanup job that comes up frequently when dealing with real field data.
slide 02 A couple of years after the Death Star exploded…
slide 03 …a hot-shot reporter at the Daily Planet heard that…
slide 04 …children in the Shire…
slide 05 …were starting to act a little bit strangely.
slide 06 Your supervisor…
slide 07 …sent some of his grad students off to collect some data.
slide 08 Things didn’t go so well for them…
slide 09 …but their notebooks were recovered and later transcribed.
slide 10 Your job is to read in 20 or 30 files, each of which contains several hundred measurements of background evil levels, and convert them into a uniform format so that they can be processed further.
slide 11 Each of the readings has the name of the site where the reading was taken, the date the reading was taken on, and of course the background evil level in millivaders. The problem is, these files aren’t formatted in the same way.
slide 12 Some of them use tabs to separate fields, others use commas.
slide 13 And the dates are written in several different styles.
slide 14 Let’s take a look at one of those files. As you can see…
slide 15 …it uses a single tab between each column as a separator.
slide 16 (The spaces in the site names look similar to tabs on screen, but they’re different characters.)
slide 17 The good news is, the dates are written in the international standard format: four digits for the year, two for the month, and two for the day.
slide 18 Let’s have a look at the second notebook.
slide 19 Here, they’re using slashes as separators.
slide 20 There don’t appear to be spaces in the site names…
slide 21 …but the month names and day numbers are of varying length. The months are text, and the order is month-day-year rather than year-month-day. Parsing these files using string search would be difficult and error-prone.
slide 22 The right solution is to use regular expressions.
slide 23 A regular expression is just a pattern that a string can match. You’ve probably used these already.
slide 24 When you say "*.txt" to a computer, you’re matching all of the filenames that end in ".txt". The "*" matches any number of characters: it’s a pattern.
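As a small illustration of the wildcard idea, Python's standard `fnmatch` module applies the same shell-style patterns to strings. The filenames below are made up for the example.

```python
from fnmatch import fnmatch

# Hypothetical filenames, invented for this sketch.
filenames = ["notebook-1.txt", "notebook-2.txt", "summary.csv"]

# "*" stands for any run of characters, so "*.txt" selects names ending in ".txt".
text_files = [name for name in filenames if fnmatch(name, "*.txt")]

print(text_files)
```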
slide 25 A warning before we go any further: the notation for regular expressions is ugly, even by the standards of programming.
slide 26 The problem is that we’re writing patterns to match strings, but we’re writing those patterns as strings…
slide 27 …using only the symbols that are on the keyboard, instead of inventing new symbols the way mathematicians often do.
slide 28 Let’s start by reading in data from two files, and grabbing the first few lines of each.
slide 29 When we print out the results in the list readings, we can see that we’ve got six lines from the first data file, and six from the second. We’ll test our regular expressions against this data to see how well or how poorly we’re matching different formats of records as we go along.
slide 30 Without regular expressions, we can select records that have the month “06” just by testing whether "06" is in the record.
slide 31 If we want to select data for two months, we have to test whether '06' is in the record or '07' is in the record.
slide 32 We should realize that there’s a problem here. If we say, '05' in record, it isn’t matching against the month: it’s matching against the day.
slide 33 Right now, we have no easy way to distinguish those two cases. This is a problem we’ll come back to later.
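The substring tests described above might look like the following sketch. The records are invented stand-ins in the tab-separated format of the first notebook (site, date, reading); they are not the real transcribed data.

```python
# Hypothetical sample records: site<TAB>date<TAB>reading.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

# Select one month, then two months, using plain substring tests.
june = [r for r in readings if "06" in r]
june_or_july = [r for r in readings if "06" in r or "07" in r]

# The catch: here '05' is found in the day field, not the month field.
may = [r for r in readings if "05" in r]

for r in may:
    print(r)
```

The last list contains a January record, which is exactly the false match the narration warns about.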
slide 34 Let’s try using a regular expression to do our matches instead of the simple string-in operator. We import the regular expressions library, and then say, for each record, if regular expression search can find a match for the string '06' in the record, then we’ll print it out.
slide 35 So far, this is matching exactly what "06" in r would match—it’s not much of an improvement.
slide 36 But look what happens if we want to match a month of “06” or a month of “07”. We can combine the two in a single pattern. Let’s take a closer look at this code.
slide 37 The first argument to re.search is the pattern we are searching for.
slide 38 That pattern is written down as a string.
slide 39 The second argument is the data we are searching in.
slide 40 A very common mistake is to reverse these, putting the data first and the pattern second. This can be quite hard to track down, because the call usually runs without an error and simply fails to match, so please be careful.
slide 41 The vertical bar in the pattern means “or”.
slide 42 We’re telling the regular expression engine that we want to match either the text on the left of the vertical bar, or the text on the right, but we’re going to do the match in a single search.
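A minimal sketch of the search just described, again using invented records rather than the real data. It also shows what happens when the two arguments are swapped.

```python
import re

# Hypothetical sample records: site<TAB>date<TAB>reading.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

# Pattern first, data second: '06|07' means "either '06' or '07'".
matched = [r for r in readings if re.search("06|07", r)]
for r in matched:
    print(r)

# Arguments reversed: each record is treated as the pattern and '06|07'
# as the data. For these records no error is raised; nothing matches.
reversed_hits = [r for r in readings if re.search(r, "06|07")]
print(len(reversed_hits))
```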
slide 43 We’re going to be trying to match a lot of patterns against our data, so let’s write a function that will tell us which records match a particular pattern. Our function show_matches takes a pattern and a list of strings, and then for each of those strings, if the pattern matches, we print out two stars as a marker, otherwise we just print out some blanks.
slide 44 Let’s test our function right away. If we try to match '06|07' against the data that we read in earlier, it seems to be doing the right thing: we’ve got stars beside the two records that have month '06' or month '07'.
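One way the `show_matches` function described above could be written is sketched here. The exact marker spacing and the sample records are assumptions made for illustration.

```python
import re

def show_matches(pattern, strings):
    """Print every string, marked with '**' when the pattern matches it."""
    for s in strings:
        if re.search(pattern, s):
            print("**", s)
        else:
            print("  ", s)

# Hypothetical sample records: site<TAB>date<TAB>reading.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

show_matches("06|07", readings)
```

Printing every record, matched or not, makes it easy to see at a glance both what a pattern catches and what it misses.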
slide 45 But why doesn’t this next pattern work? If we match '06|7', it seems to be matching a lot of things that don’t have the month '06' or '07'.
slide 46 Think back to mathematics. The expression ab+c means, “a times b plus c.”
slide 47 Multiplication is implied simply by putting a and b next to each other, and it has higher precedence than addition: we always do multiplication before we do addition.
slide 48 If we want to force the other meaning, we write, “a times (b plus c).”
slide 49 The same thing happens with regular expressions. If we say '06|7', it means exactly that: either '06' or the digit '7'.
slide 50 And if you look back at our data, there are a lot of 7s in our file.
slide 51 If we want to match '06' or '07', we can parenthesize as shown here: '0(6|7)'.
slide 52 Having said that, the expression '06|07' is probably more readable to most people anyway.
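The precedence difference can be checked directly. This sketch compares the three spellings on a few invented records.

```python
import re

# Hypothetical sample records: site<TAB>date<TAB>reading.
readings = [
    "Baker 1\t2009-11-17\t1223.0",   # contains a 7, but only in the day
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

def hits(pattern):
    """Return the records that the pattern matches somewhere."""
    return [r for r in readings if re.search(pattern, r)]

loose = hits("06|7")         # '06' OR the single digit '7'
grouped = hits("0(6|7)")     # '0' followed by '6' or '7'
spelled_out = hits("06|07")  # same meaning as the grouped form

print(len(loose), len(grouped), len(spelled_out))
```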
slide 53 Let’s go back to our function and our data. If we do matches for '05', then as we said earlier, we’re pulling up records that have '05' as the day, rather than as the month. We can force our match to do the right thing by taking advantage of context.
slide 54 If we want to match a month…
slide 55 …there should be a dash '-' before and after the numbers. So if we try to match '-05-'…
slide 56 …we show no matches, which is the correct answer: we don’t have any readings in this sample of our data set for May.
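Here is a sketch of using the surrounding dashes as context, with invented records in which no reading was taken in May.

```python
import re

# Hypothetical sample records: site<TAB>date<TAB>reading.
readings = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
    "Baker 1\t2011-01-05\t1410.0",
]

# Bare '05' also matches a day field.
day_hits = [r for r in readings if re.search("05", r)]

# '-05-' only matches when '05' sits between two dashes, i.e. in the month slot.
month_hits = [r for r in readings if re.search("-05-", r)]

print(len(day_hits), len(month_hits))
```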
slide 57 Matching is all well and good, but what we really want to do is extract data: we want to pull the year, the month, and the day out of our data set so that we can reformat them.
slide 58 When a regular expression matches a piece of text, the regular expression library remembers what matched against every parenthesized sub-expression. Parentheses aren’t just used for grouping: they’re also used to remember things.
slide 59 Here’s an example.
slide 60 The pattern to match years, '(2009|2010|2011)', has been put in parentheses. This will match 2009, or 2010, or 2011, but it will remember which of those it matched.
slide 61 The second string is just the first record from our data.
slide 62 (If you recall, '\t' represents a tab.)
slide 63 When re.search is called, it returns a match object if a match is found.
slide 64 If no match is found it returns None, meaning, “There’s no useful information.”
slide 65 The expression match.group returns the text that matched a particular parenthesized sub-expression. For example, match.group(1) returns whatever matched against the pattern inside the first pair of parentheses counting from the left.
slide 66 It’s important to note that the first sub-expression is extracted with match.group(1), the second with 2, and so forth. When we’re looking at groups, we count from 1 to N, rather than from 0 to N-1, as is normal in the rest of Python.
slide 67 The reason for this is that match.group(0) returns all of the text that the entire pattern matched.
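The group mechanics described above can be sketched as follows. The record's date matches the one in the narration; the site name and reading are assumed for illustration.

```python
import re

# Hypothetical first record: site<TAB>date<TAB>reading.
record = "Baker 1\t2009-11-17\t1223.0"

match = re.search("(2009|2010|2011)", record)
if match:
    print(match.group(1))  # text matched by the first parenthesized group
    print(match.group(0))  # text matched by the whole pattern

# When nothing matches, re.search returns None instead of a match object.
no_match = re.search("(2009|2010|2011)", "Baker 1\t1999-11-17\t1223.0")
print(no_match)
```

Because the whole pattern is a single group here, group 0 and group 1 hold the same text; they differ as soon as the pattern has anything outside the parentheses.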
slide 68 What if we want to match the month as well as the year? A regular expression to match legal months would be '01' or '02' or '03' and so forth all the way up to '12'.
slide 69 The expression to match the day would be three times longer. This is pretty cumbersome: it’s hard to type, and more importantly, hard to read.
slide 70 In a regular expression, you can use a dot—the period character '.'—to match any single character.
slide 71 So the expression '....-..-..' matches any four characters—exactly four characters—followed by a dash, followed by two more characters, followed by another dash, followed by two more characters.
slide 72 If we put each set of dots in parentheses, we should get out three groups recording the year, month, and day every time there’s a successful match.
slide 73 Let’s test that out. Here, we’re calling re.search with the pattern we just described and the first record from our data. When we print out match.group(1), 2, and 3, sure enough, we get '2009', '11', and '17', just as we wanted.
slide 74 Try doing that with substring searches.
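The dot pattern with three groups might be exercised like this. As before, only the date in the record comes from the narration; the rest of the record is an assumption.

```python
import re

# Hypothetical first record: site<TAB>date<TAB>reading.
record = "Baker 1\t2009-11-17\t1223.0"

# Four characters, a dash, two characters, a dash, two characters,
# with each run of dots captured in its own group.
match = re.search("(....)-(..)-(..)", record)

year, month, day = match.group(1), match.group(2), match.group(3)
print(year, month, day)

# A record in the second notebook's style has no dashes, so nothing matches.
other = re.search("(....)-(..)-(..)", "Davison/May 23, 2010/1724.7")
print(other)
```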
slide 75 To recapitulate, letters and digits in a pattern match against themselves: the character 'A' in a pattern matches the character 'A' in the data, and so forth.
slide 76 Vertical bar '|' means OR…
slide 77 …dot '.' matches any single character…
slide 78 …we use parentheses '()' to enforce grouping…
slide 79 …re.search returns a match object if the pattern matches, and None if there isn’t a match…
slide 80 …and if a match was found, match.group(k) is the text that matched group k.
slide 81 More generally, stepping back from the details of regular expressions…
slide 82 …the right way to build up patterns is to start with something simple that matches part of the data you’re working with.
slide 83 Test it against your data, but also test that it doesn’t match things that it shouldn’t, because it can be very hard to track down false positives.
slide 84 Once you’ve done that, extend it piece by piece to handle other cases.
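One possible way to follow this advice is to keep two small lists, records a pattern should match and records it should not, and check each candidate pattern against both. The helper name `check` and the records are hypothetical.

```python
import re

# Hypothetical records: the goal is to select readings from June or July.
should_match = [
    "Baker 1\t2010-06-24\t1122.7",
    "Baker 2\t2009-07-24\t2819.0",
]
should_not_match = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 1\t2011-01-05\t1410.0",
]

def check(pattern):
    """Return (missed, false_positives) for a candidate pattern."""
    missed = [s for s in should_match if not re.search(pattern, s)]
    false_positives = [s for s in should_not_match if re.search(pattern, s)]
    return missed, false_positives

print(check("06|7"))      # too loose: also catches a stray 7
print(check("-0(6|7)-"))  # month digits anchored between dashes
```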
slide 85 We’ll take a look at how to do more of this in the next episode.
