Up: Regular Expressions

More Tools

slide 01 Hello, and welcome to the fifth and final episode of the Software Carpentry lecture on regular expressions. In this episode, we’re going to work through a moderately complex problem using regular expressions, and introduce a few more tools in the regular expression library.
slide 02 Our starting point is an archive of several thousand papers and theses written in LaTeX, a text-based document formatting program.
slide 03 Our documents look like this.
slide 04 They all use the same labels to refer to items in a shared bibliography.
slide 05 Our job is to find out how often citations appear together, i.e., how often paper X is cited in the same document as paper Y.
slide 06 As a first step, we need to extract the set of citation labels from each document, and that’s the problem we’ll tackle in this episode.
slide 07 Let’s have a closer look at our input. In LaTeX, citations are written using ‘\cite’, with cross-reference labels in curly braces.
slide 08 A single citation can include two or more labels separated by commas.
slide 09 There may be white space before or after labels.
slide 10 There can even be line breaks where a citation is split across two lines.
slide 11 And there can be multiple citations per line.
slide 12 Our first idea is to use a group to capture everything inside the curly braces following the word ‘cite’. It seems to work in one simple case.
slide 13 But what if there are multiple citations on a single line?
slide 14 It looks like we’re capturing the text between the citations.
slide 15 The reason is that regular expression matching is greedy: it matches as much text as it can, and the ‘.’ in ‘.+’ will match all the characters from the first curly brace to the last one, including the intervening citations and curly braces.
slide 16 Luckily, the diagnosis of our problem suggests its solution: let’s have the regular expression match everything except a closing curly brace.
slide 17 This is remarkably easy to do: if the first character in a set in square brackets is the circumflex ‘^’, then the set is negated, i.e., it matches everything except the characters in the set. The expression ‘[^}]‘ therefore matches every character except a closing curly brace.
slide 18 This works for a single citation.
slide 19 All we’ve done is change ‘.’ to the negated set.
slide 20 What about multiple citations on a single line?
slide 21 Well, it’s not gobbling up text we don’t want it to, but it’s only capturing the first citation.
slide 22 Somehow, we need to extract all matches, not just the first.
slide 23 Luckily, the regular expression library has a function to do exactly this: if we use re.findall instead of re.search, it will give us back a list of all the substrings that matched our pattern.
slide 24 Remember, whatever your problem is, someone has probably run into it before, and there’s probably something in the library to help you. Knowing what’s in the library is as important to a programmer as knowing what’s in the literature is to a scientist.
slide 25 Let’s give findall a try. It seems to produce the right output.
slide 26 Not bad for a 7-character change.
slide 27 OK, what about spaces in citations?
slide 28 The good news is, nothing breaks. The bad news is, the spaces are saved by findall, which isn’t really what we want.
slide 29 We could tidy this up after the fact using string.strip.
slide 30 But let’s modify the pattern instead.
slide 31 This modification solves half our problem.
slide 32 If you recall, ‘\s’ is an abbrevation for the set of whitespace characters, so these uses of ‘\s*’ match zero or more spaces immediately after the opening curly brace, or immediately before the closing one.
slide 33 However, the space after the ‘Y’ is still being returned in the matched text. Once again, the problem is that regular expressions are greedy: the space after the ‘Y’ isn’t a closing curly brace, so it’s matched by the negated character set, and included in the returned string. The ‘\s*’ that’s supposed to match the trailing space is then matched against zero characters instead of one. It’s not what we want, but it’s legal.
slide 34 OK, so let’s force our match to line up with the break from word to non-word characters using ‘\b’.
slide 35 It works!
slide 36 The change is to put ‘\b’ after the first unwanted spaces, and before the last ones. Since the curly braces around the citation labels are also non-word characters, the pattern matches even if there aren’t any opening or trailing spaces.
slide 37 The last hurdle is to handle multiple labels inside a single pair of curly braces.
slide 38 The pattern we’ve built so far doesn’t explode when there are two or more labels—it even handles spaces after the commas. But it returns all those labels as a single lump of text.
slide 39 We actually could write a pattern that would break everything up on commas, but it would need some very advanced features of the regular expression library, and would be very difficult to read.
slide 40 Instead, let’s use another basic function, re.split, to separate multiple labels. re.split does the same thing as string.split, but unlike its simpler cousin, it breaks things up everywhere that a pattern matches.
slide 41 The best way to show re.split in action is to write the function we originally set out to create. Let’s start with a skeleton that includes some test data, a function that does nothing (but doesn’t just fail), and a couple of lines that call that function and display the result. So far, so good.
slide 42 Now let’s write our function. For readability’s sake, we’ll put our patterns at the top and give them memorable names. Inside the function, we’ll pull out all the citations using the first pattern, then split each result everywhere there’s a comma with optional spaces before or after it. We’ll stuff all the results into a set, and return that; if no matches were found, that set will be empty.
slide 43 We can use one more trick from the regular expression library to make this function more efficient. Instead of turning the regular expression into a finite state machine over and over again, we can compile the regular expression and save the resulting object. That object has methods with the same names as the functions we’ve been using from the library, like search and findall, but if we’re using the same pattern over and over again, compiling it once and re-using the compiled object is much faster.
slide 44 As you can see, the changes required are very small: instead of saving the textual representations of our expressions, we compile them, and then instead of calling the top-level functions from the regular expression library, we call the methods of those saved objects.
slide 45 And here’s the result: a set of all the citations in our test data, pulled out with just a dozen lines of code.
slide 46 This is the end of our lecture on regular expressions. We’ll touch on a few other ideas in the exercises, but we hope these five episodes have shown you enough to help you solve some everyday data crunching problems.

  1. No comments yet.