Up: Regular Expressions

Mechanics

slide 01 Hello, and welcome to the third episode of the Software Carpentry lecture on regular expressions. In this episode, we’ll take a look at the mechanics of regular expressions, i.e., what the computer actually does when it’s matching a regular expression against a piece of text. You don’t need to know this in order to do a lot with regular expressions, but if you are composing or debugging complex REs, it helps to understand what’s going on behind the scenes.
slide 02 If you recall, we have some notebooks with recordings of background evil levels in millivaders, taken in the Shire a couple of years after the Death Star exploded. The data in these notebooks is in different formats…
slide 03 …and we’re using regular expressions to parse them.
slide 04 In the last episode, one of the regular expressions we came up with was this complicated beast: '(.+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.+)' It matches one or more characters followed by a literal slash ‘/’, followed by a single upper-case character and one or more lower-case characters, followed by a space, then one or two digits, an optional comma, another space, exactly four digits, another literal slash, and one or more characters at the end. That’s a pretty complex match, and you might be asking yourself…
slide 05 How does it do it?
slide 06 The answer is that regular expressions are implemented using finite state machines.
slide 07 Here’s a very simple finite state machine that matches exactly, and only, the single character “lower case a”.
slide 08 We start matching with the incoming arrow on the left. It takes us to the first state in our finite state machine.
slide 09 The only way to get from there to the second state is to match exactly one character, because the arc between state 1 and state 2 is labelled with the character ‘a’.
slide 10 The second state is marked with a dot in the middle, which means it’s an end state. We must be in one of these states at the end of our match in order for the match to be valid.
slide 11 So now we have a finite state machine that matches the very simple regular expression ‘a’. Let’s see if we can do something a little more interesting.
slide 12 Here’s a finite state machine that matches one ‘a’ followed by zero or more ‘a’s.
slide 13 The first ‘a’ gets us from an initial state to an end state. But we don’t have to stop there: the curved arc at the top allows us to match another ‘a’, and brings us back to the same state.
slide 14 We can then match another ‘a’, and another ‘a’, and so on indefinitely.
slide 15 Note that we don’t have to stop in the end state the first time we reach it: we just have to be in an end state when we run out of input.
slide 16 What is the pattern? It’s ‘a+’: one ‘a’, followed by zero or more other ‘a’s, is the same as one or more ‘a’s.
slide 17 Here’s a finite state machine that matches against the letter ‘a’ or nothing.
slide 18 The top arc isn’t marked, so that transition is free. We can go from the first state to the second state without consuming any of our input.
slide 19 This is “a or nothing”…
slide 20 …which is the same as ‘a?’, i.e., the optional character ‘a’.
slide 21 So now we have three patterns in our library.
slide 22 This regular expression looks like the one that matches one or more ‘a’s…
slide 23 …except there’s an extra arc to get us from the first state to the second without consuming any input.
slide 24 So this will match ‘a*’, i.e., nothing at all (taking that free transition from the first state to the second) or one or more ‘a’s.
slide 25 As you can see, we’re building up a library of more and more complex patterns. The tools we’ve seen so far are enough to implement most of the regular expressions we’ve encountered in the previous two episodes.
slide 26 Take a look, for example, at this finite state machine. What is the corresponding regular expression? We can either take the top route or the bottom. The top route is ‘a+’; the bottom route is a ‘b’, followed by either a ‘c’ or a ‘d’.
slide 27 So this is 'a+|(b(c|d))'. An input string that matches any of these paths will leave us in that final end state.
slide 28 The most important thing about a finite state machine is that the action it takes at a node depends on only…
slide 29 …the arcs out of that node…
slide 30 …and the characters in the target data.
slide 31 Finite state machines have no memory: they do not keep track of how they got into a state. All decision-making is local to that state.
slide 32 This means that there are many kinds of data that regular expressions cannot match. For example, it is impossible to write a regular expression to check if nested parentheses match.
slide 33 If you want to see whether ‘(((…)))’ is balanced, you need some kind of memory…
slide 34 …or at least a counter, and there isn’t any memory in a finite state machine.
slide 35 Similarly, if you want to check whether a word contains each vowel only once, the only way to do it is to write a regular expression with 120 clauses, because you need to check for each possible permutation of ‘aeiou’ independently. You cannot write a regular expression that matches an arbitrary vowel, and then subtracts that vowel from the set of vowels yet to be matched, because once again, that would require some kind of memory, and finite state machines don’t have any.
slide 36 If regular expressions are limited this way, why do we use them? There are two answers.
slide 37 The first is, they’re really fast.
slide 38 After you do some pre-calculation—essentially, after you compile the regular expression to create a finite state machine—a regular expression can be matched against input by looking at each input character only once. That means the execution time only grows with the size of the data. The time required for many other matching techniques grows much faster than the size of the input data. So, while regular expressions are limited, the things they do, they do very, very well.
slide 39 Also, regular expressions are readable. Going back to our previous example, you might not think so,
slide 40 But imagine writing out a piece of code that matched that same input. The regular expression is much easier to read, and probably much more efficient, than its procedural equivalent.
slide 41 Regular expressions can do a lot more than the simple things we’ve seen so far.
slide 42 We’ll explore their power in the next episode.

  1. Terri Yu
    March 7th, 2011 at 22:52 | #1

    This was fun. Finite state machines are a neat topic. Thank you.