Regular Expressions
These episodes introduce regular expressions, a powerful set of tools for manipulating text.
Requires: data processing patterns
Introduces: regular expressions
Problem: Converting More Complex Data Files
Your supervisor wants you to convert some measurements of background evil levels in the Shire after the explosion of the Death Star into a uniform representation. The problem is that those files were typed in by hand by several different graduate students, and are in several slightly different formats. For example, some files use commas as field separators, while others use spaces, and still others use a mix of both. Similarly, some record the dates of observations as “2003-08-05”, while others use “Aug 5, 2008”, and so on.
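To make the problem concrete, here is a minimal Python sketch of the kind of normalization involved. It is not taken from the episodes: the names `split_fields` and `normalize_date` are hypothetical, and it assumes that only the two date styles quoted above occur and that month names are three-letter English abbreviations.

```python
import re

# Assumption: any run of commas and/or whitespace counts as one field separator.
SEPARATOR = re.compile(r"[,\s]+")

# Dates such as 2003-08-05.
ISO_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")

# Dates such as "Aug 5, 2008" (three-letter month, day, four-digit year).
NAMED_DATE = re.compile(r"^([A-Za-z]{3}) (\d{1,2}), (\d{4})$")

# Map lowercase month abbreviations to month numbers (jan -> 1, ..., dec -> 12).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def split_fields(line):
    """Split a record on commas, spaces, or a mix of both."""
    return SEPARATOR.split(line.strip())


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if the format is not recognized."""
    text = text.strip()
    if ISO_DATE.match(text):
        return text  # already in the target form
    match = NAMED_DATE.match(text)
    if match:
        month_name, day, year = match.groups()
        month = MONTHS.get(month_name.lower())
        if month is not None:
            return f"{year}-{month:02d}-{int(day):02d}"
    return None


print(split_fields("1.5, 2.5 3.5,4.5"))   # ['1.5', '2.5', '3.5', '4.5']
print(normalize_date("Aug 5, 2008"))      # 2008-08-05
print(normalize_date("2003-08-05"))       # 2003-08-05
```

Note that a date written as “Aug 5, 2008” itself contains a comma and spaces, so in this sketch the date has to be pulled out of a record before the remaining fields are split.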
- Introduction (pdf, ppt)
- Motivating problem
- Matching with ‘or’
- Precedence
- Extracting data with groups
- Wildcards: ‘.’
- Operators (pdf, ppt)
- Procedural vs. declarative programming
- Zero or more: ‘*’
- One or more: ‘+’
- Zero or one: ‘?’
- Enumerated matches: ‘{M,N}’
- Character sets: ‘[...]’
- Mechanics (pdf, ppt)
- Finite state machines
- Limits of regular expressions
- Patterns (pdf, ppt)
- Using multiple regular expressions together
- Escape sequences: ‘\’
- Translation from text to string to regular expression
- Abbreviations: ‘\s’, ‘\d’, ‘\w’, ‘\S’, and ‘\W’
- Pseudo-characters: ‘^’, ‘$’, and ‘\b’
- More Tools (pdf, ppt)
- New example: extracting citation labels
- Negating character sets with ‘[^...]’
- Using re.findall instead of re.search
- Using re.split
- Compiling regular expressions
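The tools named at the end of this outline (negated character sets, `re.findall`, `re.split`, and compiled patterns) can be sketched together in a few lines of Python. The citation format used here, labels inside square brackets separated by commas, is an assumption made for illustration, as are the sample sentence and the function name `extract_labels`; the episode itself may use a different format.

```python
import re

# Compile the pattern once so it can be reused for many lines of text.
# [^\]]+ is a negated character set: one or more characters that are not ']'.
BRACKETED = re.compile(r"\[([^\]]+)\]")

# Labels inside one pair of brackets are assumed to be comma-separated.
LABEL_SEPARATOR = re.compile(r",\s*")


def extract_labels(text):
    """Return every citation label found in text, in order of appearance."""
    labels = []
    # findall returns the captured group for every match, not just the first.
    for group in BRACKETED.findall(text):
        labels.extend(LABEL_SEPARATOR.split(group))
    return labels


sentence = "Evil levels rose sharply [Frodo2003], then fell [Vader1977, Sam2004]."

# search stops at the first match only.
print(BRACKETED.search(sentence).group(1))   # Frodo2003

# findall plus split recovers every label.
print(extract_labels(sentence))              # ['Frodo2003', 'Vader1977', 'Sam2004']
```

Compiling makes the most difference when the same pattern is applied to many lines, for example inside a loop over a file.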
Reading
- Practical Programming
- Regular Expressions Cookbook
- Mastering Regular Expressions
- Data Crunching: Solve Everyday Problems Using Java, Python, and More