Text

slide 001 Hello, and welcome to the thirteenth episode of the Software Carpentry lecture on Python. In this episode, we’ll step back from the language itself and ask, what is text data?
slide 002 Let’s start with a simple question: how should a computer represent single characters?
slide 003 For American English in the 1960s, the solution was simple:
slide 004 there are 26 letters, which have upper and lower case representations…
slide 005 …ten digits…
slide 006 …some punctuation…
slide 007 …and a few special “characters” for controlling the teletype terminals of the period (meaning “go to a new line”, “move back to the start of the line”, “start a new page”, “ring the bell”, and so on).
slide 008 There were fewer than 128 of these, so the ASCII committee standardized on an encoding that used 7 bits per character.
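As a quick check of the 7-bit claim, here is a minimal sketch using Python's built-in ord function, which gives the integer code for a character. The variable names are only for illustration.

```python
# ord() maps a character to its integer code; chr() maps it back.
letter_code = ord("A")        # 65
digit_code = ord("7")         # 55
newline_code = ord("\n")      # 10, "go to a new line"
return_code = ord("\r")       # 13, "move back to the start of the line"
bell_code = ord("\a")         # 7, "ring the bell"

# Every ASCII code is below 2**7 == 128, so 7 bits are enough.
all_codes = [letter_code, digit_code, newline_code, return_code, bell_code]
fits_in_7_bits = all(code < 2 ** 7 for code in all_codes)

print(all_codes)
print(fits_in_7_bits)
```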
slide 009 Next question: how should text containing many characters be represented?
slide 010 The first choice, which was dictated by the punch card technology of the 1940s and 1950s, was to use fixed-width records, in which each line was exactly the same length.
slide 011 For example, a computer would lay out this haiku…
slide 012 …in three records as shown here (where the dot character means “unused”).
slide 013 This representation makes it easy to skip forward or backward by N lines, since each is exactly the same size…
slide 014 …but it may waste space…
slide 015 …and no matter what maximum length we choose, we’ll eventually have to deal with lines that are longer.
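The trade-off can be sketched in a few lines of Python. This is only an illustration: RECORD_WIDTH, to_records, and get_record are made-up names, and the three lines of text are placeholders rather than the haiku on the slide.

```python
RECORD_WIDTH = 12   # hypothetical maximum line length
PAD = "."           # the dot stands for "unused", as on the slide


def to_records(lines, width=RECORD_WIDTH):
    """Pad each line to exactly `width` characters and join them."""
    for line in lines:
        if len(line) > width:
            # The weakness of the scheme: longer lines simply do not fit.
            raise ValueError("line longer than record width: %r" % line)
    return "".join(line.ljust(width, PAD) for line in lines)


def get_record(data, n, width=RECORD_WIDTH):
    """Fetch record n directly: it always starts at offset n * width."""
    return data[n * width:(n + 1) * width].rstrip(PAD)


lines = ["first line", "second", "third one"]   # placeholder text
data = to_records(lines)
print(data)
print(get_record(data, 2))
```

Skipping to record N is a single multiplication, but half of the second record is padding, and a 13-character line cannot be stored at all. (This sketch also strips any dots that really were at the end of a line, which a real format would have to handle.)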
slide 016 Over time, most programmers switched to a different representation, in which text is just a stream of bytes, some of which mean “the current line ends here”.
slide 017 With this representation, our haiku would be stored like this, where the gray cells mean “end of line”.
slide 018 This is more flexible…
slide 019 …and wastes less space…
slide 020 …but skipping forward or backward by N lines is harder, since each one might be a different length…
slide 021 …and of course, we have to decide what to use to mark the ends of lines.
slide 022 Unfortunately, different groups picked different things. On Unix, the end of line is marked by a single newline character, which is written '\n'.
slide 023 On Windows, the end of line is marked with a carriage return followed by a newline, which is written '\r\n'.
slide 024 Most editors can detect and handle the difference, but it’s still annoying for programmers, who need to be able to handle both.
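One way to handle both conventions, sketched here with made-up sample text, is the splitlines method of Python strings, which recognizes '\n' and '\r\n' alike.

```python
unix_text = "one\ntwo\nthree\n"
windows_text = "one\r\ntwo\r\nthree\r\n"

# splitlines() recognizes both conventions and drops the markers.
unix_lines = unix_text.splitlines()
windows_lines = windows_text.splitlines()
print(unix_lines == windows_lines)

# Splitting on '\n' alone leaves a stray '\r' on every Windows line.
naive_split = windows_text.split("\n")
print(repr(naive_split[0]))
```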
slide 025 Python tries to help by converting '\r\n' to '\n' when it’s reading data from a file on Windows, and converting the other way when it’s writing. This is the right behavior for text…
slide 026 …but if you’re reading an image, an audio file, or some other binary file that might just happen to contain the byte values for '\r' and '\n' next to each other, you definitely don’t want this conversion to happen. To prevent it, you must open the file in binary mode.
slide 027 To do this, put the letter 'b' after the 'r' or 'w' when you call open, as shown here.
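Here is a small sketch of binary mode in use. The file name sample.bin is hypothetical, and the payload is the eight-byte signature that starts every PNG file, which happens to contain '\r\n' in the middle.

```python
import os
import tempfile

payload = b"\x89PNG\r\n\x1a\n"   # PNG signature: contains '\r' then '\n'

folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.bin")   # hypothetical file name

with open(path, "wb") as writer:     # 'b' after 'w': no newline conversion
    writer.write(payload)

with open(path, "rb") as reader:     # 'b' after 'r': bytes come back untouched
    round_trip = reader.read()

print(round_trip == payload)

# Tidy up the temporary file and folder.
os.remove(path)
os.rmdir(folder)
```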
slide 028 Now, back to characters…
slide 029 ASCII is fine for the digit 2, the letter ‘q’, or a circumflex ‘^’, but how should we store ‘ĕ’, ‘β’, or ‘Я’?
slide 030 Well, 7 bits gives us the numbers from 0 to 127…
slide 031 …but an 8-bit byte can represent numbers up to 255, so why not extend the ASCII standard to define meanings for those “extra” 128 numbers?
slide 032 Unfortunately, everyone did, but in different and incompatible ways.
slide 033 The result was a mess: if a program assumed characters were encoded using Spanish rules when they were actually encoded in Bulgarian, what it got was gibberish.
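To see the problem in miniature, the sketch below stores a Spanish word using Latin-1 and then reads it back using ISO 8859-5, one of the 8-bit Cyrillic encodings. Picking these two encodings is an assumption made for the example; the same thing happens with any mismatched pair.

```python
original = u"ma\u00f1ana"                 # Spanish word containing n-tilde
stored = original.encode("latin-1")       # one byte per character
misread = stored.decode("iso8859-5")      # same bytes, Cyrillic rules

print(repr(original))
print(repr(misread))
print(misread == original)
```

No error is raised: the bytes are perfectly legal under both sets of rules, so the program has no way to notice that the n-tilde has turned into a Cyrillic letter.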
slide 034 And setting that aside, many languages—particularly those of East Asia—use a lot more than 256 distinct symbols.
slide 035 The solution that emerged in the 1990s is called the Unicode standard.
slide 036 It defines integer values to represent tens of thousands of different characters and symbols…
slide 037 …but does not define how to store those integers in a file, or as a string in memory.
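Those integer values are easy to inspect with ord. The sketch below writes the three non-ASCII characters from a moment ago as \u escapes so that the source stays plain ASCII; the u prefix is needed in Python 2 and is accepted, though optional, in Python 3.3 and later.

```python
e_breve = u"\u0115"   # Latin small letter e with breve
beta = u"\u03b2"      # Greek small letter beta
ya = u"\u042f"        # Cyrillic capital letter Ya

code_points = [ord(ch) for ch in (e_breve, beta, ya)]
print(code_points)

# The old ASCII characters keep the values they always had.
ascii_points = [ord(ch) for ch in (u"2", u"q", u"^")]
print(ascii_points)
```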
slide 038 The simplest choice would be to switch from using an 8-bit byte for each character to using a 32-bit integer…
slide 039 …but that would waste a lot of space for alphabetic languages like English, Estonian, and Brazilian Portuguese.
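The waste is easy to measure. This sketch uses Python's "utf-32-le" codec as a stand-in for "one 32-bit integer per character"; the sample text is arbitrary.

```python
text = u"Hello, world"

as_ascii = text.encode("ascii")        # one byte per character
as_utf32 = text.encode("utf-32-le")    # four bytes per character, no byte-order mark

print(len(text))
print(len(as_ascii))
print(len(as_utf32))
```

For text like this, three out of every four bytes in the 32-bit version are zero.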
slide 040 Despite this, 32 bits per character is often used in memory, where access speed is important…
slide 041 …but most programs and programmers use something else when saving data to a file or sending it over the Internet.
slide 042 That something else is (almost) always an encoding called UTF-8, which uses a variable number of bytes per character.
slide 043 For backward compatibility’s sake, the first 128 characters (i.e., the old ASCII character set) are stored in one byte each.
slide 044 The next 1920 characters are stored using two bytes each, the next 61,000-odd in three bytes each, and so on.
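These ranges can be checked by encoding one character at each boundary and counting the bytes. A minimal sketch (the labels and variable names are just for illustration):

```python
# One sample character at each boundary of the UTF-8 length ranges.
boundaries = [
    (u"\u007f", "last one-byte character"),
    (u"\u0080", "first two-byte character"),
    (u"\u07ff", "last two-byte character"),
    (u"\u0800", "first three-byte character"),
]
for char, label in boundaries:
    print("%s: %d byte(s)" % (label, len(char.encode("utf-8"))))

byte_counts = [len(char.encode("utf-8")) for char, _ in boundaries]

# How many values fall in the two-byte and three-byte ranges.
two_byte_total = 0x07ff - 0x0080 + 1             # 1920
# 2048 values (0xD800 to 0xDFFF) are reserved and never used for characters.
three_byte_total = 0xffff - 0x0800 + 1 - 2048    # 61440
print(two_byte_total)
print(three_byte_total)
```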
slide 045 If you’re curious, the way this works is shown…
slide 046 …in…
slide 047 …this…
slide 048 …table…
slide 049 …but you don’t have to know or care.
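For the curious, the bit-shuffling can also be written out by hand. The function below, utf8_bytes, is a made-up name for a teaching sketch of the standard UTF-8 bit layout: it does no validation, and in real code you would simply call the built-in encode method, which the last lines use as a cross-check.

```python
def utf8_bytes(code_point):
    """Return the UTF-8 byte values for one code point as a list of ints."""
    if code_point < 0x80:        # 0xxxxxxx
        return [code_point]
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]


# Cross-check against Python's own encoder for a few sample characters.
for char in (u"q", u"\u03b2", u"\u20ac"):
    by_hand = utf8_bytes(ord(char))
    built_in = list(bytearray(char.encode("utf-8")))
    print(by_hand == built_in)
```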
slide 050 What you do have to know these days is that Python 2.* provides two kinds of strings.
slide 051 A “classic” string uses one byte per character, just as it always did.
slide 052 A “Unicode” string, by contrast, uses enough memory per character to store any kind of text.
slide 053 Unicode strings are indicated by putting a lower-case ‘u’ in front of the opening quote.
slide 054 If we want to convert a Unicode string to a string of bytes, we must specify an encoding.
slide 055 You should always use UTF-8 unless you have a very, very good reason to do something else.
slide 056 And even then, you should think twice.
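Putting the last few points together, here is a minimal sketch of the round trip between a Unicode string and bytes. The sample word is arbitrary. The code is written to run on Python 2.7; it should also run on Python 3.3 and later, where every string is already Unicode and the u prefix is optional.

```python
greeting = u"caf\u00e9"                  # Unicode string: 4 characters

encoded = greeting.encode("utf-8")       # string of bytes: 5 bytes
decoded = encoded.decode("utf-8")        # back to a Unicode string

print(len(greeting))
print(len(encoded))
print(decoded == greeting)

# An encoding that cannot represent a character refuses to guess.
try:
    greeting.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```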
