Up: Sets and Dictionaries

Tuples

Note: this is the shorter and less detailed explanation of this material; if you would rather dive into the details, please see the longer and more detailed version.

slide 01 Hello, and welcome to the shorter version of the second episode of the Software Carpentry lecture on sets and dictionaries. In this episode, we’re going to explain why you cannot put a list in a set, and show you what you should do instead.
slide 02 Let’s start by trying an experiment.
slide 03 Create a set, add “a string” to it, and print the set. As expected, the string is in the set.
slide 04 Now let’s try adding a list to the same set.
slide 05 Whoops: why doesn’t that work?
slide 06 And what does that word “unhashable” mean?
slide 07 In order to understand what’s going on, we have to take a look at how sets are stored in a computer’s memory.
slide 08 When you create a set, the computer allocates a blob of memory to store references to the set’s elements.
slide 09 When you add something to the set, or try to look something up, the computer uses a hash function to figure out where to look. A hash function is any function that turns data values into a single integer that can be used as an index into another data structure.
slide 10 For example, let’s take a look at how a string is stored in a set.
slide 11 We’ll start with the string “zebra”.
slide 12 The string has five letters: ‘z’, ‘e’, ‘b’, ‘r’, and ‘a’.
slide 13 Each character is stored in memory as a small integer: 97 for lower-case ‘a’, 98 for lower-case ‘b’, and so on up to 122 for lower-case ‘z’.
slide 14 We can add up these integers to produce a number that will be the same for every copy of the string. This is our hash function.
slide 15 Once we have this integer, we can use it to figure out where in the hash table the string should be stored.
slide 16 Now let’s take a look at how a list would be stored, if we were allowed to store lists in sets.
slide 17 For the sake of this example, let’s assume that the list contains the same five characters.
slide 18 So it’s represented like this.
slide 19 For our hash function, we can add up the characters’ values as before.
slide 20 The final picture of what’s in memory looks similar to what we had when we stored a string.
slide 21 But what happens if we now change the contents of the list?
slide 22 Suppose, for example, that we change the first letter from ‘z’ to ‘s’.
slide 23 The hash function’s value is now 523 instead of 532.
slide 24 Which means that the modified list belongs in a different place in the set.
slide 25 This is bad—very bad. If we now ask, “Is the list containing ‘s’, ‘e’, ‘b’, ‘r’, and ‘a’ in the set S?” the answer will be “no”, because the reference to the list isn’t stored in the location that our hash function tells us to look. It’s as if we changed our name from “Tom Riddle” to “Lord Voldemort”, but left all the personnel records filed under ‘R’.
slide 26 This problem arises with any mutable structure, i.e., anything whose contents or value can be changed after its creation. Integers and strings are safe, since their values are fixed, but the whole point of lists is that we can grow them, shrink them, and overwrite their contents. What should we do?
slide 27 One option would be to have each list keep track of the sets that it is in, and move itself whenever its values change.
slide 28 However, this would be very expensive: every time our program touched a list, Python would have to recalculate its hash code and update all the sets that held references to the list.
slide 29 Option number two is to shrug our shoulders and say, “It’s the programmer’s fault.” This is what most languages do…
slide 30 …but it’s also very expensive. This time, though, the time that’s wasted is the programmer’s time, tracking down and fixing bugs.
slide 31 Python uses a third option. It only allows programmers to put immutable values in sets.
slide 32 After all, if something’s value can’t change, neither can its hash code, or its location in a hash table.
slide 33 In practice, this turns out to be a fairly minor restriction: occasionally annoying, but never a show-stopper.
slide 34 This is fine for basic types like integers and strings, which are immutable.
slide 35 But what do we do with values that naturally have several parts, like people’s names?
slide 36 Again, there are several options. The first is to concatenate those values somehow.
slide 37 For example, if we want to store “Charles” and “Darwin”, we’d create the string “Charles|Darwin”, and store that.
slide 38 We have to use a character like ‘|’ instead of something more natural, like a space, because if we join “Paul Antoine” and “St. Cyr” using a space, there would be three possible ways to split it apart again.
slide 39 Concatenating values is actually a pretty bad idea. First, we have to find a concatenator that can never come up in our data—essentially, make a bet on what’s going to happen in the future.
slide 40 Second, our code will wind up being littered with string joins and string splits, which will make it slower and harder to read.
slide 41 The second option—the right one—is to use tuples instead of lists.
slide 42 A tuple is just an immutable list…
slide 43 …i.e., a sequence of values that cannot be changed after its creation.
slide 44 Tuples are created exactly like lists…
slide 45 …except we use parentheses instead of square brackets.
slide 46 They are indexed the same way, too, and functions like len do exactly what you’d expect.
slide 47 But you cannot assign a new value to a tuple element, i.e., you cannot change the tuple after it has been created.
slide 48 This means that a tuple’s hash code never changes, and that means that tuples can be put in sets. We’ll see other uses of tuples in later lectures.
slide 49 Tuples in sets are very useful, but one thing beginners often trip over is that you cannot look up partial values in tuples.
slide 50 For example, there’s no way to say, “Give me all the tuples in this set that end with the name ‘Darwin’.”
slide 51 The next episode of this lecture will introduce another data structure that allows you to do something very much like this.
slide 52

  1. No comments yet.