Up: Sets and Dictionaries

Storage

Note: this is the longer and more technical explanation of this material; if you would rather skip the details, please see the shorter and less detailed version.

slide 01 Hello, and welcome to the second episode of the Software Carpentry lecture on sets and dictionaries. In this episode, we’re going to take a look at a little bit of computer science theory—just enough to help you understand how sets work, and what their limitations are.
slide 02 Let’s start by trying an experiment.
slide 03 Create a set, add “a string” to it, and print the set. As expected, the string is in the set.
slide 04 Now let’s try adding a list to the same set.
slide 05 Whoops: why doesn’t that work?
slide 06 And what does that word “unhashable” mean?
slide 07 In order to understand what’s going on, we have to take a look at how sets are stored in a computer’s memory.
slide 08 We could use a list.
slide 09 To create a set, we’d just create an empty list.
slide 10 To check whether a value is in the set, we would then loop over the values in that list, returning True as soon as we found the one we were looking for, or False if we hit the end of the list without finding it.
slide 11 Adding a value to a set would work the same way: we’d loop over the values already in the list and return right away if the new value was already there, or append it to the end of the list if we didn’t find it.
slide 12 But how efficient would this implementation be?
slide 13 With N items in the set, these versions of in and add would take between 1 and N steps to run, depending on how quickly they found the item in question (if at all).
slide 14 With a bit of handwaving, we can say that the “average” time to do something with an N-element set would be N/2 steps.
slide 15 If we change the way we store sets, we can do a lot better—in fact, we can get the time down to a constant, no matter how big the set is.
slide 16 But to get this speedup, we have to accept a few constraints on our program.
slide 17 Let’s start with something simple: storing a set of integers.
slide 18 If the range of possible values in the set is small, and fixed, we could just use a list of Booleans.
slide 19 Here, True at index i means “the integer ‘i’ is in the set”, and False means it isn’t.
slide 20 But what if we want to store any integer at all? It isn’t practical to create a list of 232 Booleans.
slide 21 The solution is to use one of the great inventions of computer science, called a hash table. This is just a list of some fixed length, which we’ll call L.
slide 22 When we want to store an integer I, we calculate I mod L, and put it there.
slide 23 Remember, ‘%’ in Python is mod, the remainder operator. I mod L is always in the range from 0 to L minus 1, which is concidentally the range of legal indices for a list of length L, so this calculation always produces a valid index.
slide 24 Here’s an example. Our list has five slots, with indices from 0 to 4. Since 3378 mod 5 is 3, we store 3378 in the third place in our list. 1625 goes in slot zero (since 1625 mod 5 is 0), and 101 goes in slot 1.
slide 25 So far, so good: to add or find a value, we do one simple bit of arithmetic, and look in exactly one place, regardless of how big the set is.
slide 26 But what do we do when there’s a collision?
slide 27 For example, if we want to add 206 to our set, it ought to go in position 1, but that’s already occupied by 101.
slide 28 There are basically two ways to handle this. The first is to search forward from the location the value is supposed to be in until we find an empty slot, and store the value there.
slide 29 The second is to store a list of values in each slot.
slide 30 Each approach has pros and cons that we won’t go into here, and each works well enough until the hash table is about 3/4 full.
slide 31 After that, the time to look up values, or insert new ones, starts to climb rapidly.
slide 32 When this happens, we can get the time back down to a constant (or close to a constant) by enlarging the table.
slide 33 For example, here’s our size-5 table with three values in it.
slide 34 Adding another element would take us over the magic 3/4 threshold…
slide 35 …so instead we’ll resize the hash table so that it has nine slots…
slide 36 …and then insert 206. In essence, we’re spending memory to save time, a tradeoff that comes up again and again in program design.
slide 37 All right, we can store integers. What about strings? Where should they go?
slide 38 To find out, we’ll use a hash function to generate an integer index from the characters in the string. Our function will always produce the same value for any particular string, which means we’ll always know where to look for that string in our hash table.
slide 39 Let’s start with the string “zebra”.
slide 40 It consists of the five characters ‘z’, ‘e’, ‘b’, ‘r’, and ‘a’.
slide 41 Each character is stored in memory as a small integer: 97 for lower-case ‘a’, 98 for lower-case ‘b’, and so on up to 122 for lower-case ‘z’.
slide 42 We can add up these integers to produce a number that will be the same for every copy of the string “zebra”.
slide 43 And once we have this integer, we can use mod as before to figure out where in the hash table the string should be stored.
slide 44 In general, if we can define a hash function for something—i.e., if we can figure out how to turn that “something” into an integer—we can store it in a hash table.
slide 45 But this only works as long as nothing changes behind our backs, which is why we got that strange “unhashable” error message at the start of this episode.
slide 46 Here’s a picture of what’s in memory when we put “zebra” in a hash table. The string’s hash code is 532, and the hash table’s length is 5, so since 532 mod 5 is 2, we put a reference to the string in location 2 in the hash table.
slide 47 Let’s see what happens if we try to put a list in the hash table instead.
slide 48 We’ll start with a list containing the same five characters ‘z’, ‘e’, ‘b’, ‘r’, and ‘a’.
slide 49 Our list is represented in memory like this…
slide 50 …so we add up the characters’ values, take the remainder mod 5…
slide 51 …and put a reference to the list in location 2 in the hash table.
slide 52 This is what’s in memory when we’re done.
slide 53 What happens if we now change the values in the list?
slide 54 We start with this…
slide 55 …then change the first character in the list from ‘z’ to ‘s’.
slide 56 If we recalculate the hash code, this list should be put in location 0…
slide 57 …but it’s actually still in location 2. Our list is the wrong place!
slide 58 This is bad—very bad. If we go looking for the list ['s','e','b','r','a']
slide 59 …we’ll look in location 0 rather than location 2, and get the result False when we should get True.
slide 60 And if we go looking for the original list ['z','e','b','r','a']
slide 61 …we’ll look in slot 2, and either get the wrong answer True
slide 62 …or blow up completely, since there’s a value there, but it’s not the one we asked about.
slide 63 This problem arises with any mutable structure, i.e., anything whose contents or value can be changed after its creation. Integers and strings are safe, since their values are fixed, but the whole point of lists is that we can grow them, shrink them, and overwrite their contents. What should we do?
slide 64 One option would be to have each list keep track of the sets that it is in, and move itself whenever its values change.
slide 65 However, this would be very expensive: every time our program touched a list, Python would have to recalculate its hash code and update all the sets that held references to the list.
slide 66 Option number two is to shrug our shoulders and say, “It’s the programmer’s fault.” This is what most languages do…
slide 67 …but it’s also very expensive. This time, though, the time that’s wasted is the programmer’s time, tracking down and fixing bugs.
slide 68 Python uses a third option. It only allows programmers to put immutable values in sets.
slide 69 After all, if something’s value can’t change, neither can its hash code, or its location in a hash table.
slide 70 In practice, this turns out to be a fairly minor restriction: occasionally annoying, but never a show-stopper.
slide 71 But if we can’t store lists in sets, what do we do with values that naturally have several parts, like people’s names?
slide 72 Again, there are several options. The first is to concatenate those values somehow.
slide 73 For example, if we want to store “Charles” and “Darwin”, we’d create the string “Charles|Darwin”, and store that.
slide 74 We have to use a character like ‘|’ instead of something more natural, like a space, because if we join “Paul Antoine” and “St. Cyr” using a space, there would be three possible ways to split it apart again.
slide 75 Concatenating values is actually a pretty bad idea. First, we have to find a concatenator that can never come up in our data—essentially, make a bet on what’s going to happen in the future.
slide 76 Second, our code will wind up being littered with string joins and string splits, which will make it slower and harder to read.
slide 77 The second option—the right one—is to use tuples instead of lists.
slide 78 A tuple is just an immutable list…
slide 79 …i.e., a sequence of values that cannot be changed after its creation.
slide 80 Tuples are created exactly like lists…
slide 81 …except we use parentheses instead of square brackets.
slide 82 They are indexed the same way, too, and functions like len do exactly what you’d expect.
slide 83 But you cannot assign a new value to a tuple element, i.e., you cannot change the tuple after it has been created.
slide 84 This means that a tuple’s hash code never changes, and that means that tuples can be put in sets. We’ll see other uses of tuples in later lectures.
slide 85 Let’s step back for a moment. This lecture has been about the “science” in computer science.
slide 86 Things like the design of hash tables…
slide 87 …and the tradeoffs between mutability, usability, and performance.
slide 88 It’s a lot to digest in one sitting…
slide 89 …but sometimes the only way to understand why things work the way they do is to understand the theory they’re based on.
slide 90 We promise to get back to practicalities in the next episode.

  1. No comments yet.