Up: Sets and Dictionaries

Introduction

slide 01 Hello, and welcome to the first episode of the Software Carpentry lecture on sets and dictionaries, two ways of storing data that can make many programs both simpler and faster. In this episode we’ll have a look at sets, which are the simpler of the two.
slide 02 Most programmers meet lists (or arrays) early in their careers, and use them for almost everything thereafter. However, they are not the only way to store multiple values in a program.
slide 03 In mathematics, you’re far more likely to encounter sets, and these come up in many algorithms as well.
slide 04 A set is an unordered collection of distinct items.
slide 05 The word “collection” means that a set can hold zero or more values.
slide 06 The word “distinct” means that any particular value is either in the set or not: a set can’t store two or more copies of the same thing.
slide 07 And finally, “unordered” means that values are simply “in” the set. They’re not in any particular order, and there’s no first value or last value.
slide 08 This third part is what novices trip over the most: the first few times people use sets, they tend to expect that values are stored in alphabetical or numerical order, or in the order they were added to the set, but they’re not.
slide 09 Sets were a relatively late addition to Python: the language was already ten years old when they put in.
slide 10 That’s still better than most languages, though. Despite how often sets come up in various algorithms, most languages still don’t provide them as a built-in type.
slide 11 If you’re using Python 2.6 or earlier, you create a set by calling set and passing it a list of the values you want in the set.
slide 12 In version 2.7 or later, you can simply write down the set as you would in a math class.
slide 13 However, in both versions, you have to call set to create an empty set…
slide 14 …because when sets were added to the language, empty curly braces—the mathematical notation for an empty set—were already being used for something else (which we’ll see in a couple of episodes).
slide 15 We’ll use the Python 2.7 notation—the more natural one—in this lecture.
slide 16 As I said a moment ago, the values in a set are distinct, i.e., any value is simply present or absent. This means that sets can easily be used to get rid of duplicate values.
slide 17 For example, suppose we want to find out what letters are used in a word. If we create a set, then add the characters in the word to the set one by one, the set will automatically discard duplicates. When we print the set out, what we get are the unique values.
slide 18 Notice that these values aren’t printed in alphabetical order, or in the order they were added to the set.
slide 19 This is because set elements are not ordered, and it’s important to keep this in mind when using them.
slide 20 Here’s a much shorter way to find the distinct letters in a word:
slide 21 Just create a set, passing the word in as an argument.
slide 22 This works because Python knows how to loop over the characters in a string the same way that it loops over the items in a list. In general, if you can loop over it, you can build a set from it.
slide 23 What you can’t do is build a set from several distinct values in one go.
slide 24 Many novices try to build a set like this, but Python insists that all the initial values be in a single container that can be looped over, like a list or string.
slide 25 To show you what we can do with sets, let’s create three holding the integers 0 through 9, the first half of that same range of numbers (0 through 4), and the odd values 1, 3, 5, 7, and 9. Notice that in the first case, the function range is returning a list of ten integers, which is passed directly into set to create the set we want.
slide 26 We’ve already seen how to add elements using set’s add method.
slide 27 We can also empty out a set by calling its clear method. Just like the methods of strings and lists, we call a method using object dot method name instead of a “naked” function name.
slide 28 We can use other methods to find the difference between two sets…
slide 29 …or their intersection. In each case, the method creates and returns a new set instead of modifying either the set the method is called for, or the one passed in as an argument.
slide 30 issubset doesn’t produce a new set. Instead, it just reports whether all the elements in one set are present in another. Notice the order: lows.issubset(ten) checks whether lows is a subset of ten, not the other way around.
slide 31 Of course, the complementary method issuperset also exists, and does what you’d expect.
slide 32 Now let’s remove the zero from lows, leaving the set of integers 1, 2, 3, and 4. This is similar to deleting an element from a list, but there’s an important difference. When deleting from a list, you specify the location of the element to delete. When deleting from a set, you must specify the value you want to take out. If that value isn’t in the set, this method does nothing.
slide 33 Here’s another method that creates a new set. symmetric_difference is sometimes called “exclusive or”; it returns the values that are in one set or another, but not in both. It’s useful from time to time…
slide 34 …but you’ll probably use plain old union more often.
slide 35 Finally, you can count how many values are in a set using len, just as you would to find the length of a list or string…
slide 36 …and check whether a particular value is in the set or not using in.
slide 37 To help make programs eaiser to type and read, most of the methods we’ve just seen can be written using arithmetic operators as well. For example, instead of lows.issubset(ten), we can write lows <= ten, just as we would if we were using pen and paper. There are even a couple of operators, like less than <, that don’t have long-winded equivalents.
slide 38 One operator that isn’t in this list is “not”.
slide 39 Mathematicians are quite comfortable negating sets: for example, the negation of “the set {1, 2}” is “all integers that aren’t 1 or 2″.
slide 40 This is a lot harder to do in a program, though. To continue with our example, is the string “pterodactyl” in the negation of the set {1, 2} or not?
slide 41 We’ll take another look at this problem when we get to classes and object-oriented programming in a later lecture.
slide 42 Right now, let’s use sets to clean up some field data.
slide 43 We have two files. The first has the names of all the birds our supervisor considers uninteresting from a research point of view.
slide 44 The second contains the names of all the birds we saw during a three-week field trip in a mosquito-infested hellhole in northern Ontario.
slide 45 Our task is to copy those observations, removing the uninteresting birds as we do so.
slide 46 This is the main body of our program.
slide 47 We start by reading in the set of birds we’re supposed to be subtracting from our observations. That’s going to take a few lines of code, so for now, we’ll pretend we can do this by calling a function called read_set, which we’ll come back and write in a minute or two.
slide 48 Next, we open our input file for reading, and our output file for writing.
slide 49 We then read lines from the input file one at a time and strip off any leading or trailing whitespace. In particular, line.strip will remove the newline and/or carriage return characters at the ends of the lines we’re reading in, so that we just have the names of the birds.
slide 50 These two lines are the heart of our program. If the bird name we just read is not in the set of names we’re supposed to ignore, we print it out. Everything else in our program is really just there to support these two lines—a realization that we’ll return to later when we talk about program design.
slide 51 Finally, once we’ve read and filtered all of our input, we close the input and output files.
slide 52 So far, so good: all that’s left is to write the function that reads in the set of names we’re supposed to ignore.
slide 53 We start by creating an empty set, which we will then fill up with bird names, and open the input file.
slide 54 We then read lines from the input one at a time, stripping off whitespace…
slide 55 …and put the strings left over in the set.
slide 56 When we’re done, we return the set that we created.
slide 57 Stepping back, the main program and the read_set function have almost exactly the same structure. Each one initializes some data, opens a file or files, loops over input data, does something with that data, then closes the files it’s using. We’ll have a closer look at this pattern, and how we can save ourselves writing it over and over again, in a later lecture.
slide 58 Finally, here’s a few tests of our program. If there’s nothing to remove, our output file is the same as our input. If we’re removing just one name, it’s subtracted as it should be, while if the name of every bird we saw is in the removals list, the output is empty. These three tests don’t exhaust all the possibilities, but the odds are pretty good that if a program passes them, it’s doing the right thing.
slide 59

  1. Ross Dickson
    July 13th, 2010 at 19:24 | #1

    Your read_set is buggy:

    def read_set(filename):
    ”’Read set elements from a file.”’
    result = set()
    reader = open(filename, ‘r’)
    for line in reader: # Here…
    line = line.strip()
    result.add(line) # …and here
    reader.close()
    return result

  2. October 6th, 2010 at 22:48 | #2

    I don’t see a way to go directly to the next part of the lecture. Am I missing something? If not, it would be nice to add it.

  3. Terri Yu
    February 19th, 2011 at 17:41 | #3

    In the discussion of the remove() method for sets, the slide says “When deleting from a set, you must specify the value you want to take out. If that value isn’t in the set, this method does nothing.” Actually, according to the Python documentation:

    remove(elem)
    Remove element elem from the set. Raises KeyError if elem is not contained in the set.

    So if the value isn’t in the set, the method raises an exception. It doesn’t do “nothing.”

  4. Terri Yu
    February 19th, 2011 at 18:41 | #4

    To delete an element from a set without worrying about whether the element is actually in the set, one can use the discard() method. From the Python documentation:

    discard(elem)
    Remove element elem from the set if it is present.