Up: Sets and Dictionaries

Phylogenetic Trees

slide 001 Hello, and welcome to this episode of the Software Carpentry lecture on sets and dictionaries. In the next few minutes, we’re going to look at how to reverse engineer the process of evolution, and how dictionaries can make that simpler.
slide 002 You don’t have to look at the natural world very hard to realize that some organisms are more alike than others.
slide 003 For example, if you look at the appearance, anatomy, and lifecycles of these seven fish…
slide 004 You can see that these two are closely related…
slide 005 …as are these two…
slide 006 …and these two.
slide 007 But what about the seventh? Where does it fit in? And how do the pairs relate to each other.
slide 008 As Theodosius Dobzhansky said almost a century ago, nothing in biology makes sense except in the light of evolution.
Since mutations usually occur one at a time, the more similarities there are between the DNA of two species, the more recently they had a common ancestor.
We can use this idea to reconstruct the evolutionary tree for a group of organisms using a hierarchical clustering algorithm.
slide 009 The first step is to find the two species that are most similar, and construct their plausible common ancestor.
slide 010 We then pair two more…
slide 011 …and two more…
slide 012 …and start joining pairs to individuals…
slide 013 …or pairs with other pairs.
slide 014 Eventually, all the organisms are connected.
slide 015 We can redraw those connections as a tree…
slide 016 …using the heights of branches to show the number of differences between the species we’re joining up.
slide 017 Let’s turn this into an algorithm.
slide 018 Initially, our universe U contains all the organisms we’re interested in.
While there are still organisms that haven’t been connected to the tree,
we find the two that are closest,
calculate their common parent,
remove the two we just paired up from the set,
and insert the newly-created parent.
slide 019 The set of ungrouped organisms shrinks by one each time, so this algorithm eventually terminates.
slide 020 And we can keep track of the pairings on the side to reconstruct the tree when we’re done.
slide 021 Now, what does “closest” mean?
slide 022 One simple algorithm is called unweighted pair-group method using arithmetic averages.

http://www.nmsr.org/upgma.htm

slide 023 Let’s illustrate it by calculating a phylogenetic tree for humans, vampires, werewolves, and mermaids.
The distances between each pair of species is shown in this table.
(We only show the lower triangle because it’s symmetric.)
slide 024 The closest entries—i.e., the pair with minimum distance—is human and werewolf.
slide 025 We will replace this with a common ancestor, which we will call HW.
slide 026 Its height will be 1/2 the value of the entry.
slide 027 And for each other species X, we will calculate a new score for HW and X as (HX + WX – HW)/2.
slide 028 For example, we will combine HV and VW which are in yellow, and HM and MW which are green.
slide 029 Which leaves us with this new matrix.
slide 030 The height of HW is half of the 5 we eliminated, or 2.5.
slide 031 Repeating this step, we combine HW with V…
slide 032 …and finally HWV with M.
slide 033 Our final tree looks like this.
slide 034 And the ‘missing’ heights are implied by the differences between branch values.
slide 035 Right: how do we translate this into software?
slide 036 We illustrated our algorithm with a triangular matrix…
slide 037 …but the order of the rows and columns is arbitrary.
slide 038 It’s really just a lookup table mapping pairs of organisms to numbers.
slide 039 And as soon as we think of ‘lookup tables’, we should think of dictionaries.
slide 040 The keys are pairs of organisms.
slide 041 which we will keep in alphabetical order so that there’s no confusion between ‘HW’ and ‘WH’
slide 042 and the values are the distances between those organisms.
slide 044 …becomes this dictionary.
slide 045 Time to create some code. Let’s write out the algorithm we discussed earlier.
slide 046 While we have some scores left to process,
find the closest pair…
create a new parent…
print it out (or record it somewhere to use later)…
remove entries that refer to the pair we’re combining…
and add new entries to the scores table.
slide 047 This code assumes that the scores are in a dictionary called ‘scores’.
slide 048 It also assumes that the names of the species are in a list called ‘species’.
slide 049 And yes, it took us a couple of false starts to come up with this code: we would have shown you the blind alleys, but this episode is already rather long.
slide 050 Let’s start writing the functions our overall algorithm assumes.
The first is min_pair, which finds the closest pair of organisms in the scores table.
The algorithm is simple…
slide 051 …but it assumes we have a way to generate all valid combinations of organisms from the species list. We’ll need to write that.
slide 052 It’s also worth noting the assert statements, which check that the data we’re working with is sensible, and that we actually found a minimum value. Remember, good programs fail early and fail often.
slide 053 Now let’s write the function that generates pairs of species.
This uses a double loop to construct a list of pairs.
slide 054 Notice the starting index on the inner loop: if the outer loop is at position ‘i’, the inner loop starts at ‘i+1′, which ensures that each possible pair is only generated once.
slide 055 Here’s what our program looks like so far:
there’s a main body, and two functions find_min_pair and combos.
slide 056 Looking back at the program’s main body, the next function we need to write is ‘create_new_parent’.
It’s pretty simple: we just concatenate the names of the two organisms we’re combining.
The new score is just half the score for the pair we’re combining.
slide 057 We put square brackets around the name to make it more readable,
and so that we can easily see the order in which things were paired up.
slide 058 Three functions down: what’s next?
slide 059 Looking back at our main program, we need to remove entries from the scores table that refer to the organisms we’re pairing up.
Oh, and we need to take their names out of the list of species, too.
That’s pretty easy to do…
slide 060 …but notice that this function returns the old score, because we need it in the main program.
This isn’t a great design: there’s nothing in the name ‘remove_entries’ to suggest that we’re saving and returning the old score,
and it isn’t immediately obvious why we’re doing this.
As an exercise, try rewriting the main program so that ‘remove_entries’ doesn’t have to return anything,
and see if you think it’s easier to understand.
slide 061 So much for our fourth function…
slide 062 Our fifth updates the scores table and species list.
It’s actually pretty complicated:
for each species that isn’t being paired up, we have to…
slide 063 …calculate the combination of that species with the two halves of the new pair…
slide 064 …saving the scores for later use…
slide 065 …create a pair that includes the newly-constructed parent…
slide 066 …and save our work.
slide 067 When we’re done, we have to add the newly-created parent to the list of species…
slide 068 …and then there’s this call to sort the species list.
If we don’t do this, our program will produce a wrong answer.
After this episode is over, take a few minutes to review the code
and see if you can figure out why.
Again, this is an example of not very good design:
if it isn’t clear why something is being done where it is,
it should be moved or explained.
slide 069 Our program now has five functions and a main body.
slide 070 But we’re still not done: we need to write the ‘tidy_up’ function we referred to in ‘update’.
slide 071 It in turn assumes a function called ‘make_pair’ that combines a pair of species…
slide 072 …which simply constructs a tuple of the species’ names ordered alphabetically.
Remember, we’re ordering names so that each pair has a unique representation.
slide 073 Our two new functions fit into our program as shown here.
slide 074 We have nothing left to write, so let’s try it out.
If we run it with the four species we used as an example earlier,
the output is as shown here.
These are the same values we had in our example tree,
so we seem to be on the right track.
As exercises,
try writing unit tests for this code,
and then modify the program so that it displays the entire reconstructed tree.
Oh, and try to figure out why the ‘update’ function has to sort things,
or what symptoms you’ll see if you don’t do this.
slide 075 Thank you for listening.

  1. Terri Yu
    February 20th, 2011 at 01:29 | #1

    The original species list (before the algorithm starts running) also needs to be sorted.

  2. Terri Yu
    February 20th, 2011 at 01:36 | #2

    To fix the problem of remove_entries() returning a value, maybe one could have the create_parent() function also return the parent_score. So instead of parent, height = create_new_parent(scores, min_pair), have parent, parent_score, height = create_new_parent(scores, min_pair). It makes sense for the create_new_parent() function to return the parent score (the score for the two species that are combined to be the parent).

  1. March 25th, 2012 at 22:17 | #1