
Data Management

slide 000 Hello, and welcome to our Software Carpentry lecture on data management. A lot of inspiration and ideas were taken from William Stafford Noble’s A Quick Guide to Organizing Computational Biology Projects.
slide 001 In this lecture, we’ll discuss how best to manage multiple versions of data files, since the method that you use for source code probably doesn’t work as well for your data.
slide 002 In my experience, data file management usually goes something like the following. First, you have one data file. All is well with the world.
slide 003 Then, you come out with a revision. Some people directly overwrite the old data. Eventually, though, you’ll want to compare the new and old data, or you’ll find out that the new data isn’t correct and the old data is better, or something else will come up, and you’ll wish you hadn’t deleted it.
slide 004 You know this, so rather than overwriting the old file, you add “old” to the name and put the new file in the same spot the old one was. This is still pretty clear, but hopefully your Spidey-sense is warning you of danger.
slide 005 Inevitably, you come out with another version. At this point it’s clear what to do. You’ve already set a precedent, so you go along with it.
slide 006 One problem, from a scientific perspective, is that any experiment that was run on “samples.mat”, if re-run, will now get a different result. Further, which old “samples.mat” was it run on? Once you have more than one “old” version, it all gets worse.
slide 007 More than should be the case, a file’s path and name are very much its identity, especially as far as other scripts (and links) are concerned.
slide 008 As a result, scientists (and website developers) tend not to rename or move existing files, but instead pick up a habit that is almost as bad.
slide 009 Which one is the most recent? Well, you guess.
slide 010 You could look at the data. Or you could check the last-modified date, but that only works if you never move or copy the data, because a copy can end up with a new timestamp.
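To see why the last-modified date is a fragile guide, here is a minimal Python sketch. Python, the file names, and the backdated timestamp are my own choices for illustration, not something from the slides. It backdates a file and then copies it two ways: a plain copy gets a fresh timestamp, while a metadata-preserving copy keeps the original one.

```python
import os
import shutil
import tempfile

def copy_and_compare(workdir):
    """Return (original, plain-copy, preserving-copy) modification times."""
    original = os.path.join(workdir, "samples.mat")
    with open(original, "wb") as handle:
        handle.write(b"pretend this is binary data")

    # Backdate the original to 1 January 2010 UTC (hypothetical date).
    old_time = 1262304000
    os.utime(original, (old_time, old_time))

    plain = os.path.join(workdir, "samples_plain_copy.mat")
    preserved = os.path.join(workdir, "samples_preserved_copy.mat")
    shutil.copy(original, plain)       # copies contents and permissions only
    shutil.copy2(original, preserved)  # also tries to preserve timestamps

    return tuple(os.path.getmtime(p) for p in (original, plain, preserved))

with tempfile.TemporaryDirectory() as workdir:
    mtime_original, mtime_plain, mtime_preserved = copy_and_compare(workdir)
    print("plain copy looks newer:", mtime_plain > mtime_original)
    print("copy2 kept the date:   ", abs(mtime_preserved - mtime_original) < 1)
```

So the timestamp tells you when a particular copy was written, which is not necessarily when the data was produced.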
slide 011 This sounds like a job for…
slide 012 version control.
slide 013 Or is it?
slide 014 Why shouldn’t we just use Subversion to manage our data files? Subversion is good for keeping track of simultaneous modification, archiving, merging, branching, diffing…
slide 015 What makes sense for data?
slide 016 Well, archiving.
slide 017 We really just want a clear, stable way to archive data so that we can find and access it.
slide 018 Despite the rapidly decreasing cost of storage, it is still possible to run out of disk space. In my lab, we can easily go through 2 TB/month if we’re not careful. Since version control tools usually store revisions in terms of lines, with binary data files they end up essentially storing every revision separately. This isn’t that bad (it’s what we’d be doing anyway), but it means version control isn’t doing what it likes to do, and the repository can get very large very quickly. Another concern is that, once very old data will no longer be used, it can be nice to zip, tar, or straight-up delete the old data files. This is not possible if your data is version controlled: your repository will only increase in size.
slide 019 The solution most people agree on is to set up a directory structure that keeps different versions of the data separate, while still allowing you to track down and use old versions. For example, a lot of people will move all the old data into a subdirectory to archive it and always keep the latest version right at the top level (with all the problems noted before):
slide 020 Or else they will, realizing the problems of the above, create subdirectories for all the new versions, but leave the oldest version in the root.
slide 021 But there are many problems with this. We can do better.
slide 022 A much better approach is to add a chronological layer to our data from the start. Naming folders by date in yyyy-mm-dd form ensures that they are always listed in chronological order. Also, because every file sits at the same depth, scripts and UNIX commands can use a single pattern:
    summarize_samples data/*/samples.mat
vs:
    summarize_samples data/samples.mat data/*/samples.mat
and often:
    find data -type f -name "samples.mat" | xargs summarize_samples
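As a rough illustration of why the yyyy-mm-dd convention pays off, here is a small Python sketch. The dates and helper names are hypothetical; only the data/<date>/samples.mat layout comes from the lecture. It builds a few dated folders out of order and shows that a plain string sort of the matching paths is already chronological.

```python
import glob
import os
import tempfile

def make_dated_layout(root, dates, filename="samples.mat"):
    """Create root/data/<yyyy-mm-dd>/<filename> for each date string."""
    for date in dates:
        folder = os.path.join(root, "data", date)
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, filename), "wb") as handle:
            handle.write(date.encode("ascii"))  # placeholder contents

def versions_in_order(root, filename="samples.mat"):
    """Return every version of filename, oldest first."""
    pattern = os.path.join(root, "data", "*", filename)
    # Sorting the paths as plain strings sorts them by date, because
    # yyyy-mm-dd puts the most significant part first.
    return sorted(glob.glob(pattern))

with tempfile.TemporaryDirectory() as root:
    # Hypothetical dates, deliberately created out of order.
    make_dated_layout(root, ["2011-11-30", "2010-03-07", "2011-02-14"])
    ordered = versions_in_order(root)
    dated_folders = [os.path.basename(os.path.dirname(p)) for p in ordered]
    print(dated_folders)
```

The most recent version is then simply the last entry of the sorted list, with no guessing from file names.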
slide 023 But we are missing something critical. What if we give data access to a collaborator, a new student joins the project, or it’s three years later and we have forgotten when the various stages of the project happened?
slide 024 We need context, and we usually add it with some metadata (even better is to keep a notebook as well, but the context should also be in the metadata).
slide 025 Metadata records who the data is from, when it was generated, what the experimental conditions were, and so on. We could put a header inside the file (we discuss that in our essay on provenance), but this doesn’t work too well with binary files. We could create a separate metadata file for each data file, but these can quickly get out of sync or out of hand.
slide 026 I think of metadata files as information kiosks. You place a small number in strategic places so that people who are lost can find what they want. For a more scientific description, I use them as flat databases to list various attributes that I might be interested in searching by later.
slide 027 For example, the previous approach can be improved by adding a README file.
slide 028 If you have more than a few data files, or you are downloading data files from various websites, it quickly becomes critical to add metadata files that describe all the files in each directory as well: where they came from, their versions, and any other important information.
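One way to picture a metadata file used as a flat database is the Python sketch below. It assumes the README is kept as a tab-separated table; the column names, file names, and example.org URLs are all made up for illustration, and the lecture does not prescribe any particular format.

```python
import csv
import io

# Hypothetical README contents: one tab-separated row per data file.
README_TEXT = (
    "file\tsource\tdownloaded\tversion\n"
    "genes.gff\thttp://example.org/genes.gff\t2011-02-14\t1.2\n"
    "reads.dat\thttp://example.org/reads.dat\t2011-02-14\t3\n"
    "genes_old.gff\thttp://example.org/genes.gff\t2010-03-07\t1.1\n"
)

def load_metadata(text):
    """Parse the README table into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def find_files(records, **criteria):
    """Return names of files whose metadata matches every criterion."""
    return [
        record["file"]
        for record in records
        if all(record[key] == value for key, value in criteria.items())
    ]

records = load_metadata(README_TEXT)
print(find_files(records, downloaded="2011-02-14"))
print(find_files(records, source="http://example.org/genes.gff", version="1.1"))
```

Keeping one row per file means the same README stays readable for a person and searchable for a script.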
slide 029 Because these metadata files undergo many changes, might be edited and revised by multiple people at a time, and are usually line-based text files, they are perfect candidates for version control.
slide 030 Then, even if you later remove data files for space reasons…
slide 031 …you can leave the metadata files that describe the data that had been there, and the exact steps that one should take to reacquire it (if that is possible).
slide 032 A useful layout for a project is to have a data directory with chronological subfolders, with only the metadata under version control.
slide 033 All scripts and source code go into a separate branch of the directory tree, where everything is version controlled.
slide 034 And then the results and experiments are chronologically recorded in a different directory, where nothing is version controlled.
slide 035 These techniques all revolve around the idea of sensible archiving, because even if you’re the only one working on a project, you’ll forget the details of what you’re working on several years later.
slide 036 For example, a year after I had last worked on a project, a colleague asked me precisely how I had generated a data file.
slide 037 I found that having the following directory structure helped immensely. The README listed where I downloaded “filenameExactlyAsDownloaded.tkq” from, when I downloaded it, and which version it was. “get_targets.sh” listed all the commands I entered to create “targets.gff” from “filenameExactlyAsDownloaded.tkq” (including the “wget” commands to download “otherNecessaryFile.dat”). Rerunning “get_targets.sh” would recreate the “targets.gff” file exactly, and reading it shows the precise parameters used in the process.
slide 038 I was able to say exactly how I had generated targets.gff, and to easily try to recreate it from the raw data files.
slide 039 A final point is that not all projects lend themselves to this exact directory structure, and that thinking hard about the directory structure at the beginning can save huge amounts of pain and suffering down the road.
slide 040 What we’ve talked about so far is extremely useful for a “bag-of-experiments” sort of project.
slide 041 But what if our project is a pipeline with variations at each level, like the following?
slide 042 One solution I have seen is to do something like the following. This works reasonably well if your pipeline is relatively small and you usually run all of it. Its weakness is that the order of the steps isn’t obvious, but we could document that in a README file.
slide 043 It’s not uncommon to have more than one way of doing a step in a pipeline, but we’ll quickly run into trouble with many iterations of a more complex pipeline. The problem is that this directory structure doesn’t capture the dependency structure of the process.
slide 044 Since the data at one point in a pipeline is dependent upon the configurations of all the previous steps, this can be represented cleanly and robustly with the following organization.
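The slide’s diagram isn’t reproduced in this transcript, so the following Python sketch is only my reading of the idea: nest each step’s output directory inside the directory of the step it depends on, so that the path itself spells out the whole chain of configurations. The step names, variant names, and the output_dir helper are hypothetical.

```python
import os

def output_dir(root, steps):
    """Build the directory for the output of the last step in `steps`.

    `steps` is a list of (step_name, variant) pairs in pipeline order, so
    the path spells out every configuration the data depends on.
    """
    parts = ["{}_{}".format(name, variant) for name, variant in steps]
    return os.path.join(root, *parts)

# Hypothetical three-step pipeline with two variants of the last step.
upstream = [("align", "strict"), ("filter", "q30")]
run_a = output_dir("results", upstream + [("cluster", "k5")])
run_b = output_dir("results", upstream + [("cluster", "k10")])

print(run_a)
print(run_b)
# Both runs share the directory of the steps they have in common.
print(os.path.dirname(run_a) == os.path.dirname(run_b))
```

With this kind of nesting, changing an early step naturally produces a new subtree, so downstream outputs from different configurations cannot end up side by side in the same folder.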
slide 045 To summarize: first, please think hard at the beginning of your project about how you are going to organize your data as it grows. Something that works well for one file, or for two files, won’t necessarily work well for a hundred files. Second, version control your metadata, not your data files; use a conventional backup system for those. Third, an intelligent structure not only makes your data easier to archive, track, and manage, but also reduces the chance that the paths in the pipeline get crossed and the data out the end isn’t what you think it is. That can save you embarrassment as well as time.
slide 046 Thank you for listening.

