Data Management


Hello, and welcome to our Software Carpentry lecture on data management. Many of the ideas here were inspired by William Stafford Noble's "A Quick Guide to Organizing Computational Biology Projects".

In this lecture, we'll discuss how best to manage multiple versions of data files, since the method you use for source code probably doesn't work as well for your data.

In my experience, data file management usually goes something like the following. First, you have one data file. All is well with the world.

Then, you come out with a revision. Some people directly overwrite the old data. Eventually, though, you'll want to compare the new and old data, or you'll find out the new data isn't actually correct and the old data is better, or something else will come up, and you'll wish you hadn't deleted it.

You know this, so rather than overwriting the old file, you add "old" to the name and put the new file in the same spot the old one was. This is still pretty clear, but hopefully your Spidey-sense is warning you of danger.

Inevitably, you come out with another version. At this point it's clear what to do. You've already set a precedent, so you go along with it.

One problem, from a scientific perspective, is that any experiment that was run on "samples.mat", if re-run, will now get a different result. Further, which old "samples.mat" was it run on? Once you have more than one "old" version, it all gets worse.

More than should be the case, a file's path and name are very much its identity, especially as far as other scripts (and links) are concerned.

As a result, scientists (and website developers) tend not to rename or move existing files, but instead pick up a habit that is almost as bad.

Which one is the most recent? Well, you guess.

You could look at the data. Or you could check the last modified date, but if you're going to do that, don't ever move or copy the data.
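The warning about copying can be seen in a small Python sketch. It relies on the standard library's `shutil.copy`, which copies contents and permission bits but not timestamps, and `shutil.copy2`, which also tries to preserve timestamps. The file names and the backdated timestamp are made up for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path


def copy_and_compare_mtimes(workdir: Path) -> dict:
    """Copy a file two ways and report each file's modification time."""
    original = workdir / "samples.mat"  # hypothetical data file
    original.write_bytes(b"pretend this is binary data")

    # Backdate the original to 2010-01-01 (UTC) so the difference is obvious.
    old_time = 1262304000
    os.utime(original, (old_time, old_time))

    plain = workdir / "samples_copy.mat"
    preserved = workdir / "samples_copy2.mat"
    shutil.copy(original, plain)       # contents and permissions only
    shutil.copy2(original, preserved)  # also tries to keep timestamps

    return {
        "original": int(original.stat().st_mtime),
        "copy": int(plain.stat().st_mtime),
        "copy2": int(preserved.stat().st_mtime),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, mtime in copy_and_compare_mtimes(Path(tmp)).items():
            print(name, mtime)
```

The plain copy ends up stamped with the time of the copy, not the time the data was produced, which is why the modification date is a fragile way to identify the newest version.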

This sounds like a job for…

version control.

Or is it?

Why shouldn't we just use Subversion to manage our data files? Subversion is good for keeping track of simultaneous modification, archiving, merging, branching, diffing…

What makes sense for data?

Well, archiving.

We really just want a clear, stable way to archive data so that we can find and access it.

Despite the rapidly decreasing cost of storage, it is still possible to run out of disk space. In my lab, we can easily go through 2 TB/month if we're not careful. Since version control tools usually store revisions in terms of changed lines, with binary data files they end up essentially storing every revision separately. This isn't that bad (it's what we'd be doing anyway), but it means version control isn't doing what it likes to do, and the repository can get very large very quickly. Another concern is that, once very old data will no longer be used, it can be nice to zip, tar, or straight-up delete the old data files. This is not possible if your data is version controlled: your repository will only increase in size.

The solution most people agree on is to set up a directory structure that keeps different versions of the data separate, while still allowing you to track down and use old versions. For example, a lot of people will move all the old data into a subdirectory to archive it and always keep the latest version right at the top level (with all the problems noted before).

Or else they will, realizing the problems of the above, create subdirectories for all the new versions, but leave the oldest version in the root.

But there are many problems with this. We can do better.

A much better approach is to, from the start, add a chronological layer to our data. Structuring folder names this way (yyyy-mm-dd) ensures that they are always printed in chronological order. Also, scripts and UNIX commands can all take advantage of the same number of levels:

    summarize_samples data/*/samples.mat

vs:

    summarize_samples data/samples.mat data/*/samples.mat

and often:

    find data -type f -name "samples.mat" | xargs summarize_samples
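The same property is easy to exploit from a scripting language. Here is a minimal Python sketch, assuming a `data/<yyyy-mm-dd>/samples.mat` layout. The dates and the helper name `samples_in_date_order` are invented for the example.

```python
import tempfile
from pathlib import Path


def samples_in_date_order(data_dir: Path, filename: str = "samples.mat") -> list:
    """Return every data_dir/<yyyy-mm-dd>/<filename>, oldest first.

    yyyy-mm-dd names sort correctly as plain strings, so no date parsing is needed.
    """
    return sorted(data_dir.glob(f"*/{filename}"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        # Hypothetical dates, created out of order on purpose.
        for day in ["2011-03-15", "2010-11-02", "2011-01-20"]:
            (data / day).mkdir(parents=True)
            (data / day / "samples.mat").write_bytes(b"")
        versions = samples_in_date_order(data)
        for path in versions:
            print(path.relative_to(data).as_posix())
        print("latest:", versions[-1].parent.name)
```

Because every version sits at the same depth, a single pattern finds them all, and the last entry of the sorted list is always the most recent one.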

But we are missing something critical. What if we give data access to a collaborator, a new student joins the project, or it's three years later and we have forgotten when the various stages of the project happened?

We need context, and we usually add it with some metadata. (Keeping a notebook as well is even better, but what it records should also be in the metadata.)

Metadata records who the data is from, when it was generated, what the experimental conditions were, and so on. We could put a header inside the file (we discuss that in our essay on provenance), but this doesn't work too well with binary files. We could create a separate metadata file for each data file, but these can quickly get out of sync or out of hand.

I think of metadata files as information kiosks. You place a small number in strategic places so that people who are lost can find what they want. For a more scientific description, I use them as flat databases to list various attributes that I might be interested in searching by later.

For example, the previous approach can be improved by adding a README file.

If you have more than a few data files, or you are downloading data files from various websites, it quickly becomes critical to add metadata files that describe all the files in each directory: where they came from, their versions, and any other important information.
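To make the "flat database" idea concrete, here is one possible sketch in Python. It assumes the README is kept as a small tab-separated table. The column names, file names, and sources below are all hypothetical, and a free-form README works too if you don't need to search it by script.

```python
import csv
import io

# Hypothetical README contents: one tab-separated row per data file.
README_TSV = (
    "file\tsource\tdownloaded\tversion\n"
    "genome.fa\texample.org/genomes\t2011-01-20\tv2\n"
    "annotations.gff\texample.org/annotations\t2011-01-20\tv5\n"
    "samples.mat\tgenerated in the lab\t2011-03-15\tn/a\n"
)


def find_files(readme_text: str, **criteria: str) -> list:
    """Return the names of files whose metadata matches every criterion."""
    rows = csv.DictReader(io.StringIO(readme_text), delimiter="\t")
    return [
        row["file"]
        for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # Which files did we download on 2011-01-20?
    print(find_files(README_TSV, downloaded="2011-01-20"))
```

Keeping one row per file in a line-based text format means the README is easy to search by script and by eye.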

Because these metadata files undergo many changes, might be edited and revised by multiple people at a time, and are usually line-based text files, they are perfect candidates for version control.

Then, even if you later remove data files for space reasons…

…you can leave the metadata files that describe the data that had been there, and the exact steps that one should take to reacquire it (if that is possible).

A useful layout for a project is to have a data directory with chronological subfolders, with only the metadata under version control.

All scripts and source code go into a separate branch of the directory tree, where everything is version controlled.

And then the results and experiments are chronologically recorded in a different directory, where nothing is version controlled.
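As a rough illustration, the layout just described could be created with a few lines of Python. The lecture only names the data directory, so the directory names `src` and `results`, the README fields, and the helper `create_project_skeleton` are assumptions of this sketch.

```python
import tempfile
from pathlib import Path


def create_project_skeleton(root: Path, date: str) -> list:
    """Create data/, src/, and results/, with a dated subfolder where appropriate."""
    directories = [
        root / "data" / date,     # raw data: only the README is version controlled
        root / "src",             # scripts and source code: fully version controlled
        root / "results" / date,  # experiment output: not version controlled
    ]
    for directory in directories:
        directory.mkdir(parents=True, exist_ok=True)

    # A README stub, so the metadata exists from day one.
    readme = root / "data" / date / "README"
    if not readme.exists():
        readme.write_text("source:\ndate acquired:\nversion:\nnotes:\n")

    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for entry in create_project_skeleton(Path(tmp), "2011-03-15"):
            print(entry)
```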

These techniques all revolve around the idea of sensible archiving, because even if you're the only one working on a project, you'll forget the details of what you're working on several years later.

For example, it had been a year since I had worked on a project when a colleague asked me precisely how I had generated a data file.

I found that having the following directory structure helped immensely. The README listed where I downloaded "filenameExactlyAsDownloaded.tkq" from, when I downloaded it, and the version of it, and "get_targets.sh" listed all the commands (including the "wget" commands to download "otherNecessaryFile.dat") I entered to create targets.gff from filenameExactlyAsDownloaded.tkq. Rerunning "get_targets.sh" would recreate the "targets.gff" file exactly, and looking at it shows the precise parameters used in the process.

I was able to say exactly how I had generated targets.gff, and to easily try to recreate it from the raw data files.
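One way to check that a rerun really does recreate a file exactly is to record a checksum next to the metadata and compare it later. This is an addition of this write-up, not something the workflow above is stated to include. The sketch uses Python's standard `hashlib`, and the helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path: Path, recorded_digest: str) -> bool:
    """True if the file on disk has the digest noted earlier (e.g. in the README)."""
    return sha256_of(path) == recorded_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "targets.gff"
        target.write_bytes(b"first run\n")
        recorded = sha256_of(target)               # note this in the README

        target.write_bytes(b"first run\n")         # a faithful rerun
        print(matches_recorded(target, recorded))  # True

        target.write_bytes(b"different output\n")  # a rerun that drifted
        print(matches_recorded(target, recorded))  # False
```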

A final point is that not all projects lend themselves to this exact directory structure, and that thinking hard about the directory structure at the beginning can save huge amounts of pain and suffering down the road.

What we've talked about so far is extremely useful for a "bag-of-experiments" sort of project.

But what if our project is a pipeline with variations at each level, like the following?

One solution I have seen is to do something like the following. This works reasonably well if your pipeline is relatively small and you usually run all of it. Its weakness is that the order in which the steps are done isn't obvious, but we could document that in a README file.

It's not uncommon to have more than one way of doing a step in a pipeline, but we'll quickly run into trouble with many iterations of a more complex pipeline. The problem is that this directory structure doesn't capture the dependency structure of the process.

Since the data at one point in a pipeline is dependent upon the configurations of all the previous steps, this can be represented cleanly and robustly with the following organization.
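The slide showing that organization is not reproduced in this transcript. One plausible reading, and it is only an assumption, is a nested layout in which each step's output directory sits inside the directory of the step and configuration that produced its input. A minimal Python sketch of that idea, with made-up step and variant names:

```python
from pathlib import PurePosixPath


def output_dir(root: str, choices: list) -> PurePosixPath:
    """Nest one directory per completed step, named '<step>-<variant>'.

    `choices` is an ordered list of (step, variant) pairs, earliest step first.
    """
    path = PurePosixPath(root)
    for step, variant in choices:
        path = path / f"{step}-{variant}"
    return path


if __name__ == "__main__":
    # Hypothetical steps and variants, purely for illustration.
    run_a = [("align", "fast"), ("filter", "strict")]
    run_b = [("align", "fast"), ("filter", "loose")]
    print(output_dir("results", run_a))      # results/align-fast/filter-strict
    print(output_dir("results", run_b))      # results/align-fast/filter-loose
    print(output_dir("results", run_a[:1]))  # results/align-fast
```

With this kind of nesting, the path of any file spells out every upstream choice that went into it, and two runs that share early steps share the corresponding parent directory.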

To summarize, please think hard at the beginning of your project about how you are going to organize your data as it grows. Something that works well for one file, or for two files, won't necessarily work well for a hundred files. Second, version control your metadata, not your data files—use a conventional backup system for those. Third, an intelligent structure not only makes your data easier to archive, track, and manage, but it also reduces the chance that the paths in the pipeline get crossed and the data out the end isn't what you think it is. That can save you embarrassment as well as time.
