I am an ecologist. Traditionally, many ecologists spend part of the year outdoors, in the field, to collect data, and then spend the rest of the year analyzing that data and writing papers. With new sensors that passively collect audio, weather, or geolocation data, even a single field season can yield massive amounts of data. Add to that the pressing need for more long-term data (aggregating multiple years) and the problem of managing and analyzing the data becomes increasingly difficult for scientists without basic computational skill sets or strong computational collaborators.
I recently began a postdoc at Stony Brook University with Catherine Graham, studying climate change impacts on hummingbird diversity. For one of our "small" projects, we collected 1 TB of data in only one month of fieldwork! This includes a variety of sensor-generated data (audio, weather) and human-generated data (bird and floral observations), which are stored as .csv files.
We haven't had an opportunity to analyze the data yet, but many of the skills taught at Software Carpentry workshops will be key in moving this project forward.
1. With a lot of data comes a lot of responsibility. The data are currently stored on multiple external hard drives, and my copy is under local version control. I worked with my collaborators to write a detailed metadata file (so important!), a README file, and a methods file that accompany our data. The metadata file describes the who, what, where, when, and why of our data; the README file outlines all potential "problems" with our data; and the methods file describes the study design and fieldwork in detail. Our work involves a large number of collaborators and a large amount of data. These files are essential to our basic organization, and to making sure no important details are lost or forgotten. For more on good data practices: read this.
2. A lot of data can be efficiently stored and queried using relational databases. The data are currently managed in a relational database. This allows us to link data across multiple sites and observation types, and to store the data efficiently, with minimal repetition. I worked with my collaborators from Ecuador to design the database, with the goal of making it as easy as possible for future collaborators to query the data and to enter new data if the study is repeated.
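To illustrate the idea of linking observations to sites with minimal repetition, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and example rows are all hypothetical; our actual schema is larger and more detailed.

```python
import sqlite3

# Hypothetical two-table layout: site details are stored once, and each
# observation points back to its site by id, so nothing is repeated.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sites (
        site_id     INTEGER PRIMARY KEY,
        name        TEXT,
        elevation_m REAL
    )
""")
conn.execute("""
    CREATE TABLE observations (
        obs_id   INTEGER PRIMARY KEY,
        site_id  INTEGER REFERENCES sites(site_id),  -- links back to a site
        obs_type TEXT,   -- e.g. 'bird' or 'floral'
        species  TEXT,
        obs_date TEXT
    )
""")
conn.execute("INSERT INTO sites VALUES (1, 'Site A', 3100.0)")
conn.executemany(
    "INSERT INTO observations (site_id, obs_type, species, obs_date) "
    "VALUES (?, ?, ?, ?)",
    [(1, "bird", "hummingbird sp. 1", "2013-06-01"),
     (1, "floral", "flower sp. 1", "2013-06-01")],
)

# One query links every observation to its site details via a join,
# without the site name or elevation ever being stored twice.
rows = conn.execute("""
    SELECT s.name, o.obs_type, o.species
    FROM observations AS o
    JOIN sites AS s ON o.site_id = s.site_id
""").fetchall()
for name, obs_type, species in rows:
    print(name, obs_type, species)
```

The same join logic works whether the database is queried from Python, R, or the sqlite3 command line, which is part of what makes a relational layout convenient for a team of collaborators.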
3. A lot of data can be quickly analyzed using a good programming language. Most of the analysis will be done using R, which is the dominant language among ecologists. We will work together to conduct the analyses, which will require good coordination and communication among collaborators, especially since we are scattered across the western hemisphere. This is much easier when code is clean, concise, and modular.
4. No amount of good code and good data is useful if you don't have provenance. The metadata, error log, and detailed methods documents are part of provenance – we can access these files to remember who did what (and why), what went well, and what went wrong in the study. Version control is another important part of this. Keeping track of changes in the code, data, and documentation will tell us who made changes and how often, and will allow us to recall older versions of the analysis in case something breaks. Good provenance should document everything in enough detail to make the research reproducible for ourselves, for future collaborators, and for publication.
This is very much a work in progress, and I'm excited to see where we will end up. Having large amounts of data is a wonderful problem to have, but for many scientists, it is a very real problem. This is why I think that the work that Software Carpentry volunteers are doing is so important. Even having a very basic level of computational literacy can help scientists communicate more effectively with computational collaborators, and ultimately, to be more independent.