Teaching basic lab skills
for research computing

The Big Picture

One of the lessons we learned at Los Alamos National Laboratory in the 1990s and early 2000s is that most scientists don't actually want to learn how to program; they want to solve scientific problems. To many, programming is a tax they have to pay in order to do their research. To the rest, it's something they really would find interesting, but they have a grant deadline coming up and a paper to finish.

Getting scientists to make time to learn fundamental ideas that aren't directly relevant to the problems in front of them is an even harder sell. Partly it's those pesky deadlines again, but it's also often the case that the big picture doesn't make sense until you have first-hand experience with the details. Take abstraction, for instance, or the difference between interface and implementation: if you haven't written or modified software where those ideas saved you time and heartache, no amount of handwaving is going to get the idea across. The problem, of course, is that it's impossible to program well without understanding those bigger concepts.

Software Carpentry therefore has to:

  1. Give scientists programming skills that have a high likelihood of paying large dividends in the short term.
  2. Convey the fundamental ideas needed to make sensible decisions about software without explicitly appearing to do so.

Based on our experiences in the last 12 years, the skills that students need are fairly settled:

  • Clean coding (both micro-level readability and macro-level modularity)
  • Version control
  • Process automation for building, testing, and deploying software
  • How to package software for distribution and deployment
  • Managing information and workflow (from bug trackers to blogs)
  • Consuming data:
    • Text (line-oriented parsing with regular expressions)
    • Hierarchical (XML)
    • Binary
    • Relational
  • Building desktop GUIs and visualizing data
  • Basic security: public/private keys, digital signatures, identity management
  • Publishing data and providing services on the web

As Karen Reid and others have pointed out, doing all of that properly would earn you at least a minor in Computer Science at most universities. Cramming it into two weeks is simply not possible.

The bigger picture stuff isn't as clear yet, but is starting to come into focus. The buzzword du jour, computational thinking, means different things to different people, but Jon Udell's definition is a good starting point. For him, computational thinking includes:

  • Abstraction: ignoring details in order to take advantage of similarities
    • A key concept is the difference between interface and implementation
  • Querying: understanding how fuzzy matching, Boolean operations, and aggregate/filter dataflow work
    • This depends somewhat on understanding how to think in sets
  • Structured data: including hierarchical structure, the notion of meta-data (such as tagging and schemas), and so on
    • Equally important is understanding that programs work best with structured data, so structure improves findability and automation
  • Automation: having the computer do routine tasks so that people don't have to
  • Indirection: giving someone a reference to data, rather than a copy of the data, so their view of it is always fresh
  • Syndication: publishing data for general use, rather than sending it directly to a restricted set of people
    • The inverse is provenance: where did this data come from, and what was done to it?
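
To make the first item in that list concrete, here is a minimal Python sketch of the difference between interface and implementation. Everything in it is hypothetical and chosen only for illustration: the names `Table`, `ListTable`, `DictTable`, and `reading_for`, and the sample readings, do not come from the course material.

```python
from abc import ABC, abstractmethod


class Table(ABC):
    """Interface: the only thing calling code is allowed to rely on."""

    @abstractmethod
    def lookup(self, key):
        """Return the value stored under key, or None if it is absent."""


class ListTable(Table):
    """Implementation 1: linear search through (key, value) pairs."""

    def __init__(self, pairs):
        self._pairs = list(pairs)

    def lookup(self, key):
        for stored_key, value in self._pairs:
            if stored_key == key:
                return value
        return None


class DictTable(Table):
    """Implementation 2: hash-based lookup using a dictionary."""

    def __init__(self, pairs):
        self._index = dict(pairs)

    def lookup(self, key):
        return self._index.get(key)


def reading_for(table, sample_id):
    """Calling code: uses only lookup(), so either implementation will do."""
    return table.lookup(sample_id)


if __name__ == "__main__":
    # Hypothetical sample data, assumed to have unique keys.
    pairs = [("sample-1", 0.25), ("sample-2", 0.5)]
    for table in (ListTable(pairs), DictTable(pairs)):
        print(type(table).__name__, reading_for(table, "sample-2"))
```

Because `reading_for` never looks inside the table, the slower implementation can be swapped for the faster one without touching the calling code, which is the kind of saved time and heartache that makes the idea stick.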

I would like to add all of the following, though I realize that doing so gets us back into "B.Sc. in a week" problems:

  • Name spaces, call stacks, and recursion
  • Computational complexity: why some algorithms are intrinsically faster than others
  • How data is organized:
    • Values vs. references and the notion of aliasing
    • By-location structures (lists, vectors, and arrays)
    • By-name structures (dictionaries and records)
    • By-containment structures (trees)
    • By-traversal structures (graphs)
  • Programming models:
    • Procedural
    • Aggregate (whole-array, whole-list, etc.)
    • Object-oriented
    • Declarative
    • Event-driven (which brings in the difference between frameworks and libraries)
  • Programs as data
    • Functions as objects (another form of abstraction)
    • Programs that operate on programs (Make, drivers for legacy programs)
  • Quality, including:
    • What makes good code better than bad code (psychological underpinnings)
    • Testing (including the economics of testing)
    • Debugging (the scientific method applied to software)
    • The difference between verification ("have we done the thing right?") and validation ("have we done the right thing?")
    • Continuous improvement via reflection on root causes of errors
  • Basic concurrency:
    • Transactions vs. race conditions
    • Deadlock (much less important in practice)
  • Handling failures
  • Bricolage: how to find/adapt/combine odds and ends (these days, on the web) to solve a problem
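
As one small example from that list, the difference between values and references, and the aliasing it leads to, can be shown in a few lines of Python. This is only a sketch; the variable names and numbers are made up for illustration.

```python
import copy

# Hypothetical data: a list of rows, each row itself a list.
readings = [[1.0, 2.0], [3.0, 4.0]]

alias = readings                 # a second name for the very same list
shallow = list(readings)         # a new outer list that shares the inner rows
deep = copy.deepcopy(readings)   # a fully independent copy

alias.append([5.0, 6.0])         # visible through readings: same object
readings[0][0] = -1.0            # visible through shallow: shared inner row

if __name__ == "__main__":
    print("readings:", readings)
    print("alias is readings:", alias is readings)
    print("shallow:", shallow)
    print("deep:", deep)
```

Only `deep` still holds the original values at the end. Bugs caused by forgetting which names share data are common enough that this one idea probably earns its place in a short course.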

I call on all of this knowledge routinely even when solving trivial problems. This morning, for example, I:

  • did a search to find a wiki markup processor I could run from the command line,
  • downloaded and installed it,
  • changed five lines in the main routine to insert some extra text in its output,
  • added a ten-line filter function to overwrite the inserted text with some command-line parameter values, and
  • added fourteen lines to a Makefile to turn the wiki text into HTML whenever it's stale.

It took roughly 15 minutes, and will save me hours in the weeks to come. However, it only took 15 minutes because I've spent 29 years mastering the skills and ideas listed earlier. The challenge in creating Version 4.0 of this course will be to figure out how to convey as many of those skills and ideas as can be squeezed into two weeks.
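
For readers who have not used Make, the "whenever it's stale" rule in the last step of that example amounts to comparing file timestamps. The sketch below is a Python approximation of that idea, not the Makefile I actually wrote; the function names and the `convert` parameter are assumptions made for illustration.

```python
import os


def is_stale(source, target):
    """Return True if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


def rebuild_if_stale(source, target, convert):
    """Regenerate target from source only when target is out of date.

    convert is any function that takes the source text and returns the
    output text (for example, a hypothetical wiki-to-HTML translator).
    Returns True if the target was rebuilt, False if it was left alone.
    """
    if not is_stale(source, target):
        return False
    with open(source, encoding="utf-8") as reader:
        text = reader.read()
    with open(target, "w", encoding="utf-8") as writer:
        writer.write(convert(text))
    return True
```

Calling `rebuild_if_stale("notes.wiki", "notes.html", convert)` twice in a row should do the work only once, which is the whole point of the rule: the computer keeps track of what needs redoing so that people don't have to.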
