Teaching basic lab skills
for research computing

Systematic Curriculum Design

Executive summary: we'd appreciate your help organizing and motivating our material better.

One of the good things about traveling is that it gives me time to think. One of the bad things about thinking is that every time I do, I wind up with more work than I had when I started. For example, to organize and motivate our content, I'm using eight questions that scientists frequently ask:

  1. How can I manage this data?
  2. How can I process it?
  3. How can I tell if I've processed it correctly?
  4. How can I find and fix bugs when I haven't?
  5. How can I keep track of what I've done?
  6. How can I find and use other people's work?
  7. How can other people find and use mine?
  8. How can I do all these things faster?

On the other side of the equation I have a syllabus for the core Software Carpentry material, which includes:

  • the command-line shell (e.g., Bash)
  • version control
  • basic programming (variables, lists, loops, conditionals, and simple file I/O)
  • functions and libraries
  • databases (i.e., basic SQL queries)
  • matrix programming (e.g., MATLAB or NumPy)
  • quality assurance (defensive programming, testing, etc.)
  • dictionaries (or hashes, if you're a Perl programmer)
  • the development process (stepwise refinement, red-green-refactor, performance profiling)
  • web programming (by which we mean using web APIs, not providing services yourself)

In order to figure out how well we're helping scientists, we need to map their needs onto our content. Here's what I've come up with:

| Question | Subject | Answer |
|---|---|---|
| How can I manage this data? | The Shell | Use directories and sub-directories with meaningful names. Use filenames that can easily be matched with wildcards. Use filename extensions that indicate the type of data in the file. Use text unless there's a powerful reason to use something else. |
| | Version Control | If it's megabytes or less, put it under version control. |
| | Basic Programming | Create and use data formats that are easy for programs to parse. |
| | Functions and Libraries | |
| | Databases | Store it in a relational database. Store each atom of information in its own field. Make sure each record has a unique key. Make sure that information is never duplicated. Use foreign keys and joins to combine information from different tables. |
| | Number Crunching | Represent it as a matrix, because that's easy to process. |
| | Quality | |
| | Sets and Dictionaries | Store it in a set or dictionary so that elements can be looked up by value rather than by position. |
| | Development | |
| | Web Programming | Format it as HTML (or XML, or some other widely-used format). Separate content from presentation (e.g., use CSS for styling). |
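
To make the "Databases" answers above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and readings are hypothetical, invented only for illustration; the point is that each record has a unique key, each site name is stored once, and a foreign key plus a join recombines the two tables.

```python
import sqlite3

# Hypothetical example data: sites, and the readings taken at each site.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE site (
        site_id INTEGER PRIMARY KEY,      -- unique key for every record
        name    TEXT NOT NULL UNIQUE      -- each site name is stored exactly once
    );
    CREATE TABLE reading (
        reading_id INTEGER PRIMARY KEY,
        site_id    INTEGER NOT NULL REFERENCES site(site_id),
        taken_on   TEXT NOT NULL,         -- one atom of information per field
        salinity   REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO site VALUES (?, ?)", [(1, "DR-1"), (2, "MSK-4")])
conn.executemany(
    "INSERT INTO reading (site_id, taken_on, salinity) VALUES (?, ?, ?)",
    [(1, "1927-02-08", 0.13), (1, "1927-02-10", 0.09), (2, "1932-01-14", 0.21)],
)

# A join recombines the site names with their readings.
rows = conn.execute("""
    SELECT site.name, COUNT(*), MAX(reading.salinity)
    FROM reading JOIN site ON reading.site_id = site.site_id
    GROUP BY site.name
    ORDER BY site.name
""").fetchall()
print(rows)
```

Because the site name lives only in the site table, renaming a site means changing one row rather than hunting for every copy.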

| Question | Subject | Answer |
|---|---|---|
| How can I process it? | The Shell | Use Unix commands that manipulate lines of text. Combine those commands using pipes and redirection. Use loops to perform the same operations on many files. |
| | Version Control | |
| | Basic Programming | Write programs that use loops, file I/O, and string splitting to read data. Use floating-point numbers unless you are sure all values (including calculated values) will always be integers. |
| | Functions and Libraries | Define functions to do simple operations, then combine those for more complicated effects. Equivalently, describe what you would do in a language customized to your problem, then fill in the missing bits of code by creating functions. |
| | Databases | Write SQL queries to select, filter, aggregate, and sort data. Use a general-purpose programming language for everything else. |
| | Number Crunching | Use a linear algebra package like NumPy. |
| | Quality | |
| | Sets and Dictionaries | Use algorithms that don't depend on the order of items. |
| | Development | Use the right data structures. |
| | Web Programming | Use an HTTP library to fetch it. Use an XML or JSON library to parse it. |
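
As an illustration of the "Basic Programming" answers above, the sketch below reads records with a loop, file-style I/O, and string splitting, and converts the measured values to floating-point numbers. The comma-separated layout and the readings are assumptions made up for this example, and io.StringIO stands in for a real file.

```python
import io

# Hypothetical data: one "site,date,salinity" record per line.
# io.StringIO stands in for a file object returned by open(path).
raw = io.StringIO(
    "DR-1,1927-02-08,0.13\n"
    "DR-1,1927-02-10,0.09\n"
    "\n"
    "MSK-4,1932-01-14,0.21\n"
)

def read_salinity(stream):
    """Return a list of (site, salinity) pairs read from comma-separated lines."""
    records = []
    for line in stream:
        line = line.strip()
        if not line:          # skip blank lines
            continue
        site, _date, value = line.split(",")
        records.append((site, float(value)))  # floats, since values are not integers
    return records

records = read_salinity(raw)
mean = sum(value for _site, value in records) / len(records)
print(records, mean)
```

The small function is also an instance of the "Functions and Libraries" answer: it does one simple operation that larger analyses can combine with others.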

| Question | Subject | Answer |
|---|---|---|
| How can I tell if I've processed it correctly? | The Shell | |
| | Version Control | |
| | Basic Programming | Test your programs with small data sets whose results can be checked by hand. |
| | Functions and Libraries | |
| | Databases | Build queries in small steps. Run queries against small data sets whose output can be checked manually. |
| | Number Crunching | Compare a program's output to analytic results, experimental results, simplified test cases, and previous programs. Use tolerances when comparing results. |
| | Quality | Create simple data sets for which the right answer can be calculated by hand. Compare the results produced by the new program to results produced by older programs. |
| | Sets and Dictionaries | |
| | Development | Make code testable by dividing it into functions, and then replacing some functions with others for testing purposes. |
| | Web Programming | |
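
The advice to "use tolerances when comparing results" can be shown with a data set small enough to check by hand. This is only a sketch: the function name and the tolerance of 1e-9 are illustrative choices, not recommendations from the material itself.

```python
import math

def total(values):
    """Add values one at a time, as a simple analysis loop would."""
    result = 0.0
    for v in values:
        result += v
    return result

# By hand, 0.1 + 0.2 is 0.3, but binary floating point cannot represent
# these values exactly, so the computed total differs in the last digit.
computed = total([0.1, 0.2])
expected = 0.3

exact_match = (computed == expected)                           # False
close_match = math.isclose(computed, expected, rel_tol=1e-9)   # True
print(exact_match, close_match)
```

An exact comparison would report a failure here even though the program is correct, which is why the comparison needs a tolerance.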

| Question | Subject | Answer |
|---|---|---|
| How can I find and fix bugs when I haven't? | The Shell | |
| | Version Control | |
| | Basic Programming | |
| | Functions and Libraries | |
| | Databases | |
| | Number Crunching | |
| | Quality | Write test cases that fail when the bug is present, but pass when the bug is fixed. Add assertions to programs to check their internal consistency. Use a debugger. |
| | Sets and Dictionaries | |
| | Development | Write tests. |
| | Web Programming | |
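
Here is a small sketch of the two "Quality" answers above: assertions that check a function's internal consistency, and a test that fails while a bug is present and passes once it is fixed. The function and the bug it guards against (an imagined earlier version that divided by the number of values instead of their sum) are hypothetical.

```python
def normalize(values):
    """Scale values so that they sum to 1.0."""
    assert len(values) > 0, "need at least one value"
    total = sum(values)
    assert total != 0, "values must not sum to zero"
    scaled = [v / total for v in values]
    # Internal consistency check: the scaled values must sum to one.
    assert abs(sum(scaled) - 1.0) < 1e-9, "scaled values do not sum to 1"
    return scaled

def test_normalize_divides_by_sum():
    # Regression test for the hypothetical bug: dividing by len(values)
    # would give [0.5, 1.5] here, so this test fails until the bug is fixed.
    assert normalize([1.0, 3.0]) == [0.25, 0.75]

def test_normalize_rejects_empty_input():
    try:
        normalize([])
    except AssertionError:
        return  # the precondition caught the bad input, as intended
    raise RuntimeError("normalize([]) should have failed its assertion")

test_normalize_divides_by_sum()
test_normalize_rejects_empty_input()
print("all tests passed")
```

Functions named this way can also be collected by an off-the-shelf test runner instead of being called by hand.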

| Question | Subject | Answer |
|---|---|---|
| How can I keep track of what I've done? | The Shell | |
| | Version Control | Keep your work under version control. Check in whenever you've completed a significant change. Write meaningful check-in comments. |
| | Basic Programming | Put version control IDs in programs (and data files), and copy them forward to results. |
| | Functions and Libraries | Give functions meaningful names. Group related functions and related definitions into modules. Write docstrings to explain what functions and modules do and how to use them. |
| | Databases | Store queries in files (just like programs). |
| | Number Crunching | |
| | Quality | Turn bug fixes into assertions and test cases. Use a coverage analyzer to see what code is and isn't being tested. |
| | Sets and Dictionaries | |
| | Development | |
| | Web Programming | Use meta headers in your HTML/XML data files. |

| Question | Subject | Answer |
|---|---|---|
| How can I find and use other people's work? | The Shell | |
| | Version Control | Get it from their version control repositories. |
| | Basic Programming | |
| | Functions and Libraries | Use the help function to read their documentation. |
| | Databases | |
| | Number Crunching | |
| | Quality | |
| | Sets and Dictionaries | |
| | Development | |
| | Web Programming | Ask them to use well-formed URLs. And to format it according to well-defined machine-readable standards (e.g., XML or JSON). |

| Question | Subject | Answer |
|---|---|---|
| How can other people find and use mine? | The Shell | |
| | Version Control | Put your work in a publicly-accessible version control repository. |
| | Basic Programming | |
| | Functions and Libraries | Write docstrings to explain what functions and modules do and how to use them. |
| | Databases | Raise exceptions to signal errors so that other people can handle them as they think best. |
| | Number Crunching | |
| | Quality | |
| | Sets and Dictionaries | |
| | Development | |
| | Web Programming | Put it on the web at a stable URL. Format it according to well-defined machine-readable standards (e.g., XML or JSON). Include meta-data. |
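
One possible reading of "format it according to well-defined machine-readable standards" and "include meta-data" is sketched below with Python's json module. The field names, the metadata keys, and the readings are all hypothetical; a real project would follow whatever conventions its community already uses.

```python
import json

# Hypothetical data set: the metadata describes the records that follow it.
dataset = {
    "metadata": {
        "title": "Salinity readings (hypothetical example)",
        "fields": ["site", "date", "salinity"],
        "format_version": 1,
    },
    "records": [
        {"site": "DR-1", "date": "1927-02-08", "salinity": 0.13},
        {"site": "MSK-4", "date": "1932-01-14", "salinity": 0.21},
    ],
}

# Serialize to text that any JSON library can parse.
text = json.dumps(dataset, indent=2, sort_keys=True)

# A reader needs only a JSON parser to get the same structure back.
restored = json.loads(text)
print(restored["metadata"]["fields"])
```

Keeping the metadata in the same file as the records means the description cannot be separated from the data it describes.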

| Question | Subject | Answer |
|---|---|---|
| How can I do all these things faster? | The Shell | Put commands in shell scripts so that they can be re-used. |
| | Version Control | |
| | Basic Programming | Use appropriate variable names so that people will waste less time trying to read programs. |
| | Functions and Libraries | Learn to recognize and use common design patterns. |
| | Databases | |
| | Number Crunching | Use a linear algebra package like NumPy. |
| | Quality | Design code for testing. Write test cases before writing new code. |
| | Sets and Dictionaries | Use sets and dictionaries for sparse, irregular, or unordered data. |
| | Development | Use a profiler to figure out why code is slow before trying to optimize it. Build code so that parts can be replaced easily. |
| | Web Programming | |
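
The advice to profile before optimizing can be tried with Python's standard cProfile and pstats modules. In this sketch the two deduplication functions and the input data are invented for illustration; one uses a list for membership tests and the other uses a set, and the profiler report shows where the time actually goes.

```python
import cProfile
import io
import pstats

def slow_unique(values):
    """Deduplicate using a list: every membership test scans the list."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen

def fast_unique(values):
    """Deduplicate using a set for membership tests, keeping first-seen order."""
    seen = set()
    result = []
    for v in values:
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result

def analysis(values):
    return slow_unique(values), fast_unique(values)

# Hypothetical input: 20,000 values with only 500 distinct ones.
data = [i % 500 for i in range(20000)]

profiler = cProfile.Profile()
profiler.enable()
slow_result, fast_result = analysis(data)
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Both functions return the same answer, so only the measurement tells you which one is worth rewriting.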

In parallel with this, a group of us have been working on a paper describing best practices for computational science. The list we've converged on is:

  1. Write programs for people, not computers.
    • Programs should not require their readers to hold more than a handful of facts in memory at once.
    • Names should be consistent, distinctive, and meaningful.
    • Code style and formatting should be consistent.
    • All aspects of software development should be broken down into tasks roughly an hour long.
  2. Automate repetitive tasks.
    • Rely on the computer to repeat tasks.
    • Save recent commands in a file for re-use.
    • Use a build tool to automate scientific workflows.
  3. Use the computer to record history.
    • Software tools should be used to track computational work automatically.
  4. Make incremental changes.
    • Work in small steps with frequent feedback and course correction.
  5. Use version control.
    • Use a version control system.
    • Everything that has been created manually should be put in version control.
  6. Don't repeat yourself (or others).
    • Every piece of data must have a single authoritative representation in the system.
    • Code should be modularized rather than copied and pasted.
    • Re-use code instead of rewriting it.
  7. Plan for mistakes.
    • Add assertions to programs to check their operation.
    • Use an off-the-shelf unit testing library.
    • Turn bugs into test cases.
    • Use a symbolic debugger.
  8. Optimize software only after it works correctly.
    • Use a profiler to identify bottlenecks.
    • Write code in the highest-level language possible.
  9. Document the design and purpose of code rather than its mechanics.
    • Document interfaces and reasons, not implementations.
    • Refactor code instead of explaining how it works.
    • Embed the documentation for a piece of software in that software.
  10. Conduct code reviews.
    • Use code review and pair programming when bringing someone new up to speed and when tackling particularly tricky design, coding, and debugging problems.
    • Use an issue tracking tool.

As you can see, this list only partially overlaps the "Answers" column in the table above. That makes me nervous: when two independent attacks on a problem yield two different answers, the odds are good that neither of them is right. I trust the "best practices" list more than I do the breakdown of our existing material, which leaves me with some awkward choices. Changing the motivating questions would feel like moving the goalposts so that I can declare victory with the content I have, but on the other hand, maybe there is a better way to carve up the space of things scientists want to do that will give a better mapping. Or are there connections between our content and those motivating questions that I'm just missing? Or do we really have the wrong content, i.e., are we teaching what we know, rather than what would actually be most useful to scientists?

Stepping back for a moment, the real point of this exercise is to ensure that:

  1. we're teaching what's most useful to our learners;
  2. everything we teach makes sense, and is seen as useful, when it first appears; and
  3. learners see how ideas connect to one another and to their applications.

What we should really do is go one step further and figure out how to tell whether our learners can actually do the things embodied in our eight questions. We should then work backward from that assessment to figure out what demonstrable skills they need to acquire, then what understanding they need in order to become proficient with those skills, and then see how that maps onto our best practices. We've made a start toward this with the "driver's license" exam described in an earlier post; if you'd like to help us follow through, please get in touch.
