Revising the Curriculum
I’ve been thinking some more about what the foundation and core of Software Carpentry actually are (and not just because Jon Pipitone keeps pestering me to do so). My last attempt had a foundation of seven principles and a dozen topics in the core. I think I can slim that down even further; in fact, I think three big principles form the foundation of computational thinking:
- It’s all just data, whose meaning depends on interpretation. This subsumes the notions that programs are a kind of data (which is the basis of things as diverse as functional programming and version control), and that we should separate models from views (because the most efficient ways for people and computers to interpret data are different). It doesn’t really include the distinction of copy vs. reference, but I’m going to lump it in here because that idea doesn’t seem big enough to deserve a heading of its own.
- Programming is a human activity. The only way to build large programs (or even small ones) is to create, compose, and re-use abstractions, because our brains can only understand a few things at a time. Similarly, good technique (specifically version control, testing, task automation, and some rules for collaborating, be they agile or sturdy) is necessary because everyone is tired, bored, or out of their depth at least once in a while.
- Better algorithms are better than better hardware. Computational complexity determines what’s doable and what isn’t, and no aspect of program performance makes sense without some understanding of it.
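The copy-versus-reference distinction lumped into the first principle above is easier to see in code than in prose. What follows is only a minimal sketch in Python; the variable names and the sample numbers are made up for illustration and are not part of the curriculum itself.

```python
import copy

readings = [[1.0, 2.0], [3.0, 4.0]]

alias = readings                  # a second name for the same list object
shallow = list(readings)          # new outer list, but the inner lists are shared
deep = copy.deepcopy(readings)    # fully independent copy

readings[0][0] = 99.0             # change an inner value in place
readings.append([5.0, 6.0])       # change the outer list in place

print(alias is readings)                      # True
print(len(alias), len(shallow), len(deep))    # 3 2 2
print(shallow[0][0], deep[0][0])              # 99.0 1.0
```

The alias sees every change, the shallow copy sees only changes to the shared inner lists, and the deep copy sees none of them.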
I also think we can reduce the core topics to just nine, though I can already hear protests from the back of the room about some of the omissions. I got this list by asking, “What’s the minimum I think a graduate student needs to know to contribute to the computational work in a typical lab?” My answer is:
- The Unix shell
- Includes: basic commands (from ls and cd to sort and grep); files and directories; the pipe-and-filter model.
- Because: it’s still the workhorse of scientific computing (and is experiencing a resurgence as cloud computing becomes more popular).
- Illustrates: “lots of little pieces loosely joined” is a good way to introduce modularity and tool-based computing; it lets us talk about the human time vs. machine time tradeoff.
- Omissions: find; shell scripts (particularly for loops); SSH.
- Version control
- Includes: update/edit/commit; merge (with rollback as a special case).
- Because: it’s a key technique.
- Illustrates: the idea of metadata; programming as a human activity (the hour-long red-green-refactor-commit cycle).
- Omissions: branching; distributed version control.
- The common core of programming
- Includes: variables; loops; conditionals; lists; functions; libraries; memory model (aliasing).
- Because: we can’t teach validation, associative data structures, or program design without this common core.
- Illustrates: programming as a human activity (programs must be readable, testable, etc.).
- Omissions: object-oriented programming; matrix programming.
- Validation
- Includes: structured unit tests; test-driven development; defensive programming; error handling; data validation.
- Because: defense in depth is key to building large programs, and trustworthy programs of any scale.
- Illustrates: trustworthy programs come from good technique.
- Omissions: testing floating-point code (since we don’t really know how to).
- Program construction
- Includes: piecewise refinement; refactoring; design for test; first-class functions; using a debugger.
- Because: knowing the syntax of a programming language doesn’t tell you how to create a program.
- Illustrates: creating and composing abstractions; interface vs. implementation.
- Omissions: structured documentation.
- Associative data structures
- Includes: sets (as a prelude); dictionaries; why keys must be immutable.
- Because: useful in so many places.
- Illustrates: how the right algorithms and data structures make programs more efficient.
- Omissions: implementation details.
- Databases
- Includes: select; sort; filter; aggregate; null; join; accessing a database from a program.
- Because: useful in many contexts.
- Illustrates: separation of models and views; a different model of computation.
- Omissions: sub-queries; object-relational mapping; database design.
- Note: we could illustrate many of the same ideas with spreadsheets, but they’re not as easy to connect to programs.
- Development methodologies
- Includes: agile practices (the usual Scrum+XP mix); sturdy (plan-driven) lifecycles.
- Because: ties many other lessons together.
- Illustrates: good technique makes good programs.
- Omissions: code review.
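To make the “why keys must be immutable” item under associative data structures concrete, here is a small Python sketch. The records and the helper name count_sightings are hypothetical. The point is only that a dictionary finds a key through its hash, so a value that could be changed in place afterwards is rejected as a key.

```python
def count_sightings(records):
    """Count how often each (species, site) pair appears."""
    counts = {}
    for species, site in records:
        key = (species, site)      # tuples are immutable, so they can be keys
        counts[key] = counts.get(key, 0) + 1
    return counts


records = [("owl", "north"), ("heron", "south"), ("owl", "north")]
counts = count_sightings(records)
print(counts[("owl", "north")])    # 2

# A list can be changed in place, so Python refuses to hash it.
try:
    counts[["owl", "north"]] = 1
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
print(list_key_allowed)            # False
```

Each lookup or update goes straight to the entry for its key instead of scanning every stored record, which is the efficiency point listed under “Illustrates” for this topic.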
If we use a two-day boot camp to start, and follow up over six weeks with one lesson per week, I think we can cover:
| | Topic | Boot Camp | Online |
| --- | --- | --- | --- |
| 1. | Unix shell | ls and cd; files and directories | sort and grep; pipes |
| 2. | Version control | update/edit/commit; merge | rollback |
| 3. | Core programming | all of it (but see below) | not needed (but see below) |
| 4. | Validation | unit tests; TDD | defensive programming; error handling; data validation |
| 5. | Program construction | One extended example; one demo of a debugger | More examples; design for test; first-class functions |
| 6. | Associative data structures | none | everything |
| 7. | Databases | none | everything |
| 8. | Development methodologies | overview of agile | sturdy (plan-driven) lifecycle; evidence-based software engineering |
Topic #3, core programming, is the hardest to manage. If people have programmed in Python before, it can be a quick review (or omitted altogether). If they’ve programmed in some other interactive language, it can also be covered pretty quickly, but if they’ve never programmed before, or took one freshman course ten years ago, there’s no way to teach them enough to make a difference in half a day. Even if there were, the other learners would undoubtedly be bored. The only solutions I can see are to restrict participation to people who can already do a simple exercise in some language, or to run one day of pre-boot camp training for non- or weak programmers. Neither option excites me…
Coming back to content, this plan means that we’ll leave out a lot of useful things:
- Spreadsheets: lots of scientists use spreadsheets badly, but while we’d like to show them how to do so well, the only one they actually use, Excel, isn’t open source or cross platform, and it’s much harder to build programs around spreadsheets than around databases.
- Make: is very hard to motivate unless people are working with compiled languages—we’ve tried showing people how to build data pipelines using Make, but it’s too clumsy to be compelling. Plus, Make’s syntax makes a hard problem worse…
- Systems programming: knowing how to walk directory trees and/or run sub-processes is useful, but we think people can pick these up on their own once they’ve mastered the core.
- Matrix programming: really important to some people, irrelevant to others, and the people it’s important to will probably have seen the ideas in something like MATLAB before we get them.
- Multimedia programming (images, audio, and video): people can learn basic image manipulation on their own; audio and video are harder, mostly due to a lack of documentation, but they aren’t important enough to enough people to belong in our core.
- Regular expressions: are a great way to illustrate the idea that programs are data, and are very useful, but everything in the core seems more important, and it’ll be hard enough to get through all that in the time we have. This is probably the one I most regret taking out…
- HTML/XML: there are lots of excellent tutorials on writing HTML, and while XML processing is a good way to introduce recursion (and, if XPath is included, to talk about programs as data once again), I believe once again that it’s not important enough to displace any of the material in the core.
- Object-oriented programming: is probably the omission that raises the most eyebrows. We can introduce it fairly naturally when talking about design for test (more specifically, about interface vs. implementation), but in practice, most people get along fine using lists, dictionaries, and the classes that come with the standard library without creating new classes of their own. Plus, showing people how to do OOP properly takes a lot more time than just showing them how to declare a class and give it methods.
- Desktop GUIs: an excellent way to introduce reactive (event-driven) programming and program frameworks, but is less important than it was ten years ago (most people would rather have a web interface these days).
- Web programming: the only thing we can teach people in the time we have is how to create security vulnerabilities.
- Security: the principles are easy to teach, but translating them into practice requires more knowledge (especially of things like web programming) than we can assume our learners have.
- Visualization: everybody wants it, but nobody can agree what it means. Should we show people how to use a specific library to create 3D vector flows? Or the principles of visual design so that they can make nicer 2D charts? And no matter what we teach, will they actually learn enough to make a difference?
- Performance and parallelism: the most important lesson, which is in the core, is that the right data structures and algorithms can speed programs up more than any amount of code tuning. Everything after that is either inextricably tied to the specifics of a particular language implementation (performance tuning), or offers no low-hanging fruit (parallelism). The one exception is job-level parallelism, which could be included in the material on the Unix shell if an appropriate cross-platform tool could be found.
- C/C++, Fortran, C#, or Java: mostly useful for introducing fixed typing and compilation, but these are relatively low-priority topics.
We’re going to start implementing this plan (or some derivative of it) at the beginning of February, to be ready for workshops starting at the end of that month. We’d welcome feedback; in particular, have we taken something out of the core that you think is more important than something that’s in, and that could be taught in the time that’s actually available? If you have thoughts, please let us know.

I know I’m referring to a single point in your rather extensive post, but would you have more info/links on how to build data pipelines? I’m curious as to how other people do this, or whether there are “best practices” for it. I can’t find anything obvious on the internets.
thanks
Two thoughts –
spreadsheets == csv, and operating on csv is a powerful tool. It could be part of the “core programming” tutorials, to load a csv file into a list and extract information.
For job-level parallelism, how about GNU Parallel? I’ve been impressed. I think it’s cross-UNIX.
@Vincent Noel It depends what kind of pipelines you’re interested in. Some people write/use a bunch of command-line Unix tools and pipe them together (as in “preprocess mydata.txt | filter -x -y 178 | reframe --epsilon 0.05 | magnify | data2csv > myresult.txt”). At the other end there are tools like Taverna, though you may want to read Titus Brown’s posts http://ivory.idyll.org/blog/dec-11/data-intensive-science-and-workflows.html and http://ivory.idyll.org/blog/dec-11/four-reasons-i-wont-use-your-data-analysis-pipeline.html before committing to one of them. Does this answer your question?
@Titus Brown Respectfully, spreadsheets are *not* just CSV — spreadsheets have formulas, and when working with them programmatically, you want the formulas re-evaluated dynamically. That said, yes, reading CSV is enough to get data out if all you want to do is read, so that could be part of the core. I’m probably storing my spreadsheet as .xls, not as .csv, and I don’t want to have to manually export each one (that would defeat the purpose of automation), but John Machin and Chris Withers have built a library that might help here (https://secure.simplistix.co.uk/svn/xlwt/trunk/xlwt/doc/xlwt.html).
Regarding job-level parallelism, a couple of other people have pointed me at GNU Parallel as well — I’ll check it out. (I think that as soon as we’re talking “jobs” and “parallel”, we can assume people are using Linux to do the work — is that safe?)
I really like the core designated here and I think there’s a real value to having a focused, well defined core (while also allowing folks to fill in other areas that they think are important as separate “contributed modules”). The choices that have been made fit in well with my experience of what students need and want, and my recommendation would be to keep the core as proposed.
That said, I would recommend making a change to what is included vs. not included in the bootcamps. Part of what the bootcamps need to achieve is buy in from the students so that they are sold enough on the value of SWC (and trust the judgment of its creators sufficiently) that they are willing to spend the time and effort working online. The challenge to this is that the proposed bootcamps currently include 2 topics that in my (admittedly limited) experience are rather difficult for students to immediately grasp the value of for their own work, the shell and testing, and leaves out one that they all gravitate to and pick up quickly, databases (specifically SQL). I think that for maximum buy in we can afford to introduce either testing or the shell, but not both. I would personally recommend testing since it benefits from repeated exposure and if students need the shell they will be forced to sit down and learn it (whereas testing they could just ignore). This also eliminates the need for Cygwin installs, which in my experience often run into problems at some point. The one that is dropped could be replaced with databases, where the students can learn a lot in a couple of hours and which they are generally really excited about. This then builds momentum for them spending the time online to learn tools that they might not be as initially excited about.
I hear what you say about Make, but what about build systems more broadly (CMake, Autotools)? This seems like something that has value for checking dependencies, etc, of software that people may want to share/distribute. I think there may be different paradigms for Python code?
Also, what about third party libraries and other dependencies? Not specific examples, but the tools and skills to determine what you need and build/install them as necessary?
I’ve never used CMake in anger; as powerful as GNU Autotools might be, it’s awful to learn. I’d like to use something modern like SCons, but it’s hard to motivate “build” tools for a dynamic interpreted language that doesn’t need compilation, and other use cases (e.g., the one in the current tutorial, which regenerates results when data files change or new ones appear) feel forced.
Regarding third party libraries and dependencies, I don’t know what “tools and skills” will help people determine what they need; I’d welcome suggestions (preferably in the form of a syllabus).
You don’t need to be respectful; good point, CSV != XLS. But working with csv files is a good gateway, and it’s easy.
I didn’t realize for “databases” you meant “SQL databases”. I think there’s a LOT of value in basic things like pickle; relatively few people have sufficiently large data sets that SQL is the right first choice. SQL introduces a lot of complexity for relatively little gain in sci comp, IMO.
I recently pointed a local student at the combination of CSV, dicts, pickle, and a simple query API. I told her that once it got too slow, she should think about SQLite. Since her data is all hand-entered, though, I doubt it will ever be an issue.
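As a rough illustration only, the combination of CSV, dicts, pickle, and a simple query API mentioned above might look something like the following in Python. The column names, the sample rows, and the helpers load_rows and query are all hypothetical stand-ins, not the code the student actually used.

```python
import csv
import io
import pickle

# Stand-in for a hand-entered CSV file; the columns are hypothetical.
RAW = """sample,site,mass_g
a1,north,12.5
a2,south,9.0
a3,north,14.0
"""


def load_rows(text):
    """Read CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def query(rows, **conditions):
    """Return the rows whose fields equal every given condition."""
    return [r for r in rows
            if all(r[field] == value for field, value in conditions.items())]


rows = load_rows(RAW)

# Round-trip through pickle, as one would when saving to disk and reloading.
restored = pickle.loads(pickle.dumps(rows))

north = query(restored, site="north")
print([r["sample"] for r in north])    # ['a1', 'a3']
```

Every value read this way is a string, so numeric columns such as mass_g would need converting before any arithmetic.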
Yes, SQL databases: I use other things, but it’s what students are most likely to bump into outside this training. We talk about things like pickle in the essay on persistence, but the learners I’ve spoken to have mostly found it beyond them (i.e., they can use the library to do simple things, but don’t really understand how it works).
Data pipelines++. I’m currently writing something up in a Makefile format. Hate it. But way more awesome than the alternatives, AFAICT.
@Ethan White One reason to include the shell is that it allows us to do version control from the command line, rather than needing GUI clients. We’ve found that the former is easier than the latter (even factoring in Cygwin headaches on Windows). And while we can assume prior exposure to the shell on the part of astronomers and physicists (usually), it’s less common in the bio sciences, and very much less common in psych, linguistics, etc.
Do other people have thoughts on this?
Miscellaneous reactions…
Given the two-day format, I think it is most productive to focus on getting people started in the right direction so they can teach themselves. I know Greg has said this in several venues. They will mainly be motivated by seeing how something will be useful for their own work, rather than just as an abstract “you’ll thank me later” exercise. It is the best feeling when people say “I wish I had known about this last week!”
So that said, an intro to the shell, Python, and databases is really important. Other things like pipelines and parallelism are only going to follow from that basic foundation. To me, skipping the shell is not really an option, because they have to at least understand the relationship between files in their GUI and at the command line to generate pipelines and workflows.
It would be great to show intro to R as an alternative to spreadsheets in many cases. I also have a soft spot for regular expressions and graphical concepts (technical, not stylistic).
We have had many discussions about how to handle Make, etc. It is often what drives people to the command line, but the playing field is so open that it is hard to teach The One Thing that will suddenly make their lives easier.
The current scenario is to do databases entirely in the follow-up? It might be good to at least get them up and running on that as well during the bootcamp.
Thanks for the feedback. I’d be happy to move databases into the initial two days, but what should we delay to the online portion to make room?
@Titus Brown GNU Parallel is designed to be cross-UNIX, even on older UNIX versions. It has also been reported to work on Windows. An easy intro is to watch the intro videos: http://www.youtube.com/watch?v=OpaiGYxkSuQ&list=PL284C9FF2488BC6D1
@Greg Wilson
Yeah, that’s great, thanks for the pointers.
Each step requires quite a bit of computing time, so it’s best if I can run steps independently (i.e., without needing to run the entire pipeline every time). I was curious whether something more formalized existed.
Personally, I segment the analysis process into several steps, each taking as input the output from the previous step, and link them using folder filesystem links (the steps are rather decoupled). Rather primitive.
@Greg Wilson
I think that something like joblib http://packages.python.org/joblib/ would fit my needs best. It’s an effort led by Gael Varoquaux, who worked with Enthought for a while (I think).
One point that I really like in your post is the assertion that “It’s all just data”. In the drive to put together a “perfect” curriculum we sometimes forget the people who come to Software Carpentry not because they want to learn to program but because they need to program to manage and analyse data, and they’d like to do it better.
So there’s the incentive to learn, but the hook is that it makes them do their day job better.
As a result I think there’s a case to be made to put some basic data handling and data modelling techniques in the boot camp, to encourage people to continue online. Then what do you drop? I haven’t really come up with a good answer to this one – my gut feeling is that it might be either TDD or Agile, but then again early exposure to these would seem to be more tractable in the long-run…
Also I hear the comments about both spreadsheets and visualisation – having previously taught courses on scientific visualisation, it’s at least a day to get across the core concepts, and yet knowledge of the basics of both would stand you in good stead in most labs.
I agree that both TDD and Agile could be dropped (for some reason I didn’t see that Development Methodologies were in the bootcamps before and it would be at the top of my list for replacing) in favor of something data centered, which for me would be databases.
Maybe terms of art just differ field to field, but what you’re calling validation here is what people in the modelling and simulation areas would call verification, and not validation.
the traditional CS attitude is that algorithms trump architecture (because quicksort is so much more efficient than bubblesort, etc). algs are fine, but unless you know that memory is hundreds or thousands of cycles away from the CPU, even nicely-scaling algs are not good. you’ve really got to have an intuitive sense of the inherent timescale of operations. yes, that’s very hardware-specific, and has a half-life of a few years, but it’s also fundamental.
at a rather different level, I think use of ssh should be required as well. not just “ssh me@host”, but the places where it really gives you leverage, such as sshfs, ssh-agent+keys and port forwarding.
I agree that architecture and SSH are useful — but the rule is, if you want to add something, you have to identify something that’s less important that we can take out. What would you pick?