Our Long Tail
Frequency a page is viewed (as a percentage of total views) vs. pages (ordered by frequency); data taken over the last 90 days. I guess there’s something to this “long tail” stuff after all.
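For anyone who wants to draw the same kind of curve from their own logs, here is a minimal sketch of the calculation behind it. It assumes the log is simply a list with one entry per page view; the page names and the `view_shares` helper are made up for illustration and are not how the chart above was actually produced.

```python
from collections import Counter


def view_shares(page_views):
    """Return (page, percent_of_total_views) pairs, most-viewed page first."""
    counts = Counter(page_views)
    total = sum(counts.values())
    return [(page, 100.0 * n / total) for page, n in counts.most_common()]


# Hypothetical log: one entry per page view.
log = ["home"] * 6 + ["lessons"] * 3 + ["about"]
shares = view_shares(log)
for page, pct in shares:
    print(f"{page:10s} {pct:5.1f}%")
```

Plotting the percentages in that order gives the frequency-vs-rank curve described in the caption.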
I’m still gnawing on the problem of how to construct content for 21st Century learning—or, more prosaically, what I should use to build the next version of Software Carpentry. My starting point is the need to serve several different kinds of users [1]:
Their learning options today are:
I’m sure some of the above is inconsistent or just plain wrong, but here are my takeaways:
OK, so how well do today’s tools and/or formats do by these measures? The fact that “PowerPoint” is both a tool and a format is one indication that the answer is going to be, “Not well.”
The fact is, most presenters continue to use PowerPoint (or something similar) because it makes it easy to create a reasonably good presentation in a reasonable amount of time [5]. HTML slideshow packages fail this test: authors must sacrifice the quality of the presentation (e.g., skip graphics, or embed segregated graphical files), and do a lot of non-content typing (tags, page IDs, and so on).
So after all of this, what do I actually want?
meta and data-* vocabulary for educational metadata, like dependencies, introduction of key terms, questions and answers, and so on. I want a similar vocabulary for commenting and other social interactions that plays nicely with things like the Salmon protocol.
Freely mixing drawings and text feels like it ought to be doable today: we could either put the text in blocks inside a canvas element, or layer a transparent canvas over the page and dynamically resize it. Anchoring drawings to the underlying text (e.g., keeping the arrow from a term to the corresponding bit of the diagram in the right place) is “just” Javascript (for some value of “just”). Making it all WYSIWYG is just more Javascript [10].
But animation… Ah, that’s a big one. It’s an intrinsically hard problem, but canned effects can do a lot to put simple things within reach [11]. The big question is, how far do we push it? If I want to show you how to use a debugger, or how to draw something with a painting program, I can’t re-create the whole UI—I’m going to have to record pixels off a screen.
Or am I? I know this is never going to happen—we’re not that organized a species—but just imagine what the world would be like if every interface was built using HTML5 and CSS. Any tool at all could export widget descriptions and a semantic trace of what they did (i.e., “the file menu was pulled down” rather than “the cursor moved to pixel (132,172) and the user clicked”), and any other tool could consume it and play it back. The consuming tool might draw the widgets differently, or display the interactions in its own way, but that would be exactly the same as applying a different skin to the original tool [12].
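To make the idea of a semantic trace a little more concrete, here is a rough sketch of what one might look like. Everything in it is an assumption made for illustration: the `UiEvent` record, the widget and action names, and the JSON layout are all hypothetical, not an existing format. The point is only that the trace records what happened to which widget, and that the consumer decides how to present it.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UiEvent:
    """One semantic step: what happened to which widget (no pixel coordinates)."""
    widget: str
    action: str
    detail: str = ""


def export_trace(trace):
    """Serialize a trace so that another tool could consume it."""
    return json.dumps([asdict(event) for event in trace])


def import_trace(text):
    """Rebuild a trace from its serialized form."""
    return [UiEvent(**fields) for fields in json.loads(text)]


def describe(event):
    """Default 'skin': a readable sentence fragment per event."""
    text = f"{event.action} {event.widget}"
    return f"{text} ({event.detail})" if event.detail else text


def terse(event):
    """An alternative 'skin' that displays the same events differently."""
    return f"{event.widget}:{event.action}"


def replay(trace, render=describe):
    """Play a trace back using whichever renderer the consuming tool prefers."""
    return [render(event) for event in trace]


trace = [
    UiEvent("file-menu", "open"),
    UiEvent("file-menu", "select", "Save As"),
    UiEvent("filename-field", "type", "results.csv"),
]

received = import_trace(export_trace(trace))
print(replay(received))
print(replay(received, render=terse))
```

Swapping `describe` for `terse` is the toy version of applying a different skin: the recorded interaction is unchanged, only its presentation differs.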
Returning to this universe for a moment, we can store things as HTML5 right now—I’m already using it for Version 5 of Software Carpentry. I could create a vocabulary for instructional metadata, but I’m not an information architect. WYSIWYG authoring tools for HTML5 abound, though the HTML5 they produce can be idiosyncratic (and doesn’t play nicely with version control, but that’s fixable). I haven’t seen a WYSIWYG tool that supports freehand drawing mixed freely with text, or one that supports parallel content streams, but I think half a dozen people working could deliver something substantial in half a dozen months [13].
As for animation, I think we’re stuck with video for now: prototyping an HTML5/SVG/Javascript animation framework for use in a learning tool would be a great research project, but we really do need to build a couple to throw away to find out if it’s workable. If you’d like to tackle it, please let me know; I’d be happy to be your alpha tester.
Notes
[1] There was a lot of talk in the 1980s and 1990s about different people having different learning styles, inspired in part by Gardner’s theory of multiple intelligences. The idea has mostly been discredited, but like many memes, it lives on in popular culture.
[2] Although I bet someone’s working on an Emacs mode to do that…
[3] I’ve actually done this, so I know whereof I speak.
[4] Except that LaTeX and wiki text require slightly less typing than HTML, but if you’re using a smart editor, even that advantage goes away.
[5] Please don’t quote Tufte’s complaints about PowerPoint at me—I don’t think it encourages bad presentations any more than the tangled rules of English spelling and grammar encourage bad writing.
[6] In particular, almost all video content makes life harder for the visually impaired: a screencast in which someone talks over themselves typing in an editor or sketching on a tablet is tantalizing but useless to someone who can’t see the pixels. I committed this sin when I created Version 4 of Software Carpentry; I’d like to do better in Version 5, and would like to see high-profile online learning sites make some kind of effort as well.
[7] But wait a second: if video isn’t effective, why do MIT Open Courseware and the Khan Academy work so well? The short answer is, they mostly don’t: if you take out the 15% of people who can learn almost anything, no matter how it’s presented, watching videos and doing drill exercises works less well than other options. The longer answer is, watching a good teacher (and Khan is a great teacher) work through a problem, instead of just presenting the answer, moves the content into the “how to” category that video is well suited to.
[8] Research dating back to the early 1990s shows that higher-quality material improves student retention. I don’t know whether it improves it enough to justify its higher production costs, though.
[9] HTML5 will also help with version control, since I expect HTML5-aware diff-and-merge tools to start appearing Real Soon Now. Of course, I’ve been saying that for almost ten years…
[10] These days, you can wave away almost any technical objection with “it’s just more Javascript”.
[11] In my mind, the animation interface looks more like Scratch than it does like PowerPoint’s menus and dialogs. It definitely doesn’t require people to type in code, unless they want to create and share an entirely new kind of animation effect.
[12] We could even call that format XUL…
[13] “6×6” is as big a team/timescale as I’m able to contemplate these days.
I’m trying to be systematic about re-designing the core curriculum of Software Carpentry. So far, I’ve identified 11 common questions:
| Q01: | How can I write a simple program? |
| Q02: | How can I make the program I’ve written easier to reuse? |
| Q03: | How can I reuse code that other people have written? |
| Q04: | How can I share my work with other people? |
| Q05: | How can I keep track of what I’ve done? |
| Q06: | How can I tell if my program is working correctly? |
| Q07: | How can I find and fix bugs when it isn’t? |
| Q08: | How can I get data into my program? |
| Q09: | How can I manage my data? |
| Q10: | How can I automate this task? |
| Q11: | How can I make my program faster? |
whose answers depend on three fundamental principles:
| F01: | It’s all just data. |
| F02: | Programming is a human activity. |
| F03: | Better algorithms are better than better hardware. |
These break down into 11 more specific principles:
| P01: | Code is just a kind of data. |
| P02: | Metadata makes data easier to work with. |
| P03: | Separate models and views. |
| P04: | Trade human time for machine time and vice versa. |
| P05: | Anything that’s repeated will eventually be wrong somewhere. |
| P06: | Programming is about creating and composing abstractions. |
| P07: | Programming is about feedback loops at different timescales. |
| P08: | Good programs are the result of making good techniques a habit. |
| P09: | Let the computer decide what to do and when. |
| P10: | Sometimes you copy, sometimes you share. |
| P11: | Paranoia makes us productive. |
which in turn translate into 11 recommendations:
| R01: | Use the right algorithms and data structures. |
| R02: | Use a version control system. |
| R03: | Automate repetitive tasks. |
| R04: | Use a command shell. |
| R05: | Use tests to define correctness. |
| R06: | Reuse existing code. |
| R07: | Design code to be testable. |
| R08: | Use structured data and machine-readable metadata. |
| R09: | Separate interfaces from implementations. |
| R10: | Use a debugger. |
| R11: | Design code for people to read. |
Here’s how I see all this mapping onto the curriculum (assuming we replace agile development with number crunching):
Comments and suggestions would be very welcome.
Based on the feedback we’ve received so far (both as comments and by email), it looks like we should take development methodologies (i.e., agile development) out of the core curriculum and replace them with two hours on:
I am quite arbitrarily limiting options to those five. Please cast your vote (one vote, not three out of five) in comments. We’d be grateful if you could include a brief explanation as well.
Once again, Cameron Neylon explains things much better than I ever could:
“The impact factor of a journal is a better predictor of the chances of a paper being retracted than…of the number of citations.”
One of the things we need to do in the next six months along with running boot camps and updating our online content is to create some sort of badging to recognize people’s skills and contributions. As we said in the proposal to the Sloan Foundation, “A badge program will provide near-term incentives for both learning and mentoring; a framework to support viral, peer-driven engagement with the program; and facilitate recognition by partner institutions and potential employers.”
We’re going to rely on Mozilla’s Open Badges project to handle the mechanics of storing and validating badges, so we only have three questions to answer:
The obvious answer to the first (and most important) question would be, “You get a badge for completing the core curriculum.” However, one of the purposes of badging is to provide a finer-grained inventory of people’s knowledge and skills, so there’s an argument to be made for giving one badge per topic, e.g., a version control badge, a Unix shell badge, a basic imperative programming badge, and so on. The argument for is that their meaning will be clearer: if I say, “Jane knows the basics of Subversion,” that’s more immediately understandable than, “Jane has completed the core of Software Carpentry.” The argument against is that if someone has collected two hundred small badges, we’re going to aggregate them anyway (“Jane knows basic software development skills”), so why not just do that in the first place?
I’ve gone back and forth on this, but currently think that one badge for the core curriculum (“Basic Software Carpentry”) will work best. We will offer two other badges as well: one for organizing a boot camp, and one for contributing a medium-sized chunk of content (on the scale of one 5-minute video episode).
Having decided that, the next challenge is to determine when someone has earned a particular badge. The “Boot Camp Organizer” and “Content Contributor” badges are straightforward; telling when someone has mastered the core skills is not. We can tell that you’ve attended the boot camp and viewed the videos, but how can we tell how much you’ve actually learned? “Solve this problem and email us the result” isn’t good enough: you could get someone to do it for you [1], and even if you’re honest, we can’t tell how quickly you did it, how many blind alleys you went down, how often you did something in ten steps instead of one, and so on. In the short term, I think the solution is to do assessment in real time using desktop sharing, i.e., you share your desktop with me, I give you the problem to solve, and I watch you do it. This won’t scale to hundreds or thousands of learners, but it’ll get us through the next six months.
What will badges look like? A badge is just a small PNG file with a digital signature embedded in it (it’s a neat little hack), so the graphic design is up to us. I like our current logo, but (a) it doesn’t size down well, and (b) I’ve been wanting to redesign it anyway, since the blue-to-white fade in the background doesn’t print well on t-shirts, coffee mugs, and other media. In keeping with our carpentry theme (“We’re not teaching people how to build the Channel Tunnel, we’re teaching them how to hang drywall”), I’d like an image that combines tools like hammers and saws with something like 1s and 0s to represent software, but I’m a lousy graphic designer. If any of our readers would like to take a crack at it, please let me know.
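For the curious, here is a rough sketch of how extra data can ride along inside a PNG. It is not the Open Badges procedure itself: it only shows the general trick of adding a text chunk to the file, which image viewers ignore. The `badge` keyword, the example URL, and the helper names are all hypothetical, and a 1x1 image built in code stands in for real artwork.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, then a CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


IEND = make_chunk(b"IEND", b"")


def tiny_png():
    """Return a valid 1x1 8-bit grayscale PNG to stand in for badge artwork."""
    header = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    pixels = zlib.compress(b"\x00\x00")  # filter-type byte plus one black pixel
    return PNG_SIGNATURE + make_chunk(b"IHDR", header) + make_chunk(b"IDAT", pixels) + IEND


def iter_chunks(png):
    """Yield (type, data) for each chunk in a PNG byte string."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def embed_text(png, keyword, text):
    """Insert a tEXt chunk (keyword, NUL, text) just before the final IEND chunk."""
    if not png.endswith(IEND):
        raise ValueError("expected the PNG to end with an IEND chunk")
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return png[:-len(IEND)] + make_chunk(b"tEXt", payload) + IEND


def read_text(png, keyword):
    """Return the text stored under keyword, or None if there is none."""
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return value.decode("latin-1")
    return None


# Hypothetical payload: a pointer to where the badge's details live.
baked = embed_text(tiny_png(), "badge", "https://example.org/assertions/jane.json")
print(read_text(baked, "badge"))
```

The image still displays normally after the chunk is added, which is why the graphic design stays entirely up to us.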
Finally, and most importantly, how can we get existing institutions—specifically universities—to recognize badges in some way? As much as we’d like people to value skills for their own sake, everyone is always busy, and always has more to do than time to do it. Can we persuade a few schools to list badges as non-credit items on students’ transcripts (just as they might presently list a short course in presentation skills or entrepreneurship that doesn’t count toward degree requirements, but required some work on the student’s part)? If so, it would give people an extra incentive to complete the core curriculum, organize a workshop, or create some content for us, particularly in a tight job market where every small distinction counts.
[1] It’s unlikely that someone would cheat on a Software Carpentry exercise, but in general, if badges take off and actually start to matter, the people who sell college students essays on Steinbeck at $30 a pop will start offering to write their online exams for $50 each.
I’ve been thinking some more about what the foundation and core of Software Carpentry actually are (and not just because Jon Pipitone keeps pestering me to do so). My last attempt had a foundation of seven principles and a dozen topics in the core. I think I can slim that down even further; in fact, I think three big principles form the foundation of computational thinking:
I also think we can reduce the core topics to just nine, though I can already hear protests from the back of the room about some of the omissions. I got this list by asking, “What’s the minimum I think a graduate student needs to know to contribute to the computational work in a typical lab?” My answer is:
ls and cd to sort and grep); files and directories; the pipe-and-filter model.
find; shell scripts (particularly for loops); SSH.
If we use a two-day boot camp to start, and follow up over six weeks with one lesson per week, I think we can cover:
| | Topic | Boot Camp | Online |
| 1. | Unix shell | ls and cd; files and directories | sort and grep; pipes |
| 2. | Version control | update/edit/commit; merge | rollback |
| 3. | Core programming | all of it (but see below) | not needed (but see below) |
| 4. | Validation | unit tests; TDD | defensive programming; error handling; data validation |
| 5. | Program construction | One extended example; one demo of a debugger | More examples; design for test; first-class functions |
| 6. | Associative data structures | none | everything |
| 7. | Databases | none | everything |
| 8. | Development methodologies | overview of agile | sturdy (plan-driven) lifecycle; evidence-based software engineering |
Topic #3, core programming, is the hardest to manage. If people have programmed in Python before, it can be a quick review (or omitted altogether). If they’ve programmed in some other interactive language, it can also be covered pretty quickly, but if they’ve never programmed before, or took one freshman course ten years ago, there’s no way to teach them enough to make a difference in half a day. Even if there was, the other learners would undoubtedly be bored. The only solutions I can see are to restrict participation to people who can already do a simple exercise in some language, or to run one day of pre-boot camp training for non- or weak programmers. Neither option excites me…
Coming back to content, this plan means that we’ll leave out a lot of useful things:
We’re going to start implementing this plan (or some derivative of it) at the beginning of February, to be ready for workshops starting at the end of that month. We’d welcome feedback; in particular, have we taken something out of the core that you think is more important than something that’s in, and that could be taught in the time that’s actually available? If you have thoughts, please let us know.
We just wrapped up the first boot camp of 2012 at the Space Telescope Science Institute. Fourteen scientists with a wide variety of computational backgrounds spent two days learning about testing, version control, program structure, the basics of Python, and the psychology of learning and programming. We’re following up with six weeks of online material, partly because that’s what fits everyone’s schedules, and partly to see whether a blended approach works better than either strategy on its own.
And on a completely different topic, this diagram from the Discover magazine web site sums up every scientist-vs-journalist debate ever:

I’ve been teaching scientists to program since 1998 (or 1986, if you want to start with my first lunch-and-learn for grad students in physics at the University of Edinburgh). Technology has advanced by leaps and bounds in that time, but I don’t think it’s any easier than it used to be to get basic software skills into people’s heads. What makes it hard?
Programming is intrinsically difficult. It’s fashionable to claim otherwise, but abstract thinking is a fairly recent innovation in evolutionary terms, and our brains still find it hard. On the other hand, I don’t believe that state machines and data transformations are any harder than high school algebra, and everyone we’re trying to help has long since mastered that.
Today’s languages and tools make it more difficult. Setup (particularly installation) is, if anything, harder than it was twenty years ago, and even the cleanest languages are full of accidental complexity (particularly in their libraries). (And if you think otherwise, try running a programming workshop for non-programmers working on half a dozen different operating systems, with two or three slightly different versions of your favorite language installed, and then post your dissenting comment.) It’s heartening to see that people are finally reviving research from the 1970s and 1980s into the usability of programming languages, but as we found out the hard way, it will be a long time before computer “scientists” start accepting scientific answers to these questions, much less acting on them.
Our students’ diverse backgrounds make teaching more difficult too. Our recent workshop at the University of Toronto had students from linguistics through chemistry to astronomy. Some of them had never used a command shell before; others were their labs’ unofficial sys admins, and we saw similar variation in almost every other aspect of their computing knowledge. The solution, of course, would be to divide them into levels by topic, but—
We don’t have resources to teach widely or deeply. Tens of thousands of people could teach scientists and engineers basic computing skills [1], but we have no way to reach them—yet. One of our goals for the next six months/five years is to increase the pool of instructors by several orders of magnitude [2]. Even on a five-year timescale, though, we’ll have to continue to rely mostly on volunteers, because—
There’s no room for computing in the curriculum. More precisely, faculty won’t make room, because they think computing is less important than thermodynamics, phonology, or whatever other subjects make up the core of their discipline. I used to grumble about this, but I now accept that it’s a rational choice: unless and until journal reviewers and grant agencies start asking hard questions about how scientists produce their computational results, investing time in improving computational skills is a cost with uncertain rewards. And yes, there are a few exceptions here and there, but until we move to five- or six-year undergraduate degrees, they’ll continue to be exceptions. Realistically, I think the best we can hope for in the next decade is that computing has the same standing as statistics, i.e., everybody has to know the basics because their other work depends on it, but more advanced knowledge is acquired on a discipline-specific need-to-know basis.
Follow-through is hard. OK, so you just spent a couple of days at some kind of workshop: what now? If you’re lucky, you learned enough about Python or the shell to start automating a few data analysis tasks, so a positive feedback loop will kick in. But if the problem in front of you is to speed up 80,000 lines of legacy C++, those two days probably aren’t going to make a big difference. Yes, there are a lot of tutorials online that are supposed to help you, but in practice, you’ll probably find those more frustrating than anything else: they assume a lot of background knowledge you don’t have, so you’re not sure which ones actually move you closer to your goal. The proposed Computational Science area at Stack Exchange might help here, if it takes off, and we’re hoping that running lesson-a-week online classes after workshops will help too, but it will always be hard for people to find time for “deep” learning, which is precisely what will make the next problem they run into easier to solve.
Most of today’s online teaching tools implement bad models of teaching. We’ve known for decades—literally, decades—that watching a video and doing some exercises is a lousy way to teach (see recent posts by Frank Noschese and Scott Gray for discussion). In programming terms, the root of the problem is that canned instruction assumes the teacher can accurately predict how learners are going to interpret and mis-interpret lessons—in software engineering terms, it’s plan-driven rather than adaptive. In practice, different learners will mis-interpret lessons in different (and hard-to-predict) ways; in order to be effective, teaching needs some sort of agile feedback loop to correct for this, but that’s exactly what most approaches to web-based teaching take out of the equation [3].
So, is it hopeless? Of course not: over the next six months, and (hopefully) the next five years, I believe we can make real progress on several fronts. We can certainly recruit and train more workshop organizers and instructors, and experiment with different kinds of online learning to see which will make follow-through easiest and most effective (which in turn depends on us coming up with ways to assess the impact we’re having). If you’d like to help, please get in touch.
[1] I get “tens of thousands” by taking a million competent programmers, multiplying by 10% (the proportion who can teach), and then multiplying by 10% again (the proportion who might be interested). Your made-up stats may vary.
[2] The other reason this has to be a priority is that our learners’ needs are as diverse as their backgrounds. Our learners want to jump straight from “what’s a for loop?” to “how do I detect glottal stops in lo-fi audio?” or “how do I visualize turbulent flow of interstellar gas?” We’re never going to be able to cover these with just a handful of content creators.
[3] Note that I’m using “online” to mean recorded and/or automated, i.e., things that learners can do when they want. Other approaches that deliver traditional lectures or seminars over the web synchronously and interactively are a bit better, but don’t scale: no webinar system I’ve ever seen gives the instructor the kind of feeling for the room that s/he’d get in a regular lecture hall.
We’ve just added a single-page description of the two-day boot camps we’re planning to run in the next six months. In brief, their aim is to ensure that people have a few core skills, so that they can tackle our online material productively, and to help them get past startup hurdles such as software installation. If you have questions, comments, or suggestions, please add them to that page; if you’d like us to help you organize and run a boot camp, please get in touch.