Our Long Tail

January 26th, 2012 1 comment

Frequency a page is viewed (as a percentage of total views) vs. pages (ordered by frequency); data taken over the last 90 days. I guess there’s something to this “long tail” stuff after all.

Categories: Noticed Tags:

Never Mind the Content, What About the Format?

January 26th, 2012 1 comment

I’m still gnawing on the problem of how to construct content for 21st Century learning—or, more prosaically, what I should use to build the next version of Software Carpentry. My starting point is the need to serve several different kinds of users [1]:

  • Zuzel likes textbooks. More specifically, she likes prose that she can read and re-read at leisure. She’ll use a tablet, but would rather listen to music while she learns than to someone lecturing.
  • Yeleina prefers interactive learning. She wants to see things evolve on the whiteboard or on the screen; recordings of live coding sessions with voiceover are OK, but slide after slide of bullet points puts her to sleep. She also wants to be able to share ideas about learning content with her peers.
  • Xanthe is a surfer, not a diver—she wants to skim the pages that Giggle searches turn up and piece things together herself. Bullet points and brief sentences work well for her, particularly in areas she’s already familiar with.
  • Wafiya teaches programming at a city library. She needs to remix content created by other educators to meet her learners’ needs, but doesn’t have a lot of time to do so. She also needs to be able to find content that fits into her (individualized and group) learning plans.
  • Veronique is a programmer who is passionate about teaching. She spends several hours a week writing short tutorials, answering questions in Stuck Underflow and other online forums, and occasionally recording screencasts. She’d like to do more with less effort (she finds today’s tools frustrating), to make the content she’s creating more useful, and to get more feedback from its users.

Their learning options today are:

  • Textbook: big blocks of prose in some narrative order, with pictures, either printed or electronic, read at the learner’s pace, alone.
    • Zuzel likes this.
    • Yeleina doesn’t.
    • Xanthe uses content out of order via the index or search bar.
    • Wafiya remixes content from several textbooks to create lessons (by photocopying, merging PDFs, or whatever). Like Zuzel, she has read content in order, but like Xanthe, she mainly uses the index now.
    • Veronique has thought about writing one, but (a) doesn’t think she has that much to say about any single topic, and (b) is put off by the effort that would be required.
    • Note: the comments below about the difficulty of copying, pasting, and altering also apply to electronic textbooks, as do the proposed remedies.
  • Static slideshow: a page-by-page dump of a PowerPoint deck, possibly accompanied by a transcript of what the lecturer would say when delivering it.
    • Zuzel uses this as if it were a badly-written textbook, with the transcript as the prose and the slides as diagrams.
    • Yeleina finds it distracting to switch attention back and forth from slides to transcript.
    • Xanthe searches the transcript to find what she wants, then curses because her search engine can’t “see” the text in the slides. She also hates the fact that she can’t copy and paste the code in the slides (since they’re PNGs embedded in a web page).
    • Wafiya remixes this content like any other. She’s too polite to curse, but she finds it tedious to re-type the code that’s shown in the slides (but isn’t duplicated as text in the accompanying transcript). She also finds it wearying to have to re-do diagrams: since the slides are PNGs, it’s difficult for her to copy part of a slide, move its elements around, and add a few of her own.
    • Veronique doesn’t create material in this format because she thinks it’s old-fashioned and not useful.
    • Note: source code can be made available as copy-and-pasteable text directly in the page, or for download; diagrams can similarly be made available as SVGs to facilitate remixing. Doing either currently requires considerable extra work on the part of content creators.
  • Voice-over slideshow screencast: a video recording of the slides (as they would appear on screen in a lecture) with someone speaking over them, and subtitles.
    • Zuzel ignores the video and reads the transcript as if it were a static slideshow. If a transcript isn’t available, she (reluctantly) watches the video.
    • Yeleina prefers this to a static slideshow, but prefers the doodling screencast described below even more.
    • Xanthe hits the “back” button as soon as she realizes it’s a video (unless there’s a transcript, in which case she curses because she can’t copy and paste code out of a video).
    • Wafiya directs students like Yeleina to these, but finds them harder to remix than other formats.
    • Veronique thinks this format is also old-fashioned and not useful.
    • Note: I’m assuming the subtitles are duplicated as a transcript, or available in some other searchable form. I’m also assuming that code is available for download or duplicated in the page for copying and pasting, though all of this requires extra work.
  • Voice-over doodling screencast: a Khan Academy-style recording of someone doodling on a tablet or coding live.
    • Zuzel treats this like a slideshow screencast.
    • Yeleina likes this format a lot, particularly if she can add comments at specific points and see her peers’ comments.
    • Xanthe has mixed feelings: she dislikes explanations delivered this way, but frequently watches “how to” videos, since they’re more likely to be accurate and complete than written descriptions.
    • Wafiya treats these like slideshow screencasts.
    • Veronique creates these fairly regularly: they’re easy to do, and easy to re-do when systems change or she discovers a mistake.
    • Note: I’m making the same assumptions about transcripts, code, and diagrams as above.
  • Recorded whiteboard lecture: someone with a camera has recorded someone giving a lecture in a lecture hall, and spliced that with whatever was on the lecturer’s screen.
    • Zuzel treats this like any other screencast.
    • Yeleina prefers this to doodling screencasts because she can see the speaker’s body language.
    • Xanthe treats these like any other screencast, i.e., she’ll use it if there’s a searchable transcript and things to copy and paste, or if it’s a recording of a live “how to” coding session.
    • Wafiya treats these like slideshow screencasts.
    • Veronique doesn’t create these, partly because of the setup required, but also because she doesn’t think seeing her adds value—the lesson’s supposed to be about the stuff.
    • Note: I’m assuming an electronic whiteboard, since video of someone writing on an actual whiteboard is usually illegible.
  • Radio drama: a voice-only podcast-style presentation.
    • Zuzel ignores the audio and reads the text transcript.
    • Ditto for Yeleina.
    • Ditto for Xanthe.
    • Ditto for Wafiya.
    • Veronique doesn’t create these.
    • Note: but for Ursula, who is blind, this is the only format: all the others fold into it. She doesn’t need code samples as text for copying and pasting: she needs them so that her screen reader can tell her what that code contains.
  • Star Wars: high-quality video with custom animations, cut scenes, and other special effects.
    • Zuzel watches these sometimes, but doesn’t learn any more from them than she would from a slideshow.
    • Yeleina enjoys these, which means she pays more attention to them, which means she learns more (but no more than she’d learn from an engaging lecturer).
    • Xanthe doesn’t see the point. Unless something blows up.
    • Wafiya likes their high production values, and remixes the special effects segments frequently.
    • Veronique can’t afford to produce this kind of material.
    • Note: again, I’m making assumptions about transcripts, copy-and-pasting, etc.
  • Write your own adventure exploration: typically a set of connected ideas or challenges with explicit dependency information (i.e., you should/must learn A and B before tackling C).
    • Zuzel finds the lack of narrative difficult.
    • Yeleina enjoys these if each node in the graph is in one of her preferred formats. She enjoys them even more if she is exploring with peers.
    • Xanthe ignores the ordering and searches for what (she thinks) she needs. If content is locked down—i.e., if the system won’t let her see or search C until she’s “completed” A and B—she writes an angry tweet and moves on.
    • Wafiya likes this format for several reasons, but only if everything is always visible. First, it tells her how other teachers think ideas connect (something that is missing or out-of-band for other delivery formats). Second, it’s easy to remix: again, providing it’s open, she can reorder things as she thinks best for particular learners.
    • Veronique would like to do this, but has discovered that creating the metadata about dependencies and recommended paths is as hard as writing a textbook.
  • Wander around exploration: lots of little snippets, but no explicit dependency information.
    • Zuzel finds this even more difficult.
    • Yeleina likes this less than the “write your own adventure” format: she thinks it’s no different than just using Giggle to find things.
    • Xanthe likes this because it’s just like using Giggle searches. In fact, she uses every other format as if it were this one.
    • Wafiya feels the same way as Yeleina: she likes having stuff to remix, but she has to do that remixing before this material is useful to those of her students who aren’t as independent as Xanthe.
    • Veronique creates content like this almost without realizing it by answering questions at Stuck Underflow.
  • Jam session: a bunch of learners in a room working through material simultaneously.
    • Zuzel doesn’t like it: it’s too noisy for her to concentrate, and she can’t go back at her own pace to review.
    • Yeleina thinks this is the best… thing… ever.
    • Xanthe is Giggling for information as soon as the presenter tells people what the topic is, but will stop and watch carefully when the presenter is typing live on screen.
    • Wafiya can only book space to do this occasionally, and even then, she doesn’t enjoy improv teaching.
    • Veronique enjoys doing this (she volunteers with a local free-range learning group), but can only find time once every couple of months.
    • Note: in theory this can be combined with any of the formats above. In practice, it’s almost always short, live lectures interspersed with hands-on practical work.
  • Personal tutoring: one-to-one instruction, a.k.a. “pair learning”.
    • Zuzel doesn’t mind this, but she really does prefer books…
    • Yeleina actually prefers jam sessions, since they tend to be more lively.
    • Xanthe likes having someone available to answer questions on demand, but is happy Giggling on her own for most of what she needs.
    • Wafiya wishes she could do this with every one of her learners, but there simply aren’t enough hours in the day. The personalized lesson plans she draws up are the closest approximation she can manage.
    • Veronique does this a lot, but is frustrated that it doesn’t scale—she really wants to help more than one person at a time.
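The “write your own adventure” format above depends on explicit dependency information. As a minimal sketch of what machine-readable dependencies make possible, assuming Python 3.9 or later (for the standard graphlib module) and a made-up set of lesson names, a few lines can compute one valid learning order and suggest what to tackle next. Nothing here locks content away: unlocked() only produces a recommendation, which is what Xanthe and Wafiya would insist on.

```python
from graphlib import TopologicalSorter

# Hypothetical lesson graph: each lesson maps to the lessons it depends on.
prerequisites = {
    "variables": set(),
    "loops": {"variables"},
    "functions": {"variables"},
    "testing": {"functions", "loops"},
}


def learning_order(prereqs):
    """Return one order in which every lesson follows its prerequisites."""
    return list(TopologicalSorter(prereqs).static_order())


def unlocked(prereqs, completed):
    """Suggest lessons whose prerequisites have all been completed."""
    return sorted(
        lesson
        for lesson, needs in prereqs.items()
        if lesson not in completed and needs <= completed
    )


order = learning_order(prerequisites)
print(order)
print(unlocked(prerequisites, {"variables"}))
```

Because the graph is plain data, a teacher who disagrees with the ordering can edit the dictionary and regenerate the path instead of rewriting the lessons.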

I’m sure some of the above is inconsistent or just plain wrong, but here are my takeaways:

  1. Different people want content in different formats. Yeah, OK, we knew that already, but:
  2. Everybody needs first-class content, in the programming sense of the term. In practice, it means that every kind of content can be copied and pasted without losing its meaning. A bunch of colored pixels in an image that look like letters aren’t actually letters; if you copy a region of an image and paste it into a text editor, you don’t get the text [2]. Similarly, search engines like Giggle can’t “see” code evolving line-by-line in a video, so you can’t search for that. Together, I think that point #1 and point #2 imply that:
  3. We need model-view separation in learning content. I apologize for the computerese, but I don’t know any other way to say it. A model (more fully, data model) is how information is stored, while a view is how people interact with it. Models should be designed to be easy for computers to work with; views should be designed to meet human needs, and the plural there is important: different people want to interact with information in different ways, and even a single person may want to use different ways at different times. Search engines want the information that’s in the model, such as the captions on the boxes in a diagram, not some arbitrary view of it (like a bunch of pixels in a PNG). People usually want that as well when they’re remixing, since their goals are to combine that information with information from other sources, and/or to present that information in different ways (i.e., views).
  4. We also need first-class metadata. I haven’t been able to find a standard format for summarizing and exchanging lesson objectives, learning dependencies, and everything else needed to stitch individual facts together. The closest thing seems to be SCORM, but I’d rather stick a fork in my eye [3]: it’s bloated, it mixes data models with meta-models with presentation layers with everything else its authoring committee could think of, and did I mention the fork? I could provide metadata as data, e.g., put a point-form list at the top of a lesson saying, “Here’s what you need to know before tackling this,” but that mixes model and view: since it’s just a convention, computers will have a hard time stitching things together accurately.
  5. Finally, we need social learning. Even the Zuzels of this world learn best in collaboration with other people: peer learners are often better at understanding and clearing up misconceptions than instructors, and having a “running partner” helps people stay focused and motivated. This isn’t really a matter of format, though, but of the tooling used to deliver content, so I’ll skip over it below.
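To make takeaway #3 a little more concrete, here is a small sketch in Python. The lesson dictionary and the two view functions are hypothetical names made up for this illustration, not a proposed format. The idea is only that one model can feed several views, and that the code sample stays real text in each of them.

```python
from html import escape

# Hypothetical model: a lesson stored as plain data, with no formatting.
lesson = {
    "title": "Reading files",
    "points": ["Open the file", "Loop over its lines", "Close the file"],
    "code": "for line in open('data.txt'):\n    print(line)",
}


def as_text(model):
    """View for readers who prefer plain text."""
    lines = [model["title"], "=" * len(model["title"])]
    lines += ["* " + point for point in model["points"]]
    lines += ["", model["code"]]
    return "\n".join(lines)


def as_html(model):
    """View for browsers; the code is emitted as text, not as an image."""
    items = "".join(f"<li>{escape(point)}</li>" for point in model["points"])
    return (
        f"<section><h1>{escape(model['title'])}</h1>"
        f"<ul>{items}</ul>"
        f"<pre><code>{escape(model['code'])}</code></pre></section>"
    )


print(as_text(lesson))
print(as_html(lesson))
```

A search engine or a remixer would work from the dictionary (or from the HTML text), never from a rendering of it.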

OK, so how well do today’s tools and/or formats do by these measures? The fact that “PowerPoint” is both a tool and a format is one indication that the answer is going to be, “Not well.”

  • Plain text is highly searchable and remixable, plays nicely with accessibility aids (i.e., screen readers), and runs everywhere, but:
    • it doesn’t do diagrams (unless you count ASCII art);
    • it doesn’t directly support metadata (except by convention); and
    • it doesn’t separate models from views.
  • HTML is also highly searchable and remixable until you start doing dynamic updates with Javascript, at which point today’s search tools (and accessibility aids) can’t keep track of what’s going on. Unlike text, it provides standard ways to include other media, so we’ll delay discussion of images and video. And while it doesn’t offer a standard way of providing metadata out of the box, HTML5’s custom data attributes were designed with exactly this kind of use in mind. And modern HTML partially separates models from views: I can use CSS to tell the rendering engine (e.g., a web browser) to display things differently for different use cases.
  • DocBook, LaTeX, and wiki text separate models from views even more than HTML does. What’s in the file is a description of content, information about the content, and just enough formatting to make things pretty when viewed in specific ways, e.g., “Break the page here to avoid an orphaned line.” Diagrams and metadata can be handled the same way as they are for HTML; in fact, I can’t see any advantage these formats have over modern HTML any longer [4], so I’m going to take them off the table.
  • PNG and other raster formats: fail the searchability and copy-and-paste tests.
  • SVG and other vector formats: do better. Since (some of) the content and relationships are explicit, search engines can find things in SVGs, and you can actually select and copy a box or an arrow, rather than a region of pixels. It only goes so far—Visio-style information about “this arrow connects the box labeled A to the box labeled B” is mostly implicit—but it’s better than raster. I’ve seen people do entire lessons as a series of SVGs, or as one large SVG with progressive reveal; I’ll talk about this more below.
  • PowerPoint and its kin: model, view, and authoring tool are inextricable from one another. You can copy and paste things, and modern search engines understand the format well enough to index textual content, but metadata is just a convention, and remixing takes a lot of work (even if the version you have is the original, rather than an exported ZIP file containing an HTML page that references PNG representations of the slides). That said, authoring rich presentations is easier than it is with HTML+SVG:
    1. You use the same tool to create textual and graphical content, rather than having to switch between tools and stitch content together.
    2. You can connect textual and graphical content, i.e., you can draw a circle around a word in one of your bullet points, then connect it with an arrow to a particular box in a diagram, just as you would when writing freehand on a whiteboard. This is what HTML-based slideshow packages lack: right now, they force authors to segregate text and graphics, which I view as a throwback to the era of hot metal typesetting.

    The fact is, most presenters continue to use PowerPoint (or something similar) because it makes it easy to create a reasonably good presentation in a reasonable amount of time [5]. HTML slideshow packages fail this test: authors must sacrifice the quality of the presentation (e.g., skip graphics, or embed segregated graphical files), and do a lot of non-content typing (tags, page IDs, and so on).

  • Video: fails all the “first-class content” tests [6], and isn’t effective [7] unless:
    • authors have the resources to produce Star Wars-quality content [8], or
    • they’re showing learners how to do something, like dissecting a frog or using a debugger.
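As a rough illustration of the HTML5 custom data attributes mentioned in the HTML bullet above, the sketch below pulls metadata out of a page using only Python’s standard html.parser module. The attribute names data-requires and data-objective are invented for this example; they are not part of any standard vocabulary, which is exactly the gap described earlier.

```python
from html.parser import HTMLParser

# Hypothetical markup: the data-* names are made up for illustration.
PAGE = """
<section id="loops" data-requires="variables lists" data-objective="Repeat an action">
  <p>Loops repeat things.</p>
</section>
<section id="variables" data-objective="Store a value">
  <p>Variables hold values.</p>
</section>
"""


class MetadataCollector(HTMLParser):
    """Collect the data-* attributes of every element that has an id."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "id" in attrs:
            self.metadata[attrs["id"]] = {
                name[len("data-"):]: value
                for name, value in attrs.items()
                if name.startswith("data-")
            }


collector = MetadataCollector()
collector.feed(PAGE)
print(collector.metadata)
```

The metadata lives in the same file as the content but is invisible in the default view, so it stays out of the reader’s way while remaining available to tools.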

So after all of this, what do I actually want?

  1. I want content stored in HTML5 with purely semantic markup, so that it can be searched, copied and pasted, and styled for presentation in a variety of ways [9].
  2. I want an agreed-upon meta and data-* vocabulary for educational metadata, like dependencies, introduction of key terms, questions and answers, and so on. I want a similar vocabulary for commenting and other social interactions that plays nicely with things like the Salmon protocol.
  3. I want an authoring tool (note the singular there) that lets me:
    1. write and draw WYSIWYG instead of typing in tags and IDs;
    2. freely mix drawings and text; and
    3. manage parallel streams (or channels), so that I can keep slide content, presenter’s notes, prose, and translations of all three into other languages together.
  4. I want to be able to animate my drawings and text, which is emphatically not the same as “embed video” (though I may want to do that too). Instead of recording the pixels drawn on the screen as I type Python into an editor, I want to record and play back the text that’s being created, so that learners can pause the animation, copy the text, and paste it somewhere else. Equally, instead of painting pixels to fool your eyes into believing that a box just moved off the screen, I want to move the damn box; once again, if you pause the animation, you should be able to click on the box, attach a comment to it, paste it into your own drawing, etc.
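Recording text rather than pixels, as item 4 asks, can be sketched in a few lines of Python. The Edit record and the text_at function are hypothetical names for this illustration, and a real recorder would also need cursor movement, selections, and so on. The point is that the recording is a list of edits, so the text at any moment can be rebuilt, paused on, and copied.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    time: float          # seconds since the start of the recording
    position: int        # index in the buffer where the edit happens
    inserted: str = ""   # text typed at that position
    deleted: int = 0     # number of characters removed at that position


def text_at(edits, when):
    """Rebuild the buffer as it looked `when` seconds into the recording."""
    buffer = ""
    for edit in sorted(edits, key=lambda e: e.time):
        if edit.time > when:
            break
        buffer = (
            buffer[:edit.position]
            + edit.inserted
            + buffer[edit.position + edit.deleted:]
        )
    return buffer


# A made-up recording: type a line, fix a typo, then capitalize a letter.
recording = [
    Edit(0.0, 0, inserted="print('helo')"),
    Edit(1.5, 10, inserted="l"),
    Edit(3.0, 7, inserted="H", deleted=1),
]

for moment in (1.0, 2.0, 5.0):
    print(moment, text_at(recording, moment))
```

A player built on this could show the typing at any speed, and a learner who pauses it gets real text rather than a picture of text.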

Freely mixing drawings and text feels like it ought to be doable today: we could either put the text in blocks inside a canvas element, or layer a transparent canvas over the page and dynamically resize it. Anchoring drawings to the underlying text (e.g., keeping the arrow from a term to the corresponding bit of the diagram in the right place) is “just” Javascript (for some value of “just”). Making it all WYSIWYG is just more Javascript [10].

But animation… Ah, that’s a big one. It’s an intrinsically hard problem, but canned effects can do a lot to put simple things within reach [11]. The big question is, how far do we push it? If I want to show you how to use a debugger, or how to draw something with a painting program, I can’t re-create the whole UI—I’m going to have to record pixels off a screen.

Or am I? I know this is never going to happen—we’re not that organized a species—but just imagine what the world would be like if every interface was built using HTML5 and CSS. Any tool at all could export widget descriptions and a semantic trace of what they did (i.e., “the file menu was pulled down” rather than “the cursor moved to pixel (132,172) and the user clicked”), and any other tool could consume it and play it back. The consuming tool might draw the widgets differently, or display the interactions in its own way, but that would be exactly the same as applying a different skin to the original tool [12].

Returning to this universe for a moment, we can store things as HTML5 right now—I’m already using it for Version 5 of Software Carpentry. I could create a vocabulary for instructional metadata, but I’m not an information architect. WYSIWYG authoring tools for HTML5 abound, though the HTML5 they produce can be idiosyncratic (and doesn’t play nicely with version control, but that’s fixable). I haven’t seen a WYSIWYG tool that supports freehand drawing mixed freely with text, or one that supports parallel content streams, but I think half a dozen people working could deliver something substantial in half a dozen months [13].

As for animation, I think we’re stuck with video for now: prototyping an HTML5/SVG/Javascript animation framework for use in a learning tool would be a great research project, but we really do need to build a couple to throw away to find out if it’s workable. If you’d like to tackle it, please let me know; I’d be happy to be your alpha tester.

Notes

[1] There was a lot of talk in the 1980s and 1990s about different people having different learning styles, inspired in part by Gardner’s theory of multiple intelligences. The idea has mostly been discredited, but like many memes, it lives on in popular culture.

[2] Although I bet someone’s working on an Emacs mode to do that…

[3] I’ve actually done this, so I know whereof I speak.

[4] Except that LaTeX and wiki text require slightly less typing than HTML, but if you’re using a smart editor, even that advantage goes away.

[5] Please don’t quote Tufte’s complaints about PowerPoint at me—I don’t think it encourages bad presentations any more than the tangled rules of English spelling and grammar encourage bad writing.

[6] In particular, almost all video content makes life harder for the visually impaired: a screencast in which someone talks over themselves typing in an editor or sketching on a tablet is tantalizing but useless to someone who can’t see the pixels. I committed this sin when I created Version 4 of Software Carpentry; I’d like to do better in Version 5, and would like to see high-profile online learning sites make some kind of effort as well.

[7] But wait a second: if video isn’t effective, why do MIT Open Courseware and the Khan Academy work so well? The short answer is, they mostly don’t: if you take out the 15% of people who can learn almost anything, no matter how it’s presented, watching videos and doing drill exercises works less well than other options. The longer answer is, watching a good teacher (and Khan is a great teacher) work through a problem, instead of just presenting the answer, moves the content into the “how to” category that video is well suited to.

[8] Research dating back to the early 1990s shows that higher-quality material improves student retention. I don’t know whether it improves it enough to justify its higher production costs, though.

[9] HTML5 will also help with version control, since I expect HTML5-aware diff-and-merge tools to start appearing Real Soon Now. Of course, I’ve been saying that for almost ten years…

[10] These days, you can wave away almost any technical objection with “it’s just more Javascript”.

[11] In my mind, the animation interface looks more like Scratch than it does like PowerPoint’s menus and dialogs. It definitely doesn’t require people to type in code, unless they want to create and share an entirely new kind of animation effect.

[12] We could even call that format XUL.

[13] “6×6” is as big a team/timescale as I’m able to contemplate these days.

Categories: Content, Education, Version 5.0 Tags:

The Big Picture

January 25th, 2012 3 comments

I’m trying to be systematic about re-designing the core curriculum of Software Carpentry. So far, I’ve identified 11 common questions:

Q01: How can I write a simple program?
Q02: How can I make the program I’ve written easier to reuse?
Q03: How can I reuse code that other people have written?
Q04: How can I share my work with other people?
Q05: How can I keep track of what I’ve done?
Q06: How can I tell if my program is working correctly?
Q07: How can I find and fix bugs when it isn’t?
Q08: How can I get data into my program?
Q09: How can I manage my data?
Q10: How can I automate this task?
Q11: How can I make my program faster?

whose answers depend on three fundamental principles:

F01: It’s all just data.
F02: Programming is a human activity.
F03: Better algorithms are better than better hardware.

These break down into 11 more specific principles:

P01: Code is just a kind of data.
P02: Metadata makes data easier to work with.
P03: Separate models and views.
P04: Trade human time for machine time and vice versa.
P05: Anything that’s repeated will eventually be wrong somewhere.
P06: Programming is about creating and composing abstractions.
P07: Programming is about feedback loops at different timescales.
P08: Good programs are the result of making good techniques a habit.
P09: Let the computer decide what to do and when.
P10: Sometimes you copy, sometimes you share.
P11: Paranoia makes us productive.

which in turn translate into 11 recommendations:

R01: Use the right algorithms and data structures.
R02: Use a version control system.
R03: Automate repetitive tasks.
R04: Use a command shell.
R05: Use tests to define correctness.
R06: Reuse existing code.
R07: Design code to be testable.
R08: Use structured data and machine-readable metadata.
R09: Separate interfaces from implementations.
R10: Use a debugger.
R11: Design code for people to read.

Here’s how I see all this mapping onto the curriculum (assuming we replace agile development with number crunching):

  • The Shell: files and directories; creating things; pipes and filters; permissions; shell scripts; finding things; variables; loops
    • Q03: How can I reuse code that other people have written?
    • Q10: How can I automate this task?
    • P04: We can trade human time for machine time and vice versa.
    • P06: Programming is about creating and composing abstractions.
    • R03: Automate repetitive tasks.
    • R04: Use a command shell.
    • R06: Reuse existing code.
  • Version control: update, edit, commit, and history; merging conflicts; recovering old versions; setting up a repository
    • Q04: How can I share my work with other people?
    • Q05: How can I keep track of what I’ve done?
    • Q09: How can I manage my data?
    • F01: It’s all just data.
    • F02: Programming is a human activity.
    • P01: Code is just a kind of data.
    • P02: Metadata makes data easier to work with.
    • P05: Anything that’s repeated will eventually be wrong somewhere.
    • P07: Programming is about feedback loops at different timescales.
    • P11: Paranoia makes us productive.
    • R02: Use a version control system.
    • R03: Automate repetitive tasks.
    • R08: Use structured data and machine-readable metadata.
  • Basic Programming in Python: variables and assignment; repeating things; lists; reading and writing; conditionals; nesting control structures; design patterns
    • Q01: How can I write a simple program?
    • Q02: How can I tell if my program is designed well?
    • Q08: How can I get data into my program?
    • P04: We can trade human time for machine time and vice versa.
    • P05: Anything that’s repeated will eventually be wrong somewhere.
    • P06: Programming is about creating and composing abstractions.
    • R01: Use the right algorithms and data structures.
    • R11: Design code for people to read.
  • Interlude: aliasing
    • P10: Sometimes you copy, sometimes you share.
  • Interlude: text
    • F01: It’s all just data.
  • Interlude: Booleans and while loops
    • R11: Design code for people to read.
  • Interlude: Using a debugger
    • Q01: How can I write a simple program?
    • Q07: How can I find and fix bugs when it isn’t?
    • F01: It’s all just data.
    • R10: Use a debugger.
  • Functions and Libraries in Python: how functions work; aliasing (again); multiple arguments; returning values; libraries; standard libraries; functions as objects
    • Q01: How can I write a simple program?
    • Q02: How can I tell if my program is designed well?
    • Q02: How can I make the program I’ve written easier to reuse?
    • F01: It’s all just data.
    • P05: Anything that’s repeated will eventually be wrong somewhere.
    • P06: Programming is about creating and composing abstractions.
    • P10: Sometimes you copy, sometimes you share.
    • R06: Reuse existing code.
    • R09: Separate interfaces from implementations.
    • R11: Design code for people to read.
  • Interlude: provenance
    • Q05: How can I keep track of what I’ve done?
    • Q09: How can I manage my data?
    • Q10: How can I automate this task?
    • F01: It’s all just data.
    • P09: Let the computer decide what to do and when.
    • R03: Automate repetitive tasks.
    • R08: Use structured data and machine-readable metadata.
  • Program Development: creating a grid; randomness; neighbors; handling ties; putting it all together; fixing bugs; refactoring
    • Q01: How can I write a simple program?
    • Q02: How can I tell if my program is designed well?
    • Q11: How can I make my program faster?
    • F02: Programming is a human activity.
    • P06: Programming is about creating and composing abstractions.
    • P07: Programming is about feedback loops at different timescales.
    • P08: Good programs are the result of making good techniques a habit.
    • R01: Use the right algorithms and data structures.
    • R06: Reuse existing code.
    • R07: Design code to be testable.
    • R09: Separate interfaces from implementations.
    • R11: Design code for people to read.
  • Interlude: configuring programs
    • F01: It’s all just data.
  • Interlude: assertions; exceptions
    • P11: Paranoia makes us productive.
  • Testing: goals; tests as specifications; structuring unit tests; using a unit testing framework; design for test
    • Q02: How can I tell if my program is designed well?
    • Q06: How can I tell if my program is working correctly?
    • Q07: How can I find and fix bugs when it isn’t?
    • Q10: How can I automate this task?
    • F02: Programming is a human activity.
    • P01: Code is just a kind of data.
    • P07: Programming is about feedback loops at different timescales.
    • P08: Good programs are the result of making good techniques a habit.
    • P09: Let the computer decide what to do and when.
    • P11: Paranoia makes us productive.
    • R03: Automate repetitive tasks.
    • R05: Use tests to define correctness.
    • R07: Design code to be testable.
  • Sets and Dictionaries: sets; storage; dictionaries; simple examples; longer examples
    • F03: Better algorithms are better than better hardware.
    • Q11: How can I make my program faster?
    • R01: Use the right algorithms and data structures.
  • Interlude: numbers
    • F01: It’s all just data.
  • Number Crunching: basics; indexing; linear algebra; making recommendations; statistics
    • Q03: How can I reuse code that other people have written?
    • Q11: How can I make my program faster?
    • F03: Better algorithms are better than better hardware.
    • P04: We can trade human time for machine time and vice versa.
    • P09: Let the computer decide what to do and when.
    • R01: Use the right algorithms and data structures.
    • R06: Reuse existing code.
  • Databases: selecting; removing duplicates; calculating new values; filtering; sorting; aggregation; joins; missing data; nested queries; transactions; programming with databases
    • Q08: How can I get data into my program?
    • Q09: How can I manage my data?
    • F01: It’s all just data.
    • P02: Metadata makes data easier to work with.
    • P03: Separate models and views.
    • P05: Anything that’s repeated will eventually be wrong somewhere.
    • P09: Let the computer decide what to do and when.
    • R08: Use structured data and machine-readable metadata.

Comments and suggestions would be very welcome.

Categories: Version 5.0 Tags:

Take Out Agile, and Add…What?

January 24th, 2012 17 comments

Based on the feedback we’ve received so far (both as comments and by email), it looks like we should take development methodologies (i.e., agile development) out of the core curriculum and replace them with two hours on:

  1. Nothing: there’s already too much in the core.
  2. Spreadsheets: because many scientists use them badly.
  3. NumPy and/or Pandas: because many of them are crunching matrices/doing stats.
  4. Visualization: which in practice would mean the basics of matplotlib.
  5. Image manipulation: because it’s fun as well as useful, and lets us talk about binary vs. text data.

I am quite arbitrarily limiting options to those five. Please cast your vote (one vote, not three out of five) in comments. We’d be grateful if you could include a brief explanation as well.

Categories: Content, Version 5.0 Tags:

Test-Driven Public Speaking

January 24th, 2012 No comments

Once again, Cameron Neylon explains things much better than I ever could:

“The impact factor of a journal is a better predictor of the chances of a paper being retracted than…of the number of citations.”

Categories: Noticed Tags:

Badging

January 24th, 2012 3 comments

One of the things we need to do in the next six months along with running boot camps and updating our online content is to create some sort of badging to recognize people’s skills and contributions. As we said in the proposal to the Sloan Foundation, “A badge program will provide near-term incentives for both learning and mentoring; a framework to support viral, peer-driven engagement with the program; and facilitate recognition by partner institutions and potential employers.”

We’re going to rely on Mozilla’s Open Badges project to handle the mechanics of storing and validating badges, so we only have three questions to answer:

  1. What do we award badges for?
  2. How do we determine that someone has earned one?
  3. What do they look like?

The obvious answer to the first (and most important) question would be, “You get a badge for completing the core curriculum.” However, one of the purposes of badging is to provide a finer-grained inventory of people’s knowledge and skills, so there’s an argument to be made for giving one badge per topic, e.g., a version control badge, a Unix shell badge, a basic imperative programming badge, and so on.  The argument for is that their meaning will be clearer: if I say, “Jane knows the basics of Subversion,” that’s more immediately understandable than, “Jane has completed the core of Software Carpentry.” The argument against is that if someone has collected two hundred small badges, we’re going to aggregate them anyway (“Jane knows basic software development skills”), so why not just do that in the first place.

I’ve gone back and forth on this, but currently think that one badge for the core curriculum (“Basic Software Carpentry”) will work best. We will offer two other badges as well: one for organizing a boot camp, and one for contributing a medium-sized chunk of content (on the scale of one 5-minute video episode).

Having decided that, the next challenge is to determine when someone has earned a particular badge. The “Boot Camp Organizer” and “Content Contributor” badges are straightforward; telling when someone has mastered the core skills is not. We can tell that you’ve attended the boot camp and viewed the videos, but how can we tell how much you’ve actually learned?  “Solve this problem and email us the result” isn’t good enough: you could get someone to do it for you [1], and even if you’re honest, we can’t tell how quickly you did it, how many blind alleys you went down, how often you did something in ten steps instead of one, and so on. In the short term, I think the solution is to do assessment in real time using desktop sharing, i.e., you share your desktop with me, I give you the problem to solve, and I watch you do it.  This won’t scale to hundreds or thousands of learners, but it’ll get us through the next six months.

What will badges look like? A badge is just a small PNG file with a digital signature embedded in it (it’s a neat little hack), so the graphic design is up to us. I like our current logo, but (a) it doesn’t size down well, and (b) I’ve been wanting to redesign it anyway, since the blue-to-white fade in the background doesn’t print well on t-shirts, coffee mugs, and other media. In keeping with our carpentry theme (“We’re not teaching people how to build the Channel Tunnel, we’re teaching them how to hang drywall”), I’d like an image that combines tools like hammers and saws with something like 1’s and 0’s to represent software, but I’m a lousy graphic designer; if any of our readers would like to take a crack at it, please let me know.

Finally, and most importantly, how can we get existing institutions—specifically universities—to recognize badges in some way? As much as we’d like people to value skills for their own sake, everyone is always busy, and always has more to do than time to do it. Can we persuade a few schools to list badges as non-credit items on students’ transcripts (just as they might presently list a short course in presentation skills or entrepreneurship that doesn’t count toward degree requirements, but required some work on the student’s part)?  If so, it would give people an extra incentive to complete the core curriculum, organize a workshop, or create some content for us, particularly in a tight job market where every small distinction counts.

[1] It’s unlikely that someone would cheat on a Software Carpentry exercise, but in general, if badges take off and actually start to matter, the people who sell college students essays on Steinbeck at $30 a pop will start offering to write their online exams for $50 each.

Categories: Community, Sloan Foundation, Version 5.0 Tags:

Revising the Curriculum

January 23rd, 2012 19 comments

I’ve been thinking some more about what the foundation and core of Software Carpentry actually are (and not just because Jon Pipitone keeps pestering me to do so). My last attempt had a foundation of seven principles and a dozen topics in the core. I think I can slim that down even further; in fact, I think three big principles form the foundation of computational thinking:

  1. It’s all just data, whose meaning depends on interpretation. This subsumes the notions that programs are a kind of data (which is the basis of things as diverse as functional programming and version control), and that we should separate models from views (because the most efficient ways for people and computers to interpret data are different). It doesn’t really include the distinction of copy vs. reference, but I’m going to lump it in here because that idea doesn’t seem big enough to deserve a heading of its own.
  2. Programming is a human activity. The only way to build large programs (or even small ones) is to create, compose, and re-use abstractions, because our brains can only understand a few things at a time. Similarly, good technique (specifically version control, testing, task automation, and some rules for collaborating, be they agile or sturdy) is necessary because everyone is tired, bored, or out of their depth at least once in a while.
  3. Better algorithms are better than better hardware. Computational complexity determines what’s doable and what isn’t, and no aspect of program performance makes sense without some understanding of it.
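
The copy vs. reference distinction lumped into the first principle is easiest to see in a few lines of code. The following is only a minimal sketch in Python; the function name and the sample values are made up for illustration and are not taken from any Software Carpentry lesson.

```python
import copy


def demonstrate_aliasing():
    """Show how an alias, a shallow copy, and a deep copy behave (illustrative only)."""
    original = [[1, 2], [3, 4]]
    alias = original                 # a second name for the same list object
    shallow = list(original)         # new outer list, but the inner lists are shared
    deep = copy.deepcopy(original)   # fully independent copy

    alias.append([5, 6])             # changes the one list both names refer to
    shallow[0].append(99)            # inner list is shared, so original sees this too
    deep[1].append(-1)               # independent: original is unaffected

    return original, shallow, deep


original, shallow, deep = demonstrate_aliasing()
print(original)
print(shallow)
print(deep)
```

The same data is being interpreted through three different names, and only the deep copy is safe to change without side effects.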

I also think we can reduce the core topics to just eight, though I can already hear protests from the back of the room about some of the omissions. I got this list by asking, “What’s the minimum I think a graduate student needs to know to contribute to the computational work in a typical lab?” My answer is:

  1. The Unix shell
    • Includes: basic commands (from ls and cd to sort and grep); files and directories; the pipe-and-filter model.
    • Because: it’s still the workhorse of scientific computing (and is experiencing a resurgence as cloud computing becomes more popular).
    • Illustrates: “lots of little pieces loosely joined” is a good way to introduce modularity and tool-based computing; it lets us talk about the human time vs. machine time tradeoff.
    • Omissions: find; shell scripts (particularly for loops); SSH.
  2. Version control
    • Includes: update/edit/commit; merge (with rollback as a special case).
    • Because: it’s a key technique.
    • Illustrates: the idea of metadata; programming as a human activity (the hour-long red-green-refactor-commit cycle).
    • Omissions: branching; distributed version control.
  3. The common core of programming
    • Includes: variables; loops; conditionals; lists; functions; libraries; memory model (aliasing).
    • Because: we can’t teach validation, associative data structures, or program design without this common core.
    • Illustrates: programming as a human activity (programs must be readable, testable, etc.).
    • Omissions: object-oriented programming; matrix programming.
  4. Validation
    • Includes: structured unit tests; test-driven development; defensive programming; error handling; data validation.
    • Because: defense in depth is key to building large programs, and trustworthy programs of any scale.
    • Illustrates: trustworthy programs come from good technique.
    • Omissions: testing floating-point code (since we don’t really know how to).
  5. Program construction
    • Includes: piecewise refinement; refactoring; design for test; first-class functions; using a debugger.
    • Because: knowing the syntax of a programming language doesn’t tell you how to create a program.
    • Illustrates: creating and composing abstractions; interface vs. implementation.
    • Omissions: structured documentation.
  6. Associative data structures
    • Includes: sets (as a prelude); dictionaries; why keys must be immutable.
    • Because: useful in so many places.
    • Illustrates: how the right algorithms and data structures make programs more efficient.
    • Omissions: implementation details.
  7. Databases
    • Includes: select; sort; filter; aggregate; null; join; accessing a database from a program.
    • Because: useful in many contexts.
    • Illustrates: separation of models and views; a different model of computation.
    • Omissions: sub-queries; object-relational mapping; database design.
    • Note: we could illustrate many of the same ideas with spreadsheets, but they’re not as easy to connect to programs.
  8. Development methodologies
    • Includes: agile practices (the usual Scrum+XP mix); sturdy (plan-driven) lifecycles.
    • Because: ties many other lessons together.
    • Illustrates: good technique makes good programs.
    • Omissions: code review.
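
To make topic 6 a little more concrete, here is a rough sketch of why dictionary keys must be immutable, assuming Python as the teaching language. The function names and the bird-sighting data are hypothetical, invented only for this example.

```python
def count_by_key(pairs):
    """Count how often each (species, site) pair occurs, using tuples as keys."""
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def try_list_key():
    """Attempt to use a list as a key; return the error message Python reports."""
    try:
        {["goldfinch", "site-1"]: 1}
    except TypeError as error:
        return str(error)
    return None


# Hypothetical observations: tuples are immutable, so they can be hashed and used as keys.
observations = [
    ("goldfinch", "site-1"),
    ("heron", "site-2"),
    ("goldfinch", "site-1"),
]
print(count_by_key(observations))
print(try_list_key())
```

A dictionary finds an entry by hashing its key. If a key could change after being stored, its hash would no longer match the slot it was filed under, which is why Python refuses to accept a list here.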

If we use a two-day boot camp to start, and follow up over six weeks with one lesson per week, I think we can cover:

Topic | Boot Camp | Online
1. Unix shell | ls and cd; files and directories | sort and grep; pipes
2. Version control | update/edit/commit; merge | rollback
3. Core programming | all of it (but see below) | not needed (but see below)
4. Validation | unit tests; TDD | defensive programming; error handling; data validation
5. Program construction | one extended example; one demo of a debugger | more examples; design for test; first-class functions
6. Associative data structures | none | everything
7. Databases | none | everything
8. Development methodologies | overview of agile | sturdy (plan-driven) lifecycle; evidence-based software engineering

Topic #3, core programming, is the hardest to manage. If people have programmed in Python before, it can be a quick review (or omitted altogether). If they’ve programmed in some other interactive language, it can also be covered pretty quickly, but if they’ve never programmed before, or took one freshman course ten years ago, there’s no way to teach them enough to make a difference in half a day. Even if there was, the other learners would undoubtedly be bored. The only solutions I can see are to restrict participation to people who can already do a simple exercise in some language, or to run one day of pre-boot camp training for non- or weak programmers. Neither option excites me…

Coming back to content, this plan means that we’ll leave out a lot of useful things:

  1. Spreadsheets: lots of scientists use spreadsheets badly, but while we’d like to show them how to do so well, the only one they actually use, Excel, isn’t open source or cross platform, and it’s much harder to build programs around spreadsheets than around databases.
  2. Make: is very hard to motivate unless people are working with compiled languages—we’ve tried showing people how to build data pipelines using Make, but it’s too clumsy to be compelling. Plus, Make’s syntax makes a hard problem worse…
  3. Systems programming: knowing how to walk directory trees and/or run sub-processes is useful, but we think people can pick these up on their own once they’ve mastered the core.
  4. Matrix programming: really important to some people, irrelevant to others, and the people it’s important to will probably have seen the ideas in something like MATLAB before we get them.
  5. Multimedia programming (images, audio, and video): people can learn basic image manipulation on their own; audio and video are harder, mostly due to a lack of documentation, but they aren’t important enough to enough people to belong in our core.
  6. Regular expressions: are a great way to illustrate the idea that programs are data, and are very useful, but everything in the core seems more important, and it’ll be hard enough to get through all that in the time we have. This is probably the one I most regret taking out…
  7. HTML/XML: there are lots of excellent tutorials on writing HTML, and while XML processing is a good way to introduce recursion (and, if XPath is included, to talk about programs as data once again), I believe once again that it’s not important enough to displace any of the material in the core.
  8. Object-oriented programming: is probably the omission that raises the most eyebrows. We can introduce it fairly naturally when talking about design for test (more specifically, about interface vs. implementation), but in practice, most people get along fine using lists, dictionaries, and the classes that come with the standard library without creating new classes of their own. Plus, showing people how to do OOP properly takes a lot more time than just showing them how to declare a class and give it methods.
  9. Desktop GUIs: an excellent way to introduce reactive (event-driven) programming and program frameworks, but is less important than it was ten years ago (most people would rather have a web interface these days).
  10. Web programming: the only thing we can teach people in the time we have is how to create security vulnerabilities.
  11. Security: the principles are easy to teach, but translating them into practice requires more knowledge (especially of things like web programming) than we can assume our learners have.
  12. Visualization: everybody wants it, but nobody can agree what it means. Should we show people how to use a specific library to create 3D vector flows? Or the principles of visual design so that they can make nicer 2D charts? And no matter what we teach, will they actually learn enough to make a difference?
  13. Performance and parallelism: the most important lesson, which is in the core, is that the right data structures and algorithms can speed programs up more than any amount of code tuning. Everything after that is either inextricably tied to the specifics of a particular language implementation (performance tuning), or offers no low-hanging fruit (parallelism). The one exception is job-level parallelism, which could be included in the material on the Unix shell if an appropriate cross-platform tool could be found.
  14. C/C++, Fortran, C#, or Java: would mostly serve to introduce fixed typing and compilation, but these are relatively low-priority topics.
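
The claim in item 13, that the right data structure beats code tuning, can be sketched as follows. This is an illustrative Python example only; the function names, the sample ranges, and the choice of `timeit` for measurement are my assumptions, and actual timings will vary from machine to machine.

```python
import timeit


def shared_slow(left, right):
    """Values of left that also appear in right, testing membership against a list."""
    # Each `in` scans the list, so the work grows like len(left) * len(right).
    return [x for x in left if x in right]


def shared_fast(left, right):
    """Same result, but membership is tested against a set."""
    right_set = set(right)  # built once; each lookup then takes roughly constant time
    return [x for x in left if x in right_set]


def compare(n=2000):
    """Time both versions on made-up data; returns (slow_seconds, fast_seconds)."""
    left = list(range(0, 2 * n, 2))
    right = list(range(0, 3 * n, 3))
    slow = timeit.timeit(lambda: shared_slow(left, right), number=3)
    fast = timeit.timeit(lambda: shared_fast(left, right), number=3)
    return slow, fast


if __name__ == "__main__":
    slow, fast = compare()
    print(f"list membership: {slow:.4f}s, set membership: {fast:.4f}s")
```

Both functions return the same answer; the only difference is the data structure used for the lookup, which is the point of the lesson.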

We’re going to start implementing this plan (or some derivative of it) at the beginning of February, to be ready for workshops starting at the end of that month. We’d welcome feedback; in particular, have we taken something out of the core that you think is more important than something that’s in, and that could be taught in the time that’s actually available? If you have thoughts, please let us know.

Categories: Content, Version 5.0 Tags:

The First Boot Camp of 2012

January 20th, 2012 No comments

We just wrapped up the first boot camp of 2012 at the Space Telescope Science Institute. 14 scientists with a wide variety of computational backgrounds spent two days learning about testing, version control, program structure, the basics of Python, and the psychology of learning and programming. We’re following up with 6 weeks of online material, partly because that’s what fits everyone’s schedules, and partly to see whether a blended approach works better than either strategy on its own.

And on a completely different topic, this diagram from the Discover magazine web site sums up every scientist-vs-journalist debate ever:

Why Is This Hard?

January 15th, 2012 9 comments

I’ve been teaching scientists to program since 1998 (or 1986, if you want to start with my first lunch-and-learn for grad students in physics at the University of Edinburgh). Technology has advanced by leaps and bounds in that time, but I don’t think it’s any easier than it used to be to get basic software skills into people’s heads. What makes it hard?

Programming is intrinsically difficult. It’s fashionable to claim otherwise, but abstract thinking is a fairly recent innovation in evolutionary terms, and our brains still find it hard. On the other hand, I don’t believe that state machines and data transformations are any harder than high school algebra, and everyone we’re trying to help has long since mastered that.

Today’s languages and tools make it more difficult. Setup (particularly installation) is, if anything, harder than it was twenty years ago, and even the cleanest languages are full of accidental complexity (particularly in their libraries). (And if you think otherwise, try running a programming workshop for non-programmers working on half a dozen different operating systems, with two or three slightly different versions of your favorite language installed, and then post your dissenting comment.) It’s heartening to see that people are finally reviving research from the 1970s and 1980s into the usability of programming languages, but as we found out the hard way, it will be a long time before computer “scientists” start accepting scientific answers to these questions, much less acting on them.

Our students’ diverse backgrounds make teaching more difficult too. Our recent workshop at the University of Toronto had students from linguistics through chemistry to astronomy. Some of them had never used a command shell before; others were their labs’ unofficial sys admins, and we saw similar variation in almost every other aspect of their computing knowledge. The solution, of course, would be to divide them into levels by topic, but—

We don’t have resources to teach widely or deeply. Tens of thousands of people could teach scientists and engineers basic computing skills [1], but we have no way to reach them—yet. One of our goals for the next six months/five years is to increase the pool of instructors by several orders of magnitude [2]. Even on a five-year timescale, though, we’ll have to continue to rely mostly on volunteers, because—

There’s no room for computing in the curriculum. More precisely, faculty won’t make room, because they think computing is less important than thermodynamics, phonology, or whatever other subjects make up the core of their discipline. I used to grumble about this, but I now accept that it’s a rational choice: unless and until journal reviewers and grant agencies start asking hard questions about how scientists produce their computational results, investing time in improving computational skills is a cost with uncertain rewards. And yes, there are a few exceptions here and there, but until we move to five- or six-year undergraduate degrees, they’ll continue to be exceptions. Realistically, I think the best we can hope for in the next decade is that computing has the same standing as statistics, i.e., everybody has to know the basics because their other work depends on it, but more advanced knowledge is acquired on a discipline-specific need-to-know basis.

Follow-through is hard. OK, so you just spent a couple of days at some kind of workshop: what now? If you’re lucky, you learned enough about Python or the shell to start automating a few data analysis tasks, so a positive feedback loop will kick in. But if the problem in front of you is to speed up 80,000 lines of legacy C++, those two days probably aren’t going to make a big difference. Yes, there are a lot of tutorials online that are supposed to help you, but in practice, you’ll probably find those more frustrating than anything else: they assume a lot of background knowledge you don’t have, so you’re not sure which ones actually move you closer to your goal. The proposed Computational Science area at Stack Exchange might help here, if it takes off, and we’re hoping that running lesson-a-week online classes after workshops will help too, but it will always be hard for people to find time for “deep” learning, which is precisely what will make the next problem they run into easier to solve.

Most of today’s online teaching tools implement bad models of teaching. We’ve known for decades—literally, decades—that watching a video and doing some exercises is a lousy way to teach (see recent posts by Frank Noschese and Scott Gray for discussion). In programming terms, the root of the problem is that canned instruction assumes the teacher can accurately predict how learners are going to interpret and mis-interpret lessons—in software engineering terms, it’s plan-driven rather than adaptive. In practice, different learners will mis-interpret lessons in different (and hard-to-predict) ways; in order to be effective, teaching needs some sort of agile feedback loop to correct for this, but that’s exactly what most approaches to web-based teaching take out of the equation [3].

So, is it hopeless? Of course not: over the next six months, and (hopefully) the next five years, I believe we can make real progress on several fronts. We can certainly recruit and train more workshop organizers and instructors, and experiment with different kinds of online learning to see which will make follow-through easiest and most effective (which in turn depends on us coming up with ways to assess the impact we’re having). If you’d like to help, please get in touch.

[1] I get “tens of thousands” by taking a million competent programmers, multiplying by 10% (the proportion who can teach), and then multiplying by 10% again (the proportion who might be interested). Your made-up stats may vary.

[2] The other reason this has to be a priority is that our learners’ needs are as diverse as their backgrounds. Our learners want to jump straight from “what’s a for loop?” to “how do I detect glottal stops in lo-fi audio?” or “how do I visualize turbulent flow of interstellar gas?” We’re never going to be able to cover these with just a handful of content creators.

[3] Note that I’m using “online” to mean recorded and/or automated, i.e., things that learners can do when they want. Other approaches that deliver traditional lectures or seminars over the web synchronously and interactively are a bit better, but don’t scale: no webinar system I’ve ever seen gives the instructor the kind of feeling for the room that s/he’d get in a regular lecture hall.

Categories: Education, Version 5.0 Tags:

The What, Why, and How of Boot Camps

January 13th, 2012 No comments

We’ve just added a single-page description of the two-day boot camps we’re planning to run in the next six months. In brief, their aim is to ensure that people have a few core skills, so that they can tackle our online material productively, and to help them get past startup hurdles such as software installation. If you have questions, comments, or suggestions, please add them to that page; if you’d like us to help you organize and run a boot camp, please get in touch.

Categories: Content, Version 5.0 Tags: