Archive for May, 2010

Program Design: the First Third

May 31st, 2010 2 comments

I spent last week making and re-making slides for a lecture on program design, and managed to record the first third of them today; the rest should follow tomorrow and Wednesday.  We’ve taken on board your comments about pace and layout, and would be grateful for more feedback: please comment on this post to tell us what you think.

  1. Introduction
  2. The Grid
  3. Aliasing
  4. Randomness

Categories: Lectures, Version 4 Tags:

Jim Graham on Reproducibility

May 29th, 2010 1 comment

In response to Titus Brown’s not-really-joking spoof of how most scientists manage their data, Scimatic’s Jim Graham has asked, “What is reproducibility, anyway?” His main point is that if I re-run your code on your data, and get the same result that you got, I haven’t actually added to the sum total of human knowledge. If I get an equivalent answer using different code, on the other hand, then our confidence in the answer is usefully increased. What do you think?

Categories: Noticed Tags:

Teaching databases by example

May 28th, 2010 No comments

Over the last two weeks I’ve been spending most of my Software Carpentry time working on the database unit.  It began as a fairly straightforward translation of the Software Carpentry 3.0 lecture notes, with only a few changes to the sequencing of the topics. The plan for the unit is fairly simple: each screencast introduces only one or two new topics and builds on the previous screencasts, with as few forward references to later topics as possible.  The topics themselves are really just the different language features (SELECT, WHERE, JOIN, etc.) presented as tasks you might want to perform, and illustrated by working with toy data.

It might be clearer if I present the topic plan we had in mind:

  1. An Introduction to Databases (What is a database and why/when would you use one?)
  2. Getting data from a table, filtering, and sorting it. (SELECT, WHERE, and ORDER BY)
  3. Aggregating and grouping results (GROUP BY, and aggregation functions: SUM, MAX, etc.)
  4. Dealing with empty or missing data (NULL)
  5. Combining data from multiple tables (inner JOINs)
  6. Advanced queries (subqueries)
  7. An overview of other features (e.g. HAVING, expressions in SELECT, LIMIT, …)
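
As a rough sketch of what topics 2 to 5 might look like on toy data, here is one query per topic run through Python's built-in sqlite3 module. The tables, columns, and values are made up for illustration; they are assumptions, not the data sets used in the screencasts.

```python
import sqlite3

# Hypothetical toy data: table names, columns, and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE researcher (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE experiment (id INTEGER PRIMARY KEY, researcher_id INTEGER, hours REAL);
    INSERT INTO researcher VALUES (1, 'Ada'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO experiment VALUES (1, 1, 4.0), (2, 1, 6.5), (3, 2, NULL), (4, 2, 3.0);
""")

def query(sql):
    """Run one statement and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Topic 2: get, filter, and sort rows (SELECT, WHERE, ORDER BY).
long_runs = query(
    "SELECT id, hours FROM experiment WHERE hours > 3.5 ORDER BY hours DESC")

# Topic 3: aggregate per group (GROUP BY with SUM); SUM skips NULL values.
totals = query(
    "SELECT researcher_id, SUM(hours) FROM experiment "
    "GROUP BY researcher_id ORDER BY researcher_id")

# Topic 4: missing data. IS NULL finds the gap; '= NULL' never matches anything.
missing = query("SELECT id FROM experiment WHERE hours IS NULL")
wrong = query("SELECT id FROM experiment WHERE hours = NULL")

# Topic 5: inner JOIN. Cy has no experiments, so Cy drops out of the result.
names = query(
    "SELECT DISTINCT r.name FROM researcher r "
    "JOIN experiment e ON e.researcher_id = r.id ORDER BY r.name")

print(long_runs, totals, missing, wrong, names)
```

Each query uses only the features of its own topic plus earlier ones, which is the "few forward references" ordering the plan aims for.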

And then this week Greg suggested I take a look at this gem of a book (sadly, out of print):

The Essence of SQL: A Guide to Learning Most of SQL in the Least Amount of Time by David Rozenshtein

It takes a completely different and, I think, much more useful approach.  It begins with a list of typical questions you might ask about a database of students/courses/profs. For example, “What are the student numbers and names of students who take CS112?”, “Who are the youngest students?”, “Who does not take CS112?”, or “Who takes a course which is not CS112?”. These questions are meant as prototypes of the sorts of questions you would use any database to answer.  The book proceeds through each question and explains how you’d use SQL to answer it, why it makes sense to do it that way, and why it even works in the first place.
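
To see why the last two questions are worth separating, here is a minimal sketch using Python's built-in sqlite3 module. The schema (`student`, `takes`) and the rows are assumptions made up for this example, not the book's own tables or its solutions.

```python
import sqlite3

# Hypothetical schema and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (sno INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE takes (sno INTEGER, cno TEXT);
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cat'), (4, 'Dov');
    INSERT INTO takes VALUES
        (1, 'CS112'), (1, 'CS200'),  -- Ann takes both courses
        (2, 'CS112'),                -- Ben takes only CS112
        (3, 'CS200');                -- Cat takes only CS200; Dov takes nothing
""")

# "Who takes a course which is not CS112?"
# A row-by-row filter is enough: any enrolment in another course qualifies.
takes_other = [row[0] for row in conn.execute("""
    SELECT DISTINCT s.name
    FROM student s JOIN takes t ON t.sno = s.sno
    WHERE t.cno <> 'CS112'
    ORDER BY s.name
""")]

# "Who does not take CS112?"
# This needs negation over the whole set of enrolments, hence the subquery.
not_cs112 = [row[0] for row in conn.execute("""
    SELECT name FROM student
    WHERE sno NOT IN (SELECT sno FROM takes WHERE cno = 'CS112')
    ORDER BY name
""")]

print(takes_other)  # Ann and Cat
print(not_cs112)    # Cat and Dov
```

Ann appears only in the first answer (she takes CS200 as well as CS112) and Dov only in the second (he takes no courses at all), so the two similar-sounding questions really do need different queries.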

My intuition about this approach is that it makes for a great way to learn about databases.  Structuring the material around the prototypical questions makes it really useful to refer back to later, when you have a real problem to solve, and a problem-based unit is much more motivating.  My concern is that by organising the database unit around these questions we’ll be stuck mixing the language features throughout the unit, with all sorts of cross-referencing needed in case people just want to learn about using a WHERE clause.

What do you think?  Greg suggests that maybe the Rozenshtein approach is something we use in combination with our original approach: we first cover the language features in separate screencasts, then use the Rozenshtein questions to pull it all together.

There’s another question I have that’s separate from the question of which approach we choose.  In my opinion, having real data to teach around is more exciting and educational than creating toy data that’s tuned just to the specific topic.  The Nobel Prize data, and the Experiments and Researchers data sets we’ve used in the first few lectures, are okay, but wouldn’t it be so much more interesting if we started with, say, data about tobacco use and physical activity, and each screencast taught a language feature by posing and answering questions about this dataset?  This is something the book Data Analysis Using SQL and Excel by Gordon S. Linoff does really well.

So, another question for us (and now for you!) is: what real-world data should we use to teach?  We need a dataset we can tell a good story around.  The data also needs to be varied, yet yield query results small enough that they don’t need to be scrolled at our small screencast size, and it needs to span several tables so that we can illustrate different join types and such.

Categories: Content, Version 4 Tags:

Badges and Stars

May 27th, 2010 No comments

The Open Notebook Science folks have developed a set of badges to help people label their work, ranging from “All Content – Immediate Release” to “Selected Content – Delayed Release”. Like the badges for various Creative Commons licenses, or Tim Berners-Lee’s proposed five-star system for rating open data, the biggest benefit of this kind of categorization is that it encourages people to think more clearly about what they are (or aren’t) doing, and why.

Categories: Opinion Tags:

Archiving Experiments to Raise Scientific Standards

May 25th, 2010 No comments

An NSF Workshop on Archiving Experiments to Raise Scientific Standards begins today (May 25, 2010) at the University of Utah — the schedule has links to some of the presentations, and there’s a live video feed as well. Topic areas range from networking and compilers to geophysics and the life sciences — it’ll be interesting to see what comes out, and how it interacts with the Open Provenance work.

Categories: Noticed Tags:

Evaluating Methods and Protocols

May 19th, 2010 No comments

From their home page:

ScienceCheck.org is the only site dedicated specifically to sharing objective evaluations of published experimental methods/protocols, from researchers with real-world experience.

Sadly, the “computer science” category (filed under “Other”) is currently empty, and there isn’t a “computational science” category as far as I can see. Still, it’s a beginning…

Categories: Noticed Tags:

We’ll Know We’ve Succeeded If…

May 18th, 2010 1 comment

We will know the students taking this course have learned something if:

  1. They understand why Titus Brown’s Data Management Plan post is both funny and sad.
  2. They have the knowledge, skills, and tools they need to do something about it.

Categories: Opinion Tags:

Day 11: Slides

May 17th, 2010 1 comment

Today’s screencast experiment [link no longer active] is narration over PowerPoint slides: it isn’t animation per se, but we’d like your feedback on whether something like this is a good way to explain “big picture” concepts.

Categories: Lectures, Version 4 Tags:

Day 10: Closed Captioning

May 14th, 2010 1 comment

Our latest screencast, on NULL values in SQL, is now online. [Link no longer active; more recent screencast on NULL values.] Unlike its predecessors, this one has closed captions (as well as a transcript in the enclosing page). Please let us know what you think: are the captions helpful, or do you find them distracting?

Categories: Lectures, Version 4 Tags:

Why Most Scientists Don’t Like Computers

May 14th, 2010 5 comments

Psychology is fascinating: so much of what we think we know about people turns out not to be true, while so many everyday oddities turn out to have rational explanations (for some version of “rational”). For example, I’ve known for twenty-five years that most scientists dread the point in their work when they have to do something new with a computer. I’ve finally figured out why: it’s because the yield-to-effort function is wildly discontinuous, and human beings hate that.

Let’s start with a result from social psychology. Suppose people have a choice between waiting 5 minutes for the bus every single time, or waiting 1 minute nine times out of ten, and 20 minutes the tenth time. On average, they’re better off in the second case, but almost everyone prefers the first: they value predictability over pure economic yield.
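
The arithmetic behind "better off on average" can be checked with a few lines of Python. Using the standard deviation as a measure of unpredictability is an assumption of this sketch, not something the original result specifies.

```python
# Each option is a list of (probability, wait in minutes) pairs.
steady = [(1.0, 5.0)]                 # 5 minutes every single time
variable = [(0.9, 1.0), (0.1, 20.0)]  # 1 minute nine times in ten, else 20

def expected_wait(option):
    """Probability-weighted average wait in minutes."""
    return sum(p * wait for p, wait in option)

def spread(option):
    """Standard deviation of the wait, used here as a proxy for unpredictability."""
    mean = expected_wait(option)
    return sum(p * (wait - mean) ** 2 for p, wait in option) ** 0.5

for label, option in (("steady", steady), ("variable", variable)):
    print(f"{label}: mean {expected_wait(option):.1f} min, "
          f"spread {spread(option):.1f} min")
```

The variable bus averages about 2.9 minutes against 5, but its spread is about 5.7 minutes against zero, which is the trade between yield and predictability described above.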

Now, suppose you’re a scientist—or anyone else, for that matter—and you’re sitting down to do something with a computer that you haven’t done before. How long is it going to take? You don’t know, and what’s worse, no matter how much experience you have with computers, you still don’t know. You’re always in the situation of someone who’s wondering whether the bus will arrive in 1 minute or 20 minutes.

For example, we’re using TechSmith’s Camtasia to create our screencasts. Between us, we have six degrees in computer science and over forty years of experience, but we’re still twisting our ankles in potholes every time we try to do something new. This morning’s job was to add closed captioning to a screencast that Jon created discussing NULL in SQL. There’s a video on the Camtasia site showing how to do this with the Windows version, but Jon did his recording on a Mac. Can I add captions there? Not as of last August, and I can’t find anything more recent on the web or in Camtasia’s help to indicate that the situation has changed. Can the Windows version open the Mac project? No: if I want to switch from one version of the company’s flagship product to another, I have to export the content files (the audio and video) from the Mac version and load them into a new project on Windows. If I do that and add captions, will those captions survive me moving things back to the Mac? I have no idea…

Scientists run into this kind of thing all the time. They know something ought to be doable, but they have no way of knowing how long it will take to straighten out the kinks and make it happen. This naturally—and rightly—makes them very conservative: they’d rather spend 10 hours doing something the “wrong” way than take a chance that some new technique might let them finish in 1 hour, because experience has taught them that switching might wind up taking 20 hours instead. I now believe that predictable effort matters more to most people than “pure” usability or overall capability: given a choice between knowing what they’re getting themselves into, having an easy ride, or eventually being able to accomplish a lot, most people will choose the first. What I don’t know is how to reflect that in the design of this course…

Categories: Opinion Tags: