Teaching basic lab skills
for research computing

Key Points

On the flight back from Vancouver yesterday, I finally did what I should have done eight months ago and compiled the key points from our core lesson content. The results are presented below, broken down by lesson and topic; going forward, we're going to use something like this as a basis for defining what Software Carpentry is, and what workshop attendees can expect to learn.

The Shell

What and Why

  • The shell is a program whose primary purpose is to read commands, run programs, and display results.

Files and Directories

  • The file system is responsible for managing information on disk.
  • Information is stored in files, which are stored in directories (folders).
  • Directories can also store other directories, which forms a directory tree.
  • / on its own is the root directory of the whole filesystem.
  • A relative path specifies a location starting from the current location.
  • An absolute path specifies a location from the root of the filesystem.
  • Directory names in a path are separated with '/' on Unix, but '\' on Windows.
  • '..' means "the directory above the current one"; '.' on its own means "the current directory".
  • Most files' names are something.extension; the extension isn't required, and doesn't guarantee anything, but is normally used to indicate the type of data in the file.
  • cd path changes the current working directory.
  • ls path prints a listing of a specific file or directory; ls on its own lists the current working directory.
  • pwd prints the user's current working directory (current default location in the filesystem).
  • whoami shows the user's current identity.
  • Most commands take options (flags) which begin with a '-'.

Creating Things

  • Unix documentation uses '^A' to mean "control-A".
  • The shell does not have a trash bin: once something is deleted, it's really gone.
  • mkdir path creates a new directory.
  • cp old new copies a file.
  • mv old new moves (renames) a file or directory.
  • nano is a very simple text editor—please use something else for real work.
  • rm path removes (deletes) a file.
  • rmdir path removes (deletes) an empty directory.

Pipes and Filters

  • '*' is a wildcard pattern that matches zero or more characters in a pathname.
  • '?' is a wildcard pattern that matches any single character.
  • The shell matches wildcards before running commands.
  • command > file redirects a command's output to a file.
  • first | second is a pipeline: the output of the first command is used as the input to the second.
  • The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
  • cat displays the contents of its inputs.
  • head displays the first few lines of its input.
  • sort sorts its inputs.
  • tail displays the last few lines of its input.
  • wc counts lines, words, and characters in its inputs.

Loops

  • Use a for loop to repeat commands once for every thing in a list.
  • Every for loop needs a variable to refer to the current "thing".
  • Use $name to expand a variable (i.e., get its value).
  • Do not use spaces, quotes, or wildcard characters such as '*' or '?' in filenames, as it complicates variable expansion.
  • Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.
  • Use the up-arrow key to scroll up through previous commands to edit and repeat them.
  • Use history to display recent commands, and !number to repeat a command by number.
  • Use ^C (control-C) to terminate a running command.

Shell Scripts

  • Save commands in files (usually called shell scripts) for re-use.
  • Use bash filename to run saved commands.
  • $* refers to all of a shell script's command-line arguments.
  • $1, $2, etc., refer to specified command-line arguments.
  • Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.

Finding Things

  • Everything is stored as bytes, but the bytes in binary files do not represent characters.
  • Use nested loops to run commands for every combination of two lists of things.
  • Use '\' to break one logical line into several physical lines.
  • Use parentheses '()' to keep things combined.
  • Use $(command) to insert a command's output in place.
  • find finds files with specific properties that match patterns.
  • grep selects lines in files that match patterns.
  • man command displays the manual page for a given command.

Version Control with Subversion

  • Version control is a better way to manage shared files than email or shared folders.
  • The master copy is stored in a repository.
  • Nobody ever edits the master copy directly: instead, each person edits a local working copy.
  • People share changes by committing them to the master or updating their local copy from the master.
  • The version control system prevents people from overwriting each other's work by forcing them to merge concurrent changes before committing.
  • It also keeps a complete history of changes made to the master so that old versions can be recovered reliably.
  • Version control systems work best with text files, but can also handle binary files such as images and Word documents.

Basic Use

  • Every repository is identified by a URL.
  • Working copies of different repositories may not overlap.
  • Each change to the master copy is identified by a unique revision number.
  • Revisions identify snapshots of the entire repository, not changes to individual files.
  • Each change should be commented to make the history more readable.
  • Commits are transactions: either all changes are successfully committed, or none are.
  • The basic workflow for version control is update-change-commit.
  • svn add things tells Subversion to start managing particular files or directories.
  • svn checkout url checks out a working copy of a repository.
  • svn commit -m "message" things sends changes to the repository.
  • svn diff compares the current state of a working copy to the state after the most recent update.
  • svn diff -r HEAD compares the current state of a working copy to the state of the master copy.
  • svn log shows the history of a working copy.
  • svn status shows the status of a working copy.
  • svn update updates a working copy from the repository.

Merging Conflicts

  • Conflicts must be resolved before a commit can be completed.
  • Subversion puts markers in text files to show regions of conflict.
  • For each conflicted file, Subversion creates auxiliary files containing the common parent, the master version, and the local version.
  • svn resolve files tells Subversion that conflicts have been resolved.

Recovering Old Versions

  • Old versions of files can be recovered by merging their old state with their current state.
  • Recovering an old version of a file does not erase the intervening changes.
  • Use branches to support parallel independent development.
  • svn merge merges two revisions of a file.
  • svn revert undoes local changes to files.

Setting up a Repository

  • Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.
  • svnadmin create name creates a new repository.

Provenance

  • $Keyword:$ in a file can be filled in with a property value each time the file is committed.
  • Put version numbers in programs' output to establish provenance for data.
  • svn propset svn:keywords property files tells Subversion to start filling in property values.

Basic Programming

Basic Operations

  • Use '=' to assign a value to a variable.
  • Assigning to one variable does not change the values associated with other variables.
  • Use print to display values.
  • Variables are created when values are assigned to them.
  • Variables cannot be used until they have been created.
  • Addition ('+'), subtraction ('-'), and multiplication ('*') work as usual in Python.
  • Use meaningful, descriptive names for variables.

Creating Programs

  • Store programs in files whose names end in .py and run them with python name.py.

Types

  • The most commonly used data types in Python are integers (int), floating-point numbers (float), and strings (str).
  • Strings can start and end with either single quote (') or double quote (").
  • Division ('/') produces an int result when given int values: one or both arguments must be float to get a float result.
  • "Adding" strings concatenates them, multiplying strings by numbers repeats them.
  • Strings and numbers cannot be added because the behavior is ambiguous: convert one to the other type first.
  • Variables do not have types, but values do.
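
The points about strings, numbers, and types can be tried out in a few lines. This is only a sketch: the variable names and values are made up for illustration, and division is left out because its behavior depends on the Python version.

```python
# Values carry types; a variable is just a name attached to a value.
count = 3            # int
weight = 2.5         # float
label = "sample"     # str

# "Adding" strings concatenates them; multiplying by an int repeats them.
joined = label + "-" + 'A'
repeated = "ab" * count

# Adding a string and a number is an error, so convert one explicitly.
try:
    label + count
    mixed_ok = True
except TypeError:
    mixed_ok = False

as_text = label + str(count)
as_number = int("4") + count

# The same variable can later refer to a value of a different type.
count = "three"
```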

Reading Files

  • Data is either in memory, on disk, or far away.
  • Most things in Python are objects, and have attached functions called methods.
  • When lines are read from files, Python keeps their end-of-line characters.
  • Use str.strip to remove leading and trailing whitespace (including end-of-line characters).
  • Use file(name, mode) to open a file for reading ('r'), writing ('w'), or appending ('a').
  • Opening a file for writing erases any existing content.
  • Use file.readline to read a line from a file.
  • Use file.close to close an open file.
  • Use print >> file to print to a file.
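
A minimal sketch of the open/read/close cycle described above. It assumes the built-in open(name, mode), which takes the same arguments as file(name, mode) and also works in Python 3; the scratch file name is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "readings.txt")

writer = open(path, "w")       # 'w' erases any existing content
writer.write("first line\n")
writer.write("second line\n")
writer.close()

reader = open(path, "r")
raw = reader.readline()        # keeps the end-of-line character
clean = raw.strip()            # leading and trailing whitespace removed
reader.close()

appender = open(path, "a")     # 'a' adds to the end instead of erasing
appender.write("third line\n")
appender.close()

# The parts of a file are its lines, so a for loop visits each line in turn.
lines = []
reader = open(path, "r")
for line in reader:
    lines.append(line.strip())
reader.close()
```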

Standard Input and Output

  • The operating system automatically gives every program three open "files" called standard input, standard output, and standard error.
  • Standard input gets data from the keyboard, from a file when redirected with '<', or from the previous stage in a pipeline with '|'.
  • Standard output writes data to the screen, to a file when redirected with '>', or to the next stage in a pipeline with '|'.
  • Standard error also writes data to the screen, and is not redirected by '>' or '|'.
  • Use import library to import a library.
  • Use library.thing to refer to something imported from a library.
  • The sys library provides open "files" called sys.stdin and sys.stdout for standard input and output.

Repeating Things

  • Use for variable in something: to loop over the parts of something.
  • The body of a loop must be indented consistently.
  • The parts of a string are its characters; the parts of a file are its lines.

Making Choices

  • Use if test to do something only when a condition is true.
  • Use else to do something when a preceding if test is not true.
  • The body of an if or else must be indented consistently.
  • Combine tests using and and or.
  • Use '<', '<=', '>=', and '>' to compare numbers or strings.
  • Use '==' to test for equality and '!=' to test for inequality.
  • Use variable += expression as a shorthand for variable = variable + expression (and similarly for other arithmetic operations).

Flags

  • The two Boolean values True and False can be assigned to variables like any other values.
  • Programs often use Boolean values as flags to indicate whether something has happened yet or not.

Reading Data Files

  • Use str.split() to split a string into pieces on whitespace.
  • Values can be assigned to any number of variables at once.

Provenance Revisited

  • Put version numbers in programs' output to establish provenance for data.

Lists

  • Use [value, value, ...] to create a list of values.
  • for loops process the elements of a list, in order.
  • len(list) returns the length of a list.
  • [] is an empty list with no values.

More About Lists

  • Lists are mutable: they can be changed in place.
  • Use list.append(value) to append something to the end of a list.
  • Use list[index] to access a list element by location.
  • The index of the first element of a list is 0; the index of the last element is len(list)-1.
  • Negative indices count backward from the end of the list, so list[-1] is the last element.
  • Trying to access an element with an out-of-bounds index is an error.
  • range(number) produces the list of numbers [0, 1, ..., number-1].
  • range(len(list)) produces the list of legal indices for list.
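
The indexing rules above, sketched with a made-up list. One assumption to flag: in Python 3, range is lazy rather than a list, so list() is wrapped around it here to show the values.

```python
masses = [1.5, 2.0, 3.5]

masses.append(4.0)                  # lists are mutable: changed in place
first = masses[0]
last = masses[len(masses) - 1]
also_last = masses[-1]

# list() is only needed in Python 3, where range() does not build a list.
indices = list(range(len(masses)))

# Indexing one past the end is an error.
try:
    masses[len(masses)]
    out_of_bounds_ok = True
except IndexError:
    out_of_bounds_ok = False

# Looping over the legal indices visits every element by location.
doubled = []
for i in range(len(masses)):
    doubled.append(2 * masses[i])
```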

Checking and Smoothing Data

  • range(start, end) creates the list of numbers from start up to, but not including, end.
  • range(start, end, stride) creates the list of numbers from start up to, but not including, end in steps of stride.
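
A quick sketch of the two- and three-argument forms, again wrapping range in list() on the assumption that this is run under Python 3:

```python
window = list(range(3, 7))            # start up to, but not including, end
evens = list(range(0, 10, 2))         # every second number
count_down = list(range(5, 0, -1))    # a negative stride counts backward
```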

Nesting Loops

  • Use nested loops to do things for combinations of things.
  • Make the range of the inner loop depend on the state of the outer loop to automatically adjust how much data is processed.
  • Use min(...) and max(...) to find the minimum and maximum of any number of values.

Nesting Lists

  • Use nested lists to store multi-dimensional data or values that have regular internal structure (such as XYZ coordinates).
  • Use list_of_lists[first] to access an entire sub-list.
  • Use list_of_lists[first][second] to access a particular element of a sub-list.
  • Use nested loops to process nested lists.

Aliasing

  • Several variables can alias the same data.
  • If that data is mutable (e.g., a list), a change made through one variable is visible through all other aliases.
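
A short sketch of aliasing with a list; the names are illustrative only.

```python
original = [1, 2, 3]
alias = original              # a second name for the same list
alias.append(4)               # a change made through one name...
seen_through_original = original[-1]   # ...is visible through the other

# Building a new list from the old one breaks the alias.
independent = list(original)
independent.append(5)
```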

Functions and Libraries

How Functions Work

  • Define a function using def name(...).
  • The body of a function must be indented.
  • Use name(...) to call a function.
  • Use return to return a value from a function.
  • The values passed into a function are assigned to its parameters in left-to-right order.
  • Function calls are recorded on a call stack.
  • Every function call creates a new stack frame.
  • The variables in a stack frame are discarded when the function call completes.
  • Grouping operations in functions makes code easier to understand and re-use.

Global Variables

  • Every function always has access to variables defined in the global scope.
  • Programmers often write constants' names in upper case to make their intention easier to recognize.
  • Functions should not communicate by modifying global variables.

Multiple Arguments

  • A function may take any number of arguments.
  • Define default values for parameters to make functions more convenient to use.
  • Defining default values only makes sense when there are sensible defaults.

Returning Values

  • A function may return values at any point.
  • A function should have zero or more return statements at its start to handle special cases, and then one at the end to handle the general case.
  • "Accidentally" correct behavior is hard to understand.
  • If a function ends without an explicit return, it returns None.

Aliasing

  • Values are actually passed into functions by reference, which means that they are aliased.
  • Aliasing means that changes made to a mutable object like a list inside a function are visible after the function call completes.
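
A sketch of what this means in practice, using two made-up functions: one changes the list it is given, the other only reassigns its own parameter.

```python
def add_marker(samples):
    samples.append("checked")     # changes the caller's list in place

def rebind(samples):
    samples = ["new list"]        # reassigns the local name only

data = ["a", "b"]
add_marker(data)
rebind(data)
```

After both calls, data still refers to the original list, which now has "checked" at the end.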

Libraries

  • Any Python file can be imported as a library.
  • The code in a file is executed when it is imported.
  • Every Python file is a scope, just like every function.

Standard Libraries

  • Use from library import something to import something under its own name.
  • Use from library import something as alias to import something under the name alias.
  • from library import * imports everything in library under its own name, which is usually a bad idea.
  • The math library defines common mathematical constants and functions.
  • The system library sys defines constants and functions used in the interpreter itself.
  • sys.argv is a list of all the command-line arguments used to run the program.
  • sys.argv[0] is the program's name.
  • sys.argv[1:] is everything except the program's name.

Building Filters

  • If a program isn't told what files to process, it should process standard input.
  • Programs that explicitly test values' types are more brittle than ones that rely on those values' common properties.
  • The variable __name__ is assigned the string '__main__' in a module when that module is the main program, and the module's name when it is imported by something else.
  • If the first thing in a module or function is a string that isn't assigned to a variable, that string is used as the module or function's documentation.
  • Use help(name) to display the documentation for something.
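
A minimal sketch of a module laid out this way. The function name and sample data are hypothetical, and reading from standard input is left out to keep the example self-contained.

```python
def count_lines(lines):
    """Return the number of non-blank lines in a sequence of strings."""
    total = 0
    for line in lines:
        if line.strip():
            total += 1
    return total

def main():
    # Stand-in data; a real filter would read files or standard input here.
    sample = ["alpha\n", "\n", "beta\n"]
    print(count_lines(sample))

# Runs only when this file is the main program, not when it is imported.
if __name__ == '__main__':
    main()
```

With this layout, help(count_lines) shows the docstring, and importing the file gives access to count_lines without running main.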

Functions as Objects

  • A function is just another kind of data.
  • Defining a function creates a function object and assigns it to a variable.
  • Functions can be assigned to other variables, put in lists, and passed as parameters.
  • Writing higher-order functions helps eliminate redundancy in programs.
  • Use filter to select values from a list.
  • Use map to apply a function to each element of a list.
  • Use reduce to combine the elements of a list.
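
The three higher-order functions can be sketched as follows. Two assumptions for Python 3: reduce is imported from functools, and filter and map are wrapped in list() because they no longer return lists directly. The helper names are made up.

```python
from functools import reduce   # also available as a built-in in Python 2

def is_positive(x):
    return x > 0

def square(x):
    return x * x

def add(a, b):
    return a + b

values = [-2, -1, 0, 1, 2, 3]
positives = list(filter(is_positive, values))   # select
squares = list(map(square, positives))          # transform
total = reduce(add, squares)                    # combine

# Functions are data: they can be stored in lists and passed around.
def apply_all(functions, value):
    results = []
    for f in functions:
        results.append(f(value))
    return results

applied = apply_all([square, is_positive], 3)
```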

Databases

  • A relational database stores information in tables with fields and records.
  • A database manager is a program that manipulates a database.
  • The commands or queries given to a database manager are usually written in a specialized language called SQL.

Selecting

  • SQL is case insensitive.
  • The rows and columns of a database table aren't stored in any particular order.
  • Use SELECT fields FROM table to get all the values for specific fields from a single table.
  • Use SELECT * FROM table to select everything from a table.

Removing Duplicates

  • Use SELECT DISTINCT to eliminate duplicates from a query's output.

Calculating New Values

  • Use expressions in place of field names to calculate per-record values.

Filtering

  • Use WHERE test in a query to filter records based on logical tests.
  • Use AND and OR to combine tests in filters.
  • Use IN to test whether a value is in a set.
  • Build up queries a bit at a time, and test them against small data sets.

Sorting

  • Use ORDER BY field ASC (or DESC) to order a query's results in ascending (or descending) order.

Aggregation

  • Use aggregation functions like SUM and MAX to combine many query results into a single value.
  • Use the COUNT function to count the number of results.
  • If some fields are aggregated, and others are not, the database manager displays an arbitrary result for the unaggregated field.
  • Use GROUP BY to group values before aggregation.
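
To try these out without setting up a database server, here is a sketch using Python's built-in sqlite3 module and an in-memory database. The Readings table and its contents are invented for the example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table of measurements taken at two sites.
cursor.execute("CREATE TABLE Readings(site TEXT, value REAL)")
cursor.executemany("INSERT INTO Readings VALUES(?, ?)",
                   [("A", 1.0), ("A", 3.0), ("B", 5.0)])

# Aggregating over the whole table gives a single row.
cursor.execute("SELECT COUNT(*), SUM(value), MAX(value) FROM Readings")
overall = cursor.fetchone()

# GROUP BY aggregates each site separately.
cursor.execute("SELECT site, SUM(value) FROM Readings "
               "GROUP BY site ORDER BY site ASC")
per_site = cursor.fetchall()

connection.close()
```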

Database Design

  • Each field in a database table should store a single atomic value.
  • No fact in a database should ever be duplicated.

Combining Data

  • Use JOIN to create all possible combinations of records from two or more tables.
  • Use JOIN tables ON test to keep only those combinations that pass some test.
  • Use table.field to specify a particular field of a particular table.
  • Use aliases to make queries more readable.
  • Every record in a table should be uniquely identified by the value of its primary key.

Self Join

  • Use a self join to combine a table with itself.

Missing Data

  • Use NULL in place of missing information.
  • Almost every operation involving NULL produces NULL as a result.
  • Test for nulls using IS NULL and IS NOT NULL.
  • Most aggregation functions skip nulls when combining values.
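
A sketch of how NULL behaves, again assuming Python's sqlite3 module and an invented Survey table. Python's None is stored as NULL.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical survey in which one reading is missing.
cursor.execute("CREATE TABLE Survey(person TEXT, reading REAL)")
cursor.executemany("INSERT INTO Survey VALUES(?, ?)",
                   [("dyer", 7.5), ("lake", None), ("roe", 2.5)])

# Comparing with '=' never matches NULL, because the comparison yields NULL.
cursor.execute("SELECT COUNT(*) FROM Survey WHERE reading = NULL")
equals_null = cursor.fetchone()[0]

# IS NULL is the test that actually finds missing values.
cursor.execute("SELECT person FROM Survey WHERE reading IS NULL")
missing = cursor.fetchall()

# Arithmetic involving NULL produces NULL.
cursor.execute("SELECT NULL + 1")
null_plus_one = cursor.fetchone()[0]

# AVG skips the missing reading instead of treating it as zero.
cursor.execute("SELECT AVG(reading) FROM Survey")
average = cursor.fetchone()[0]

connection.close()
```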

Nested Queries

  • Use nested queries to create temporary sets of results for further querying.
  • Use nested queries to subtract unwanted results from all results to leave desired results.

Creating and Modifying Tables

  • Use CREATE TABLE name(...) to create a table.
  • Use DROP TABLE name to erase a table.
  • Specify field names and types when creating tables.
  • Specify PRIMARY KEY, NOT NULL, and other constraints when creating tables.
  • Use INSERT INTO table VALUES(...) to add records to a table.
  • Use DELETE FROM table WHERE test to erase records from a table.
  • Maintain referential integrity when creating or deleting information.

Transactions

  • Place operations in a transaction to ensure that they appear to be atomic, consistent, isolated, and durable.

Programming With Databases

  • Most applications that use databases embed SQL in a general-purpose programming language.
  • Database libraries use connections and cursors to manage interactions.
  • Programs can fetch all results at once, or a few results at a time.
  • If queries are constructed dynamically using input from users, malicious users may be able to inject their own commands into the queries.
  • Dynamically-constructed queries can use the database library's placeholders (parameter substitution) instead of string formatting to safeguard against such attacks.
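
The difference between pasting user input into a query and passing it as a parameter can be sketched with Python's sqlite3 module. The Users table and the malicious input are made up; the '?' placeholder is sqlite3's own syntax, and other database libraries may use a different marker.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE Users(name TEXT)")
cursor.executemany("INSERT INTO Users VALUES(?)", [("alice",), ("bob",)])

# Hypothetical input from a malicious user.
user_input = "nobody' OR '1'='1"

# Unsafe: the input becomes part of the SQL text and changes the query.
unsafe_query = "SELECT name FROM Users WHERE name = '" + user_input + "'"
cursor.execute(unsafe_query)
unsafe_rows = sorted(cursor.fetchall())

# Safer: the library passes the input as a value, never as SQL.
cursor.execute("SELECT name FROM Users WHERE name = ?", (user_input,))
safe_rows = cursor.fetchall()

connection.close()
```

The unsafe version returns every user, while the parameterized version finds nobody with that odd name.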

Number Crunching with NumPy

  • High-level libraries are usually more efficient for numerical programming than hand-coded loops.
  • Most such libraries use a data-parallel programming model.
  • Arrays can be used as matrices, as physical grids, or to store general multi-dimensional data.

Basics

  • NumPy is a high-level array library for Python.
  • import numpy to import NumPy into a program.
  • Use numpy.array(values) to create an array.
  • Initial values must be provided in a list (or a list of lists).
  • NumPy arrays store homogeneous values whose type is identified by array.dtype.
  • Use old.astype(newtype) to create a new array with a different type rather than assigning to dtype.
  • numpy.zeros creates a new array filled with 0.
  • numpy.ones creates a new array filled with 1.
  • numpy.identity creates a new identity matrix.
  • numpy.empty creates an array but does not initialize its values (which means they are unpredictable).
  • Assigning an array to a variable creates an alias rather than copying the array.
  • Use array.copy to create a copy of an array.
  • Put all array indices in a single set of square brackets, like array[i0, i1].
  • array.shape is a tuple of the array's size in each dimension.
  • array.size is the total number of elements in the array.

Storage

  • Arrays are stored using descriptors and data blocks.
  • Many operations create a new descriptor, but alias the original data block.
  • Array elements are stored in row-major order.
  • array.transpose creates a transposed alias for an array's data.
  • array.ravel creates a one-dimensional alias for an array's data.
  • array.reshape creates an arbitrarily-shaped alias for an array's data.
  • array.resize resizes an array's data in place, filling with zero as necessary.

Indexing

  • Arrays can be sliced using start:end:stride along each axis.
  • Values can be assigned to slices as well as read from them.
  • Arrays can be used as subscripts to select items in arbitrary ways.
  • Masks containing True and False can be used to select subsets of elements from arrays.
  • Use '&' and '|' (or logical_and and logical_or) to combine tests when subscripting arrays.
  • Use where, choose, or select to select elements or alternatives in a single step.

Linear Algebra

  • Addition, multiplication, and other arithmetic operations work on arrays element-by-element.
  • Operations involving arrays and scalars combine the scalar with each element of the array.
  • array.dot performs "real" matrix multiplication.
  • array.sum calculates sums or partial sums of array elements.
  • array.mean calculates array averages.

Making Recommendations

  • Getting data in the right format for processing often requires more code than actually processing it.
  • Data with many gaps should be stored in sparse arrays.
  • numpy.cov calculates variances and covariances.

The Game of Life

  • Padding arrays with fixed elements is an easy way to implement boundary conditions.
  • scipy.signal.convolve applies a weighted mask to each element of an array.

Quality

Defensive Programming

  • Design programs to catch both internal errors and usage errors.
  • Use assertions to check whether things that ought to be true in a program actually are.
  • Assertions help people understand how programs work.
  • Fail early, fail often.
  • When bugs are fixed, add assertions to the program to prevent their reappearance.
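
A sketch of assertions used as checks on inputs and results. The normalize function is a made-up example, not something from the lessons.

```python
def normalize(values):
    """Rescale positive values so that they sum to 1.0."""
    # Check what must be true before doing any work.
    assert len(values) > 0, "need at least one value"
    total = sum(values)
    assert total > 0, "values must sum to a positive number"

    result = []
    for v in values:
        result.append(v / float(total))

    # Check what must be true afterwards.
    assert abs(sum(result) - 1.0) < 1e-9, "result should sum to 1.0"
    return result

shares = normalize([1, 1, 2])

# Bad input fails early instead of producing a confusing result later.
try:
    normalize([])
    empty_rejected = False
except AssertionError:
    empty_rejected = True
```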

Handling Errors

  • Use raise to raise exceptions.
  • Raise exceptions to report errors rather than trying to handle them inline.
  • Use try and except to handle exceptions.
  • Catch exceptions where something useful can be done about the underlying problem.
  • An exception raised in a function may be caught anywhere in the active call stack.
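
As an illustration, with hypothetical function names: the error is raised deep inside one function and caught further up the call stack, where something useful can be done about it.

```python
def read_positive(text):
    value = float(text)            # raises ValueError if text is not a number
    if value <= 0:
        raise ValueError("expected a positive number, got " + text)
    return value

def total_of(texts):
    # No error handling here: exceptions pass through to the caller.
    total = 0.0
    for t in texts:
        total += read_positive(t)
    return total

try:
    result = total_of(["1.5", "-2", "3"])
    message = ""
except ValueError as error:
    result = None
    message = str(error)
```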

Unit Testing

  • Testing cannot prove that a program is correct, but is still worth doing.
  • Use a unit testing library like Nose to test short pieces of code.
  • Write each test as a function that creates a fixture, executes an operation, and checks the result using assertions.
  • Every test should be able to run independently: tests should not depend on one another.
  • Focus testing on boundary cases.
  • Writing tests helps us design better code by clarifying our intentions.
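
A sketch of what such tests might look like. The running_total function is invented for the example; the tests are plain functions whose names start with test_, which is the naming convention Nose uses to find them.

```python
def running_total(values):
    """Return the cumulative sums of a list of numbers."""
    result = []
    total = 0
    for v in values:
        total += v
        result.append(total)
    return result

# Each test builds its own fixture, runs one operation, and checks the result.
def test_empty_list():
    assert running_total([]) == []

def test_single_value():
    assert running_total([5]) == [5]

def test_several_values():
    assert running_total([1, 2, 3]) == [1, 3, 6]
```

Because no test relies on another, they can be run in any order. The first two cover the boundary cases.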

Numbers

  • Floating point numbers are approximations to actual values.
  • Use tolerances rather than exact equality when comparing floating point values.
  • Use integers to count and floating point numbers to measure.
  • Most tests should be written in terms of relative error rather than absolute error.
  • When testing scientific software, compare results to exact analytic solutions, experimental data, or results from simpler or previously-tested programs.
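
A small sketch of why exact equality is risky. The tolerance of 1e-9 is an arbitrary choice for illustration; a sensible value depends on the calculation.

```python
def relative_error(actual, expected):
    """Size of the error as a fraction of the expected value."""
    return abs(actual - expected) / abs(expected)

# Adding 0.1 ten times does not give exactly 1.0 in floating point.
total = 0.0
for i in range(10):
    total += 0.1

exactly_equal = (total == 1.0)
close_enough = relative_error(total, 1.0) < 1e-9
```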

Coverage

  • Use a coverage analyzer to see which parts of a program have been tested and which have not.

Debugging

  • Use an interactive symbolic debugger instead of print statements to diagnose problems.
  • Set breakpoints to halt the program at interesting points instead of stepping through execution.
  • Try to get things right the first time.
  • Make sure you know what the program is supposed to do before trying to debug it.
  • Make sure the program is actually running the test case you think it is.
  • Make the program fail reliably.
  • Simplify the test case or the program in order to localize the problem.
  • Change one thing at a time.
  • Be humble.

Designing Testable Code

  • Separating interface from implementation makes code easier to test and re-use.
  • Replace some components with simplified versions of themselves in order to simplify testing of other components.
  • Do not create arbitrary, variable, or random results, as they are extremely hard to test.
  • Isolate interactions with the outside world when writing tests.

Sets and Dictionaries

Sets

  • Use sets to store distinct values.
  • Create sets using set() or {v1, v2, ...}.
  • Sets are mutable, i.e., they can be updated in place like lists.
  • A loop over a set produces each element once, in arbitrary order.
  • Use sets to find unique things.

Storage

  • Sets are stored in hash tables, which guarantee fast access for arbitrary keys.
  • The values in sets must be immutable to prevent hash tables misplacing them.
  • Use tuples to store multi-part elements in sets.
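
A sketch of sets in use, including the restriction on mutable elements. The data is made up.

```python
# Collect the distinct characters in a string.
seen = set()
for base in "GATTACA":
    seen.add(base)              # adding a repeated value has no effect

# Tuples are immutable, so they can be set elements.
positions = set()
positions.add((0, 1))
positions.add((0, 1))           # duplicate, ignored
positions.add((2, 3))

# Lists are mutable, so trying to add one is an error.
try:
    positions.add([4, 5])
    list_allowed = True
except TypeError:
    list_allowed = False
```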

Dictionaries

  • Use dictionaries to store key-value pairs with distinct keys.
  • Create dictionaries using {k1:v1, k2:v2, ...}.
  • Dictionaries are mutable, i.e., they can be updated in place.
  • Dictionary keys must be immutable, but values can be anything.
  • Use tuples to store multi-part keys in dictionaries.
  • dict[key] refers to the dictionary entry with a particular key.
  • key in dict tests whether a key is in a dictionary.
  • len(dict) returns the number of entries in a dictionary.
  • A loop over a dictionary produces each key once, in arbitrary order.
  • dict.keys() creates a list of the keys in a dictionary.
  • dict.values() creates a list of the values in a dictionary.
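
The basic dictionary operations, sketched with a made-up dictionary of birth years. In Python 3, keys() and values() return views rather than lists, so sorted() is used here to get lists in a predictable order.

```python
birthdays = {'Newton': 1642, 'Darwin': 1809}

birthdays['Turing'] = 1912          # dictionaries can be updated in place
darwin_year = birthdays['Darwin']
has_turing = 'Turing' in birthdays
has_curie = 'Curie' in birthdays
how_many = len(birthdays)

# A loop over a dictionary produces its keys, in no guaranteed order.
names = []
for name in birthdays:
    names.append(name)
names.sort()

keys = sorted(birthdays.keys())
years = sorted(birthdays.values())
```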

Simple Examples

  • Use dictionaries to count things.
  • Initialize values from actual data instead of trying to guess what values could "never" occur.
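
Both points in a short sketch with invented data. The second half shows why guessing a starting value is risky: with all-negative readings, starting the maximum at 0 would give the wrong answer.

```python
# Count how often each word occurs.
words = ["a", "b", "a", "c", "a"]
counts = {}
for w in words:
    if w in counts:
        counts[w] += 1
    else:
        counts[w] = 1

# Find the largest reading, starting from an actual data value.
readings = [-3.5, -1.0, -7.25]
largest = readings[0]
for r in readings:
    if r > largest:
        largest = r
```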

Phylogenetic Trees

  • Problems that are described using matrices can often be solved more efficiently using dictionaries.
  • When using tuples as multi-part dictionary keys, order the tuple entries to avoid accidental duplication.
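
One way to do this is a small helper that always puts the pair in the same order. The pair_key name and the distance value are hypothetical.

```python
def pair_key(a, b):
    """Return the two names as a tuple in sorted order."""
    if a <= b:
        return (a, b)
    return (b, a)

distances = {}
distances[pair_key('human', 'chimp')] = 1.2

# The same pair in the other order maps to the same entry.
distances[pair_key('chimp', 'human')] = 1.2
entries = len(distances)
```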

Development

The Grid

  • Get something simple working, then start to add features, rather than putting everything in the program at the start.
  • Leave FIXME markers in programs as you are developing them to remind yourself what still needs to be done.

Aliasing

  • Draw pictures of data structures to aid debugging.

Randomness

  • Use a well-tested random number generation library to generate pseudorandom values.
  • If a random number generation library is given the same seed, it will produce the same sequence of values.
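
A sketch using Python's standard random module. The seed value and the dice-rolling function are arbitrary choices for illustration.

```python
import random

def roll_dice(seed, count):
    """Return count pseudorandom die rolls generated from the given seed."""
    random.seed(seed)
    rolls = []
    for i in range(count):
        rolls.append(random.randint(1, 6))   # both ends are included
    return rolls

first = roll_dice(1234, 5)
second = roll_dice(1234, 5)     # same seed, so the same sequence
```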

Neighbors

  • and and or stop evaluating arguments as soon as they have an answer.
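
This matters when a later test is only safe if an earlier one passed. A sketch with a made-up grid: the bounds checks come first, so the indexing at the end is never reached for an out-of-range cell.

```python
def is_filled(grid, x, y):
    # Evaluation stops at the first False, so grid[x][y] is only
    # reached when x and y are legal indices.
    return (0 <= x) and (x < len(grid)) and \
           (0 <= y) and (y < len(grid[x])) and \
           grid[x][y]

grid = [[True, False],
        [False, True]]

inside = is_filled(grid, 1, 1)
outside = is_filled(grid, 5, 0)     # no IndexError is raised
negative = is_filled(grid, 0, -3)
```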

Bugs

  • Test programs with successively more complex cases.

Refactoring

  • Refactor programs as necessary to make testing easier.
  • Replace randomness with predictability to make testing easier.

Performance

  • Scientists want faster programs both to handle bigger problems and to handle more problems with available resources.
  • Before speeding a program up, ask, "Does it need to be faster?" and, "Is it correct?"
  • Recording start and end times is a simple way to measure performance.
  • Analyze algorithms to predict how a program's performance will change with problem size.

Profiling

  • Use a profiler to determine which parts of a program are responsible for most of its running time.

A New Beginning

  • Better algorithms are better than better hardware.
