Pipes and Filters

slide 01 Hello, and welcome to the fourth episode of the Software Carpentry lecture on the Unix shell. In this episode, we’ll look at what makes the shell so powerful: the ease with which it lets you combine existing programs in new ways.
slide 02 As we saw in previous episodes, a shell is a program that takes commands from the user, tells the computer to run the corresponding programs, and shows the user their output.
slide 03 We’ve already seen commands to move around the filesystem, and to create, rename, copy, and delete files and directories.
slide 04 What we’ll see in this episode is that commands like these are much more powerful when they’re combined.
slide 05 We’ll start with a directory called molecules that contains six files describing some simple organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.
slide 06 Let’s go into that directory with cd…
slide 07 …and run the command wc *.pdb.
slide 08 The * in *.pdb is a wildcard character.
slide 09 It matches zero or more characters.
slide 10 So the shell expands the expression *.pdb to be the complete list of .pdb files.
slide 11 The shell does this before wc runs, so the actual command is wc cubane.pdb ethane.pdb and so on.
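One way to watch the expansion happen is to hand the wildcard to echo, which simply prints whatever arguments it receives. This is a minimal sketch, assuming a scratch directory made with mktemp; notes.txt is a hypothetical extra file added only to show that it is not matched.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Empty placeholder files; only their names matter here.
touch cubane.pdb ethane.pdb notes.txt

# The shell replaces *.pdb with the matching names before echo runs,
# so echo never sees the * at all.
expanded=$(echo *.pdb)
echo "$expanded"
```

Because echo prints the names rather than the pattern, the expansion must have been done by the shell, not by the command.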
slide 12 wc stands for “word count”.
slide 13 It counts the number of lines, words, and characters in files.
slide 14 Its output, shown here, prints these values in columns: lines, words, characters, and the filename, one line per file, with a line for the total at the end.
slide 15 If we run wc -l instead, our output shows only the number of lines per file.
slide 16 We can use -w to get only the number of words, or -c to get only the number of characters (strictly speaking, bytes, which is the same thing for plain ASCII text).
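The flags can be tried on a tiny file whose counts are easy to check by hand. This is a sketch under assumptions: sample.pdb and its three lines are made-up placeholder text, not real Protein Data Bank data.

```shell
# Scratch directory and a small placeholder file (hypothetical contents).
workdir=$(mktemp -d)
cd "$workdir"
printf 'ATOM 1 C\nATOM 2 H\nEND\n' > sample.pdb

wc sample.pdb      # all three counts, then the filename
wc -l sample.pdb   # lines only
wc -w sample.pdb   # words only
wc -c sample.pdb   # bytes only
```

The exact column spacing varies between systems, so it is safer to rely on the order of the columns than on their width.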
slide 17 Now, which of these files is shortest?
slide 18 It’s an easy question to answer when there are only six files…
slide 19 …but what if there were 6000? That’s the kind of job we want a computer to do.
slide 20 Our first step toward a solution is to run the command wc -l *.pdb > lengths.
slide 21 > tells the shell to redirect the output to a file instead of printing it to the screen.
slide 22 The shell will create the file if it doesn’t exist…
slide 23 …or overwrite its contents if it does.
slide 24 Notice that there is no screen output: everything that wc would have printed has gone into the file lengths instead.
slide 25 ls lengths confirms that the file exists.
slide 26 And we can print its contents to the screen using cat lengths.
slide 27 cat stands for “concatenate”: it prints the contents of files one after another.
slide 28 In this case, there’s only one file, so cat just shows us what’s in it.
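The redirection steps so far might look like the sketch below. The file contents are placeholders invented for illustration; the first echo exists only to show that > replaces whatever the file held before.

```shell
# Scratch directory with two placeholder files of different lengths.
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\nd\ne\n' > cubane.pdb
printf 'a\nb\nc\n' > ethane.pdb

# Put something in lengths first, to show that > overwrites it.
echo "old contents" > lengths

# Nothing is printed to the screen: the output goes into the file.
wc -l *.pdb > lengths

ls lengths    # confirms the file exists
cat lengths   # one line per file, plus a total
```

Note that lengths does not end in .pdb, so the wildcard does not pick it up on later runs.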
slide 29 Now let’s use the sort command to sort its contents. This does not change the file: instead, it prints the sorted lines to the screen as shown here.
slide 30 We can put the sorted list of lines in another temporary file called sorted-lengths by putting > sorted-lengths after the command, just as we used > lengths to put the output of wc into lengths.
slide 31 And now, we can run another command called head to get the first few lines in sorted-lengths.
slide 32 Giving head the argument -1 tells it that we only want the first line of the file; -20 would get the first 20, and so on.
slide 33 This must be the file with the fewest lines, since sorted-lengths holds files and their line counts in order from the least to the most.
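Put together, the version with intermediate files could be sketched as follows. Two choices here are substitutions rather than what the slides show: sort -n is used so that the counts are compared as numbers whatever their width, and head -n 1 is the POSIX spelling of head -1. The file contents are placeholders.

```shell
# Scratch directory with two placeholder files of different lengths.
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\nd\ne\n' > cubane.pdb
printf 'a\nb\nc\n' > ethane.pdb

wc -l *.pdb > lengths              # step 1: count lines
sort -n lengths > sorted-lengths   # step 2: order by count, smallest first
head -n 1 sorted-lengths           # step 3: show the ethane.pdb line (fewest lines)
```

The "total" line has the largest count, so it sorts to the end and does not interfere with picking the shortest file.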
slide 34 If you think this is confusing, you’re in good company: even once you understand what wc, sort, and head do, all those intermediate files make it hard to follow what’s going on. How can we make it easier to understand?
slide 35 Let’s start by getting rid of the sorted-lengths file by running the sort and head commands together.
slide 36 That vertical bar between them is called a pipe.
slide 37 It tells the shell that we want to take the output of the command on the left…
slide 38 …and use it as the input to the command on the right…
slide 39 …without explicitly creating a temporary file. The computer can create such a file itself if it wants to, or run the two programs simultaneously and pass data from one to the other through memory without ever putting it on disk: we don’t have to know or care.
slide 40 Well, if we don’t need to create the temporary file sorted-lengths, can we get rid of the lengths file too? The answer is “yes”: we can use another pipe to send the output of wc directly to sort, which then sends its output to head. This is exactly like a mathematician nesting functions and saying “the square of the sine of x times π”: in our case, the calculation is “head of sort of word count of *.pdb”.
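The whole calculation as a single pipeline might look like this sketch. As assumptions for illustration, the file contents are placeholders, sort -n is used for a numeric comparison, and head -n 1 is the POSIX spelling of head -1.

```shell
# Scratch directory with two placeholder files of different lengths.
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\nd\ne\n' > cubane.pdb
printf 'a\nb\nc\n' > ethane.pdb

# wc feeds sort, sort feeds head; no intermediate files are created.
shortest=$(wc -l *.pdb | sort -n | head -n 1)
echo "$shortest"

ls   # still only the two .pdb files
```

Capturing the result in the variable shortest is just a convenience for this example; typing the pipeline on its own prints the same line to the screen.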
slide 41 This simple idea is why Unix has been so successful.
slide 42 Instead of creating enormous programs that try to do many different things, Unix programmers focus on creating lots of simple tools that:
slide 43 each do one job well, and
slide 44 work well with each other.
slide 45 Ten such tools can be combined in 100 ways, and that’s only looking at pairings: when we start to look at pipes with multiple stages, the possibilities are almost endless.
slide 46 Here’s what actually happens behind the scenes when we create a pipe. We’ll use an octagon to show a running program.
slide 47 The technical term for this is a process: it’s a program that’s actually loaded into memory and “live”.
slide 48 Every process has an input channel called standard input. By this point, you may be surprised that the name is so memorable, but don’t worry:
slide 49 most Unix programmers call it stdin, just to be safe.
slide 50 Every process also has a default output channel called standard output, or stdout.
slide 51 When we run a program normally, the shell temporarily sends whatever we type on our keyboard to the process’s stdin, and sends whatever the process prints to stdout to our computer’s screen.
slide 52 For example, if we run wc -l *.pdb > lengths
slide 53 …the shell starts by telling the computer to create a new process to run the wc program.
slide 54 Since we’ve provided some filenames as arguments, wc reads from them instead of from standard input.
slide 55 And since we’ve used > to redirect output to a file, the shell connects the process’s standard output to that file.
slide 56 Here’s what happens when we run wc -l *.pdb | sort instead. The shell creates two processes, one for each component of the pipe, so that wc and sort run simultaneously. The standard output of wc is fed directly to the standard input of sort; since there’s no redirection with >, sort’s output goes to the screen.
slide 57 And if we run wc -l *.pdb | sort | head -1, we get the three processes shown here, with data flowing from the files, through wc to sort, and from sort through head to the screen.
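A rough way to see that the stages of a pipe run at the same time is to time two commands that each take two seconds. This is only a sketch: sleep ignores its input, so no data actually flows, and date +%s (seconds since 1970) is assumed to be available, as it is with both GNU and BSD date.

```shell
start=$(date +%s)

# Two processes, started together; the shell waits for both to finish.
sleep 2 | sleep 2

end=$(date +%s)
elapsed=$((end - start))
echo "elapsed: ${elapsed}s"
```

If the two commands ran one after the other, this would take about four seconds; since they run side by side, it takes about two.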
slide 58 This programming model is called pipes and filters.
slide 59 A filter is a program that transforms a stream of input into a stream of output. Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read from stdin, do something to what they’ve read, and write to stdout.
slide 60 A pipe is just a connection between two filters. Behind the scenes, the computer may do some clever things to move data around, but from the user’s point of view, all a pipe does is move bytes from one process to another.
slide 61 The key is that any program that reads lines of text from standard input, and writes lines of text to standard output, can work with every other program that behaves this way as well.
slide 62 You can and should write your programs this way, so that you and other people can put those programs into pipes to multiply their power.
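As an illustration of writing a filter of our own, here is a minimal sketch in the shell itself. The function at_most is hypothetical, invented for this example: it reads "count name" lines from standard input and writes only those whose count is no larger than a limit to standard output. The input lines are made-up data.

```shell
# Hypothetical filter: keep lines whose first field is <= the given limit.
at_most() {
    limit=$1
    while read -r count name; do
        if [ "$count" -le "$limit" ]; then
            printf '%s %s\n' "$count" "$name"
        fi
    done
}

# Because it reads stdin and writes stdout, it slots into a pipe
# alongside standard tools such as sort.
result=$(printf '9 cubane.pdb\n4 example.pdb\n3 ethane.pdb\n' | at_most 5 | sort -n)
echo "$result"
```

The filter knows nothing about where its input comes from or where its output goes, which is exactly what lets it be combined with other programs.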
slide 63 To summarize, we now have a bunch of commands for moving around the file system…
slide 64 …and three for working with text: wc to count things, sort to sort them, and head to select lines from the front of a file.
slide 65 After this episode is over, please go and explore a few other simple text processing commands, such as tail, split, cut, and uniq. Remember, each tool you learn multiplies the power of the tools you already know.
slide 66 We’ve also met three more special characters: the pattern-matching wildcard *, redirection with >, and most important of all, the pipe |, which allows us to connect processes together.
slide 67 Again, once this episode is over, please take a moment to find out what two other characters do: <, which redirects input, and ?, a wildcard that matches a single character instead of any number.
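As a starting point for that exploration, the sketch below tries both characters on two placeholder files. The names p1.pdb and p22.pdb are hypothetical, chosen so that one has a single character between "p" and ".pdb" and the other has two.

```shell
# Scratch directory with two placeholder files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'x\ny\n' > p1.pdb
printf 'x\n' > p22.pdb

# ? stands for exactly one character, * for any number of them.
single=$(echo p?.pdb)
any=$(echo p*.pdb)
echo "$single"
echo "$any"

# < connects the file to wc's standard input, so wc is given no
# filename and prints only the count.
count=$(wc -l < p1.pdb)
echo "$count"
```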
slide 68 In our next episode, we’ll have a look at how Unix controls who can do what to files and directories.

  1. Armindo Salvador
    November 21st, 2010 at 17:59 | #1

    Thanks for the nice tutorial. But if I understood your explanation correctly then there is a mistake in the example in slide 41, where you explain how to get rid of both intermediate files using pipes. The command should read as follows:

    wc -l *.pdb | sort | head -1

    –Armindo

  2. December 24th, 2010 at 04:41 | #2

    Please correct the commands

    ….| sort lengths | ….

    to

    …. | sort | ….

    Thank you for the great tutorials.

  3. January 16th, 2011 at 21:37 | #3

    Bump on the previous requests to correct | sort lengths | to | sort | (I only read through the slides and didn’t watch the video, so I’m not sure whether the error is in the video as well). Also, there is a typo in the text next to the 7th slide from the end. It should read “You can and should”, not “You and and should”.

  4. January 19th, 2011 at 18:34 | #4

    Some things that are useful to know before starting the mini-homework.

    1. You get help on cygwin commands with a –help option or with the info command, e.g.,
    split –help
    info split

    (man -k split and help split would also be possibilities, but in this case return nothing useful.)

    2. When the help page tells you an option takes the form –option=VALUE, you don’t type the equals sign with the short form of the option, e.g.,

    In the split help text, the lines option is given as

    -l, –lines=NUMBER put NUMBER lines per output file

    You type
    split -l10
    not
    split -l=10

  5. January 19th, 2011 at 18:36 | #5

    @Richie Cotton
    Commenting system has turned my double hyphens into ndashes. That’s split dashdashhelp above.