
Advanced Tricks

August 2nd, 2011
slide 001 Hello and welcome to the Advanced Shell Tricks episode of the Software Carpentry lectures on the Unix shell. In this episode, we’ll look at some handy advanced shell techniques that can save you time.
slide 002 So, in general, you encounter a technical problem and are wondering how to solve it…
slide 003 For example, on your iPhone or Android smartphone you may hear, “There’s an app for that… check this out!”
slide 004 Whereas Unix shell programmers will say “There’s a shell trick for that… check this out”… whilst perhaps recommending you upgrade to an Android smartphone.
slide 005 In previous episodes, we’ve seen how to do a number of things. Combine existing programs using pipes and filters. For example, counting the number of lines in all pdb files, then sorting those results and picking the top one (i.e. the result with the greatest number of lines).
slide 006 Redirect output from programs to files. For example, counting the number of lines in all pdb files and storing the results in a file.
slide 007 Use variables to control program operation. For example, creating a new variable called SECRET_IDENTITY and assigning the value “Dracula” to it.
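As a quick refresher, the three techniques recapped above might look like the sketch below. It is only an illustration under some assumptions: it runs in a scratch directory, the files are tiny made-up stand-ins rather than real PDB data, and the names ethane.pdb and lengths are hypothetical.

```shell
# Work in a scratch directory with made-up stand-in files (not real PDB data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a\nb\nc\n' > cubane.pdb     # 3 lines
printf 'a\n' > ethane.pdb           # 1 line (hypothetical file name)

# Pipes and filters: count lines per file, drop the "total" summary line,
# sort numerically with the largest count first, and keep the top entry.
longest=$(wc -l *.pdb | grep -v total | sort -n -r | head -n 1)
echo "$longest"

# Redirection: store the line counts in a file instead of displaying them.
wc -l *.pdb > lengths

# Variables: create SECRET_IDENTITY and read it back.
SECRET_IDENTITY=Dracula
echo "$SECRET_IDENTITY"
```

The grep step is there because wc adds a "total" line when given several files, which would otherwise win the sort.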
slide 008 And this is one of the true strengths of the Unix shell; the way you can compose all these techniques together. We’ll be seeing more of this in this episode!
slide 009 We’ve covered some very handy techniques already, but of course we can go further. Redirection for example has some other very useful tricks you can easily learn and use which we’ll look at.
slide 010 We’ve already seen how we can redirect program output to a file, but what else can we do with redirection? Let’s revisit our pdb files that we’ve seen in a previous episode. As a reminder, PDB files are Protein Data Bank format files.
slide 011 So with this command, we can list all the files with a pdb filename extension in the current directory, redirecting the results to a file we call “files”.
slide 012 This is all made possible by the “redirection” operator.
slide 013 But what about adding this together with other results generated later? In general, this would be very useful for any time we just want to add things to an existing file. In this example, let us consider a further set of protein data files in the older “ent” format… how do we perform the same ls operation and add these results to the previous results in “files”?
slide 014 We can first get a list of the files with an ent extension…
slide 015 …and put that list in a separate file.
slide 016 Then, we can use our concatenate “cat” command to create a new file which has the contents of both files in it. But it’s a bit long winded — couldn’t we just “append” the results to our existing file “files”? The answer is yes!
slide 017 This command does exactly that — the output from this command is redirected to a file as before, but in an “append” sense.
slide 018 Also note that if the file doesn’t exist beforehand, it is created.
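The two routes described above can be sketched as follows. The slides' exact commands are not reproduced in this transcript, so treat this as one plausible version: the .ent file names, ent-files, all-files and brand-new-file are all hypothetical, and the files are empty stand-ins.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch cubane.pdb ethane.pdb old1.ent old2.ent   # empty stand-in files

# Long-winded route: two separate lists, then concatenate into a third file.
ls *.pdb > files
ls *.ent > ent-files
cat files ent-files > all-files

# Shorter route: append the second listing straight onto "files".
ls *.ent >> files

# ">>" also creates the target file if it does not exist yet.
ls *.ent >> brand-new-file
```

After this, files and all-files hold the same four names, which is the point of the append operator: same result, one step fewer.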
slide 019 Now let’s look at something a little different…
slide 020 So we know ls operates within its own process, and all output is normally directed through its parent “shell” process to the display.
slide 021 But in the case of redirection, ls still operates within its own process, but instead of its output being directed through its parent shell process to the display, it is redirected to a file.
slide 022 Now here’s something to ponder… what happens with error messages?
slide 023 For example, running ls on a non-existent path will give an error.
slide 024 Ok, no problem so far…
slide 025 But why isn’t the error message in “files”? This is the big question! Why, and how, are we seeing it?
slide 026 In essence, this is because the standard output and standard error are separate “channels”.
slide 027 So what was happening with the previous example? This is how we looked at it before, with standard output.
slide 028 So let’s expand on this a little, by adding in the standard error.
slide 029 Now we see that the standard error is not being redirected, like the standard output, to a file.
slide 030 Perhaps unsurprisingly, there is a way to capture standard error as well using the Unix shell.
slide 031 So this “2>” operator deals with the redirection of standard error only.
slide 032 As you might expect, error messages end up in our error-log file.
slide 033 We can redirect both stdout and stderr like this. Because the two channels go to different files here, the order in which we add the two redirections to the command doesn’t matter.
slide 034 So we can redirect both standard error and standard output simultaneously.
slide 035 Of course, we could add other directories into this list too, perhaps one that does exist.
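A minimal sketch of these redirections, assuming a scratch directory: /some/nonexistent/path is just a placeholder for any path that does not exist, and real-dir is a hypothetical directory that does.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir real-dir
touch real-dir/cubane.pdb

# Standard output goes to "files"; the error message still reaches the screen.
ls /some/nonexistent/path > files

# "2>" redirects standard error only, so the message lands in error-log.
ls /some/nonexistent/path 2> error-log

# Both channels at once, each to its own file. With two different target
# files, the two redirections can be written in either order.
ls /some/nonexistent/path real-dir > files 2> error-log
ls /some/nonexistent/path real-dir 2> error-log > files
```

The exact wording of the error message differs between systems, but it should end up in error-log while the listing of real-dir ends up in files.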
slide 036 So what’s this number 2 all about?
slide 037 The “2” refers to the standard error channel, whilst “1” refers to the standard output.
slide 038 By default, “>” on its own refers to standard output, so we could remove the “1” before the first greater-than sign for the same effect.
slide 039 “&>” is a useful bash shorthand if you want a single log of everything: it sends both standard output and standard error to the same file.
slide 040 We can even use append here as well.
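The channel numbers and the single-log idea can be sketched like this. The example sticks to the portable "2>&1" spelling so it runs in any POSIX shell; in bash, "&> everything" is shorthand for the same thing. The file name everything and the directory real-dir are assumptions for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir real-dir
touch real-dir/cubane.pdb

# Explicit channel numbers: 1 is standard output, 2 is standard error.
ls /some/nonexistent/path real-dir 1> files 2> error-log

# A single log of everything: send stdout to the file, then point stderr at
# wherever stdout now goes. In bash this can be written "&> everything".
ls /some/nonexistent/path real-dir > everything 2>&1

# The appending variant (bash 4 and later also accept "&>> everything").
ls /some/nonexistent/path real-dir >> everything 2>&1
```

Note that in the "2>&1" form the order does matter: writing "2>&1 > everything" would leave the error message on the screen, because standard error is duplicated before standard output is moved to the file.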
slide 041 To summarise part 1, we’ve looked in more depth at redirection. We’ve looked at redirecting standard output and standard error to a file, overwriting anything in the file previously, or creating it if it doesn’t exist.
slide 042 We also looked at redirecting standard output and standard error to a file, but appending it to the contents of a file (although if the file doesn’t exist it is created).
slide 043 Now for something completely different… We’ve already seen how pipes and filters work with using a single program on some input data.
slide 044 i.e. you have a program which takes some arguments, the program processes these arguments, and some results are output.
slide 045 But what about running the same program separately, for each input? i.e. doing each of these program runs in sequence, one after the other.
slide 046 So instead, we want to run the program separately on each argument. We could of course do this manually, but what if we need to do this with a great many arguments? This wouldn’t be terribly efficient!
slide 047 The good news is that there is a well known programming concept which you can use, called loops. Loops are very useful – it is difficult to overstate just how useful these can be…
slide 048 So how can they help us with this situation?
slide 049 You’ve probably encountered compressed files before (like .zip files). Compression is a common technique for reducing the size of a number of files whilst packaging them into a single, easy-to-manage file. If we consider these pdb files as large files, perhaps we want to email each of them to different individuals.
slide 050 We can use the zip command very easily to compress our cubane.pdb into a zip file.
slide 051 We can see that it has compressed the file by 73%. Zip is a handy tool in itself which can also work with directories and their contents. As an aside, you can use e.g. “unzip cubane.zip” to decompress the zip file and extract the cubane.pdb file.
slide 052 The first argument is the name of the zip file we wish to create.
slide 053 The second is a list of files (just one in this case) which we want to add to the zip file.
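Put together, the zip and unzip calls described above might look like the sketch below. It assumes the zip and unzip tools are installed, and it uses a small repetitive stand-in for cubane.pdb, so the compression ratio will not match the 73% quoted earlier. The extracted directory is a hypothetical name.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# A repetitive stand-in file so there is something to compress.
for i in 1 2 3 4 5 6 7 8 9 10; do echo "ATOM placeholder line"; done > cubane.pdb

# First argument: the zip file to create. Remaining arguments: files to add.
zip cubane.zip cubane.pdb

# To check the round trip, extract into a separate directory.
mkdir extracted
unzip cubane.zip -d extracted
```

Extracting into a separate directory avoids unzip asking whether to overwrite the original cubane.pdb.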
slide 054 This would obviously take too long if we were looking at, say, a hundred files. So how can we automate this using loops?
slide 055 Using a loop, we can iterate over each file, and run zip on each of them. So how does this work?
slide 056 The first part says we wish to iterate over our pdb files.
slide 057 And on each iteration, let us run our zip command.
slide 058 The “done” part shows the end of our loop, and instructs the loop to do the next file in our pdb list if one exists.
slide 059 The semicolons separate each part of this command. But how does it pick up and use each separate pdb file?
slide 060 This *.pdb would generate a list of the 6 pdb files, so we would expect this loop to run 6 times.
slide 061 This “file” is a variable which we can use to reference each file within the loop.
slide 062 e.g. if $file is “cubane.pdb”, this becomes cubane.pdb.zip.
slide 063 So here we are just using the “file” variable to specify the file we wish to put in the zip file.
slide 064 The zip command thus runs 6 times, once for each pdb file, generating a new zip file for each of them.
slide 065 So, with a hundred such files, this would be much more efficient than running zip individually each time.
slide 066 If we look at just the zip files, we can see the new ones that have just been created.
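The loop walked through above can be sketched as follows, again assuming zip is installed. Only cubane.pdb is named in the lecture; the other file names here are made up, and three files stand in for the six on the slides. The variable is quoted so that file names containing spaces are handled correctly.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Stand-in files; only cubane.pdb is named in the lecture, the rest are made up.
for name in cubane ethane methane; do echo "placeholder" > "$name.pdb"; done

# One zip run per file. The shell itself expands *.pdb into the list of
# files (no "ls" needed), and "$file" is quoted so odd names survive.
for file in *.pdb; do zip "$file.zip" "$file"; done

# List just the newly created archives.
ls *.zip
```

With these stand-ins the loop body runs three times and produces cubane.pdb.zip, ethane.pdb.zip and methane.pdb.zip.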
slide 067 Let’s look at a slightly different problem… What if we wanted to output the first line of each pdb file?
slide 068 This isn’t really what we want…
slide 069 Each first line in this list is prefixed with the filename of where it came from.
slide 070 So when run on multiple files, the head command inserts a filename header before the output for each file.
slide 071 Perhaps we only want the actual first lines. In which case, we really want to miss out the filename prefixes you get from using the head command over multiple files.
slide 072 The good news is that we can use loops to help here.
slide 073 Using a loop, we can run the head command separately on each file.
slide 074 We can do this in a very similar way to how we used a loop for zipping multiple files separately.
slide 075 We get these results. So how does this fit in with what we have learned already with pipes and filters?
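To make the difference concrete, here is a small sketch comparing head on several files at once with head inside a loop. The file contents are invented stand-ins, not real PDB records, and first_lines is a hypothetical variable name used only to hold the result.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'COMPND      CUBANE\nATOM 1\n' > cubane.pdb
printf 'COMPND      ETHANE\nATOM 1\n' > ethane.pdb

# With several files, head labels each chunk with a "==> name <==" header.
head -n 1 *.pdb

# Running head once per file avoids the headers: only the first lines remain.
first_lines=$(for file in *.pdb; do head -n 1 "$file"; done)
echo "$first_lines"
```

Because each head invocation inside the loop sees only one file, it has no reason to print a header.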
slide 076 We can take this further… what if we wanted this list sorted in reverse afterwards?
slide 077 We simply pipe the output from our loop to a command. Although not necessary, we can surround the loop part with parentheses for clarity, so we can clearly see where the pipe is applied.
slide 078 So we just add the “sort” command to the end of this via a pipe. The “-r” argument just means “sort the list in reverse”.
slide 079 And we get our previous list sorted in reverse. So we can happily use this technique within the pipes and filters model we’ve already learned!
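A sketch of the loop feeding a pipe, under the same assumptions as before (invented file contents in a scratch directory). The output file sorted-lines is a hypothetical name, added here only to show that redirection composes with the loop and the pipe.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'COMPND      CUBANE\n' > cubane.pdb
printf 'COMPND      ETHANE\n' > ethane.pdb
printf 'COMPND      METHANE\n' > methane.pdb

# The loop's combined output is a single stream, so it can feed a pipe.
# The parentheses are optional; they run the loop in a subshell and make
# it easy to see where the pipe is applied.
(for file in *.pdb; do head -n 1 "$file"; done) | sort -r > sorted-lines
cat sorted-lines
```

Here the METHANE line comes out first and the CUBANE line last, the reverse of the order in which the loop visited the files.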
slide 080 In summary, what new things have we looked at in part 2? We’ve used the zip command to create a zip file, and used loops to repeat a command many times. This is only the beginning of what you can do with loops. As an exercise, why not take some time to find out what else you can do with them?

  1. @IsaacG2
    August 4th, 2011 at 00:41 | #1

    1. Towards the end you’ve got “for file in ls *.pdb ; do …”
    The ls is a typo

    2. Please, please, pleeeeeease teach people to quote variables especially when with a “for file in *”
