Up: The Shell

Exercises

Let’s try out your new shell skills on some real data.

The following file is a small sample (1%) of a very large text file containing human genetics data. Specifically, it describes genetic variation in three African individuals sequenced as part of the 1000 Genomes Project. Please download it from the following url, open up a shell, and go into the directory where it was saved:

http://svn.software-carpentry.org/swc/data/1000gp.vcf

Exercise Part 1 (setup)

  • If you had forgotten where you downloaded the file, how would you locate the path of all files with that name on the computer (using the shell)?
    (Click for a hint)

    $ man find

    (Click for our answer)

    $ find / -name "1000gp.vcf"
  • It’s usually a good idea to use an empty directory as a workspace so that other files don’t get in the way (or accidentally get overwritten or deleted). Create a new subdirectory directory named “sandbox”, move our data file there, and make the directory your current working directory (sandbox should be the last part of the path given when you type pwd.

    (Click for our answer)

    $ mkdir sandbox
    $ mv /home/orion/Downloads/1000gp.vcf sandbox
    $ cd sandbox

Exercise Part 2 (analysis)

  • The data file you downloaded is a line-based text file. The “vcf” extension lets us know that it’s in a specific text format, namely “Variant Call Format”. The file starts with a bunch of comment lines (they start with “#” or “##”), and then a large number of data lines. The human genome can be thought of as an encyclopedia, where each chromosome is a volume. Each volume is just a long string of characters, but rather than the english alphabet, the genome uses just the characters “A”, “C”, “G”, and “T”. This VCF file lists the differences between the three African individuals and a standard “individual” called the reference (actually based upon a few different people). Each line in the file corresponds to a difference. The line tells us the position of the difference (chromosome and position), the genetic sequence in the reference, and the corresponding sequence in each of the three Africans. Research is ongoing to understand the full effects of these genetic differences; some cause diseases such as Tay-Sachs and Hemophilia, while others determine your blood type and eye color.

    Before we start processing the file, let’s get a high-level view of the file that we’re about to work with.

    What’s the file size (in kilo-bytes), and how many lines are in the file?
    (Click for a hint about the file size)

    There’s an option to ls that will print the file sizes in a more human-friendly format.

    (Click for a hint about the number of lines)

    $ man wc

    (Click for our answers)

    We should get a file size around 3.6 MB with:
    $ ls -lh 1000gp.vcf
    Alternatively, the command du can be used to achieve a similar result:
    $ du -h 1000gp.vcf

    We find there are 45034 lines with:
    $ wc -l 1000gp.vcf

  • Because this file is so large, you’re going to almost always want to pipe (“|”) the result of any command to less (a simple text viewer, type ‘q’ to exit) or head (to print the first 10 lines) so that you don’t accidentally print 45,000 lines to the screen.

    Let’s start by printing the first 5 lines to see what it looks like.
    (Click for our answer)

    $ head -5 1000gp.vcf
  • That isn’t very interesting; it’s just a bunch of the comments at the beginning of the file (they all start with “#”)! Print the first 20 lines to see more of the file.
    (Click for our answer)

    $ head -20 1000gp.vcf
  • Okay, so now we can see the basic structure of the file. A few comment lines that start with “#” or “##” and then a bunch of lines of data that contain all the data and are pretty hard to understand. Each line of data contains the same number of fields, and all fields are separated with TABs. These fields are:

    1. the chromosome (which volume the difference is in)
    2. the position (which character in the volume the difference starts at)
    3. the ID of the difference
    4. the sequence in the reference human(s)

    The rest of the columns tell us, in a rather complex way, a bunch of additional information about that position, including: the predicted sequence for each of the three Africans and how confident the scientists are that these sequences are correct.

    To start analyzing the actual data, we have to remove the header. How can we print the first 10 non-header lines (those that don’t start with a “#”)?
    (Click for a hint)

    $ man grep

    (Click for another hint)

    You can use a pipe (“|”) to connect the output of grep to the input of head.

    (Click for yet another hint)

    In grep regular expressions, the carat ‘^’ character matches the start of a line and the dollar sign ‘$’ matches the end of a line. Thus, the following will print all non-blank lines from file:
    $ grep -v "^$" file

    (Click for our answer)

    $ grep -v "^#" 1000gp.vcf | head

    Why are neither of these correct?
    $ grep -v "#" 1000gp.vcf | head
    $ grep -v "^##" 1000gp.vcf | head

  • How many lines of data are in the file (rather than counting the number of header lines and subtracting, try just counting the number of data lines)?
    (Click for a hint)

    Instead of piping to head, try piping to wc.

    (Click for our answer)

    $ grep -v "^#" 1000gp.vcf | wc -l
    should print 45024
  • Where these differences are located can be important. If all the differences between two encyclopedias were in just the first volume, that would be interesting. The first field of each data line is the name of the chromosome that the difference occurs on (which volume we’re on). Print the first 10 chromosomes, one per line.
    (Click for a hint)

    You can extract a column from a tab-delimited text file using the cut command.

    (Click for another hint)

    Use grep to print only non-comment lines, and cut to extract the chromosome column.

    (Click for our answer)

    $ grep -v "^#" 1000gp.vcf | cut -f 1 | head
  • As you should have observed, the first 10 lines are on numbered chromosomes. Every normal cell in your body has 23 pairs of chromosomes, 22 pairs of “autosomal” chromosomes (these are numbered 1-22) and a pair of sex chromosomes (two Xs if you’re female, an X and a Y if you’re male). If you’ve heard of the genetics company 23andMe, the 23 refers to these 23 pairs of chromosomes.

    Let’s look at which chromosomes these variations are on. Print a list of the chromosomes that are in the file (each chromosome name should only be printed once, so you should only print 23 lines).
    (Click for a hint)

    You need to remove all the duplicate lines from your previous answer.

    (Click for a hint)

    sort has an option that should make this easier.

    (Click for our answer)

    $ grep -v "^#" 1000gp.vcf | cut -f 1 | sort -u
  • Rather than using sort to print unique results, a common pipeline is to first sort and then pipe to another UNIX command, uniq. The uniq command takes sorted input and prints only unique lines, but it provides more flexibility than just using sort by itself. Keep in mind, if the input isn’t sorted, uniq won’t work properly.

    Using sort and uniq, print the number of times each chromosome occurs in the file.
    (Click for a hint)

    $ man uniq

    (Click for another hint)

    Instead of using sort to remove duplicates, just use it to sort and pipe the result to uniq.

    (Click for our answer)

    $ grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c
  • Add to your previous solution to list the chromosomes from most frequently observed to least frequently observed.
    (Click for a hint)

    $ man sort

    (Click for another hint)

    Make sure you’re sorting in descending order. By default, sort sorts in ascending order.

    (Click for our answer)

    $ grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c | sort -n -r
    should output the following:

    3721 2
    3387 1
    3224 4
    3219 3
    2894 5
    2860 6
    2527 8
    2525 7
    2203 10
    2166 11
    2032 12
    1865 9
    1656 13
    1409 14
    1362 16
    1304 X
    1275 18
    1265 15
    1097 17
     993 20
     814 19
     661 21
     565 22
    
  • The autosomal chromosomes (1-22) are named according to their size. The largest of them is chromosome 1, while the smallest is chromosome 22. Does it look like differences occur relatively randomly across the genome, or are some chromosomes more different than you’d expect at random (very roughly taking their sizes into account)?
    (Click for a hint)

    It’s worth noting that the chromosomes were numbered by the sizes of the actual molecules, not how much of them had been sequenced.

    Wikipedia has a nice table of chromosome sizes and how much of each has been sequenced (and you can sort it):
    http://en.wikipedia.org/wiki/Human_chromosome#Human_chromosomes

    Notice anything?

    (Click for our hypothesis)

    Since variation can only be found in the known sequence, the order you printed corresponds closely to ordering by the number of bases sequenced (rather than the total number of bases).

    Given this, it seems like differences occur relatively randomly across the genome. We see more differences on longer chromosomes, fewer on shorter, without any striking outliers.

  • This is great, but biologists might also like to see the chromosomes ordered by their number (not dictionary order), since different chromosomes have different attributes and this ordering allows them to find a specific chromosome more easily.
    (Click for another hint)

    A lot of the power of sort comes from the fact that you can specify which fields to sort on, and the order in which to sort them. In this case you only need to sort on one field.

    (Click for our answer)

    $ grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c | sort -k 2n

Exercise Part 3 (scripts and svn)

  • Wonderful! Now we have a (long) command for printing chromosome statistics from our 1000gp.vcf file. Using nano, create a new file, “chrom_stats.sh”, with just your answer to the previous question in it.
    (Click for our answer)

    Type the following to open a new file:
    $ nano chrom_stats.sh
    Type in the command. Type ^o to save and ^x (where ^ means the control key).
  • Just to be illustrate the flexibility of the shell, try creating the same file directly from the shell (without a text editor). Once you do, you can use cat to make sure the contents of the file are exactly what you expect.
    (Click for a hint)

    You can use echo to print something and > to redirect to a file.

    (Click for another hint)

    Since our long command has double-quotes in it, you either need to use single-quotes or escape these with back-slashes.

    (Click for our answer)

    $ echo 'grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c | sort -k 2n' > chrom_stats.sh
    $ cat chrom_stats.sh
  • Now, execute your new script to print the chromosome statistics.
    (Click for a hint)

    You may have to change the permissions to allow you to execute it.

    (Click for a hint)

    It’s good form to only make permissions as permissive as necessary. So, rather than allow everyone to execute the file, it is better to just allow you to execute it.

    (Click for our answer)

    $ chmod u+x chrom_stats.sh
    $ ./chrom_stats.sh

    Note that it is u+x instead of just +x or a+x. This only adds the ability for the owner to execute it, whereas the other two options would allow anyone to execute it.

  • We’d like to be able to use this script in the future with arbitrary VCF files, instead of just our 1000gp.vcf file. Edit the script so that it takes VCF-formatted text input on stdin and prints out chromosome statistics on stdout. This is simpler than you might think.
    (Click for a hint)

    If grep isn’t given an input file, it will read from stdin.

    (Click for our answer)

    Change
    grep -v "^#" 1000gp.vcf | ...
    to
    grep -v "^#" | ...

    * Since this is in a file instead of the shell prompt, we aren’t showing the “$” at the beginning of the line.

  • Now that we have a script that reads from stdin and prints to stdout, how do we run it on the 1000gp.vcf file to get the same output as before?
    (Click for a hint)

    The cat command is used to print files to stdout.

    (Click for another hint)

    You can pipe the output of cat directly into our script.

    (Click for yet another hint)

    Just like before, in order to tell the shell that the chrom_stats.sh file we want to execute is the one in our current directory, we need to use ./chrom_stats.sh.

    (Click for our answer)

    $ cat 1000gp.vcf | ./chrom_stats.sh
  • Finally, add a copy of this file to your folder in the class SVN repository.
    (Click for our answer)

    1. cp chrom_stats.sh /path/to/repo/participants/user/
    2. Add the file to subversion version control
    3. Commit your changes
  • Fin. Comments, questions, and suggestions are encouraged and appreciated.
    Thanks to Tommy Guy, Jon Pipitone, Greg Wilson, and Elango Cheran for their help with these exercises.
  1. January 19th, 2011 at 02:40 | #1

    “Add to your previous solution to list the chromosomes from most frequently observed to least frequently observed.”

    The -n option appears to be necessary, at least on Ubuntu 10.04.

    • January 19th, 2011 at 03:27 | #2

      Thanks for the heads-up. It worked on all the systems I tested it on, but portability is important. I’ll fix the note. Out of curiosity, when you did uniq -c, were the counts right-aligned?

  2. January 19th, 2011 at 04:38 | #3

    Yes, the counts were right-aligned.

  3. January 19th, 2011 at 04:53 | #4

    @Ethan White
    Hmm, but sort didn’t sort them correctly? Is Ubuntu’s sort not dictionary-ordered? What is your output for the uniq -c | sort -r command, without -n? Thanks.

  4. January 19th, 2011 at 13:30 | #5

    The sort order for unic -c | sort -r is:

    993 20
    814 19
    661 21
    565 22
    3721 2
    3387 1
    3224 4

    with the rest of the sort as expected.

  5. January 19th, 2011 at 13:34 | #6

    oops. The spaces got stripped by the code tags (should I use for this sort of thing?). The counts are right aligned and the chromosomes are left aligned.

  6. January 19th, 2011 at 13:42 | #7

    And since that’s still confusing since all of the short numbers also have the largest first digit, here’s something equivalent (I’m going to try pre tags this time):

    6 5
    6 4
    4 6
    2 8
    20 1
    18 3
    1 7
    15 2

  7. January 19th, 2011 at 13:44 | #8

    … which apparently still strips the leading white space (the counts are once again right aligned), so it looks like the sort is occurring on the first number first regardless of the alignment.

  8. January 19th, 2011 at 17:05 | #9

    Okay, thanks for the information. That’s very interesting, and good to know. I think wrapping the whole block in a pre tag should do it. You can wrap that in a code tag to get the monospace font too. Test example:

     565 22
    3721 2
    

  9. January 20th, 2011 at 21:34 | #10

    Why use “cat” instead of “<" ?

  10. January 20th, 2011 at 21:41 | #11

    @Nick Barnes
    Good point. It’s mostly a matter of preference, though there seem to be a number of reasons to use “<” instead. I usually use cat because it makes it clearer when reading a long piped command and it makes it easier to add additional commands between the cat and the receiving command.

  11. david babalola
    January 21st, 2011 at 01:37 | #12

    what could be wrong if shell prompt (dollar sign )does not come up on the next line after pressing ENTER after a command line?

  12. January 21st, 2011 at 01:39 | #13

    @david babalola
    It’s likely that you have an unpaired parenthesis, brace, or quote. Type CTRL+C to get back to the shell.

  13. January 21st, 2011 at 02:39 | #14

    @david babalola
    Another alternative is that you have run a program that is trying to read from stdin. In that case, CTRL+C or CTRL+D should stop it.

  14. david babalola
    January 21st, 2011 at 04:02 | #15

    hi,
    i could not save my work to the repository. its telling me “no such file or directory” what do i need to do?

  15. January 21st, 2011 at 04:23 | #16

    @david babalola
    Hey David,
    Sorry you’re having trouble. Did you copy the file into your working copy of the repository? Did you add the file to version control?

    What directory are you in (pwd), what are the files in that directory (ls), and what is your command history (history | tail)?

  16. david babalola
    January 21st, 2011 at 05:04 | #17

    i think the trouble started since i introduced command ‘nano’ . The response it gave was ‘bach:nano: command not found’

  17. david babalola
    January 21st, 2011 at 05:14 | #18

    current working directory is ‘home/100431522/sandbox
    files are ’1000gp.vcf,chrom_stats.sh,sandbox
    command history, i stop at ‘cp chrom_stats.sh/path………

  18. January 21st, 2011 at 05:16 | #19

    @david babalola
    Okay, it’s possible that you don’t have nano on your machine. In that case, you can try pico, which is similar, or just continue on with the exercises. The next question after the nano one shows you how you can create the file without using a text editor at all (echo "command" > file).

  19. January 21st, 2011 at 05:19 | #20

    @david babalola
    Where is your local copy of the SVN repository? You should copy chrom_stats.sh into your participants directory inside of that. Then you should cd to that directory, add it with svn add, and then commit your changes.

  20. Jack
    January 21st, 2011 at 20:40 | #21

    Exercise 2 has a mistake. The question asked:

    Print the first 10 chromosomes, one per line.

    The answer givens says:

    grep -v “^#” 1000gp.vcf | cut -f 1 | head

    but should be:

    grep -v “^#” 1000gp.vcf | cut -f 1 | head -10

    Right?

  21. Jack
    January 21st, 2011 at 20:43 | #22

    @Jack
    Answering my own question – I see that the default for head “is” the first ten lines. Doh!

  22. January 21st, 2011 at 22:26 | #23

    @Jack
    No worries. That was one of the points of putting that in there. :)

  23. Brynjar
    January 23rd, 2011 at 22:31 | #24

    These were fun exercises. I have used Linux and Mac Os X for many years and I am comfortable around the shell but it was nice to see some good tips on what is possible to do with the shell. I would never have thought that analyzing Bio datasets from shell could be so simple.

  24. luis
    January 26th, 2011 at 15:30 | #25

    @Orion Buske
    If it is the first case, the unmatched parenthesis/quote/etc, you’ll probably see a “>” prompt instead of a “$” prompt.

    If you see no prompt at all, it’s likely the second case: something is reading from stdin. You can interrupt it (ctrl-c) or send it an end of input character (ctrl-d).

    [Edit: this was meant to be a reply to @david babalola, comment #12]

  25. January 26th, 2011 at 15:40 | #26

    @luis
    Indeed. I guess I forgot about the “>”. Thanks for clarifying that point.

  26. January 26th, 2011 at 15:41 | #27

    @Brynjar
    Thanks for your feedback! I’m glad you enjoyed them. It’s always impressive how much can be done by combining a few shell commands together. Once you include a few sed/awk/perl one-liners as well, you can do almost anything with line-based text files.

  27. Arbora Resulaj
    January 31st, 2011 at 20:41 | #28

    Is there a good online reference for all the different shell commands?

  28. Arelia
    February 2nd, 2011 at 00:17 | #29

    What happens if you “commit” your chrom_stats.sh file to the repository without “adding” it first?

  29. February 2nd, 2011 at 00:17 | #30

    @Arbora Resulaj
    There isn’t one that I use, although a quick google found this nice summary:
    http://ss64.com/bash/

  30. February 2nd, 2011 at 15:53 | #31

    @Arelia
    You have to ‘svn add’ the file in order to have svn track it. If you do a commit without it, svn will just ignore the file. It will stay in your working copy, but it won’t be committed to the repository.

  31. Loïc Séguin
    March 3rd, 2012 at 03:16 | #32

    For the last exercise, it turns out there is a better way of achieving the same thing (and more!). In a script, $1 refers to the first command line argument of the script. Thus the file can contain

    grep -v ^# $1 | cut -f 1 | sort -n | uniq -c

    and it will work with a file name as in

    $ ./chrom_stats.sh 1000gp.vcf

    If no file name is given, then $1 has no value and the script will read from standard input.

  32. March 4th, 2012 at 06:11 | #33

    @Loïc Séguin
    This is a good point!

    As an aside, I put `set -o pipefail`, `set -o errexit`, and `set -o nounset` at the top of all of my shell scripts to prevent runaway errors and catch obvious typos. The last of these makes it less straightforward to do what you are suggesting. :)

  1. January 18th, 2011 at 11:48 | #1