The Shell

January 12th, 2012

The slides for this session are linked below; a text version is being posted here one piece at a time. Until that’s done, just download the slides: Slides for “Intro to Shell”, Nov 2011 Bootcamp

Our Problem Description

A cochlear implant (CI) is a small electronic device that is surgically implanted in the inner ear to give deaf people a sense of hearing. Fewer than a quarter of a million people have them, but there is still no widely-accepted benchmark to measure their effectiveness. In order to establish a baseline for such a benchmark, our supervisor got teenagers with CIs to listen to audio files on their computer and report:

  •  the quietest sound they could hear
  •  the lowest and highest tones they could hear
  •  the narrowest range of frequencies they could discriminate

 

To participate, subjects came to our laboratory, where one of our lab techs played an audio sample and recorded their data: when they first heard the sound, or first heard a difference in the sound. Each set of test results was written out to a text file, one set per file.

Each participant has a unique subject ID, and a made-up subject name.

Each experiment has a unique experiment ID.

Our job is to do some preliminary analysis on that data. We need to:

  • identify and label files that are missing data (for example, because the participant didn’t complete all three tests);
  • normalize the data (the first version of the software reported a score for each test in the range 0-9, but it was later “fixed” to report scores in the range 1-10);
  • put the data into a database to make subsequent analysis easier;  and
  • calculate a few simple statistics, such as average scores for each test by CI model and participant’s age and sex.

The experiment has collected 351 files so far, and we expect to get another 30-40 per week for the next couple of months, so we’d really like to automate the four steps above.

The Data

But the data is a bit of a mess.   The directory tree looks like this:

There are multiple directories (including multiple ‘data’ directories), inconsistent naming of the data files, and extraneous NOTES files.

Our job is going to be to get all the data files, each named with a .txt extension and with no extraneous files, into one directory (called ‘cleandata’).

Using the Shell – Productivity

To do this, we’re going to use the terminal and a shell, not a GUI. The terminal and the shell are examples of command line interfaces, or CLIs (like working with ipython or idle), where commands are entered as lines of text, as opposed to graphical user interfaces, or GUIs, like Mac OS X’s Finder or the Windows interface.

 

Now, the CLI has a rich and noble history going back to the 60s and mechanical teletypes; GUIs are much newer (coming out of Xerox PARC, and made popular by Apple).  Why on earth would one use a command line interface to manipulate files, when one can just click-and-drag in a GUI?

GUIs are very good at operating an existing system. You can click on existing controls, and use existing functionality. A good GUI can be enormously productive at operating existing software packages. A CLI is, for better or worse, a blank canvas, and it’s much harder to figure out where to begin when you start.

But that blank canvas is a blessing as well as a curse.   The command line is much better for creating/expressing new things.   Our programming was done in something that looked a lot more like a CLI than like a GUI for a good reason;  general-purpose programming in a GUI is really hard.

Shell – Reproducibility

One important aspect of command lines is reproducibility. Command lines can be cryptic to learn, but once you have the command, it’s easy to document (you can cut and paste the exact line somewhere) and you can use it again tomorrow. So you can communicate the command to your future self, and you can communicate it to others exactly. Have you ever had a conversation like this one when trying to explain to friends or parents how to use their computer?

“Click on Filters, then ‘Recent’”
“Then drag the green arrow down to the big grey box..”
“… No, the other one..”
“… Not there!”
“Ok, let’s start again…”

Shell – Automatability

Once you have crafted a command that does what you want, the shell has powerful features for automating the use of that command. Loops, recursive application, wildcards: there are ways to take that one command (say, to rename a file) and apply it to thousands of files at the same time.
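As a rough sketch of what that automation can look like in bash, the loop below applies a single mv command to every file matching a wildcard. The scratch directory and the file names (subject1, subject2, subject3) are made up for illustration; they are not the real experiment files.

```shell
# Work in a throwaway directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Create three empty, made-up data files with no extension.
touch subject1 subject2 subject3

# The wildcard subject* expands to every matching file name;
# the loop then runs the same mv command once per file.
for f in subject*
do
    mv "$f" "$f.txt"
done

# List the result: each file now ends in .txt
ls
```

The same loop would work unchanged whether the wildcard matched three files or three thousand.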

Shell – Reproducibility + Automatability = Productivity

Because command line interfaces are reproducible, you can stop wasting time re-discovering how to do things. Because they’re automatable, you can do the same thing hundreds of times easily without wasting time. In short, you can spend less time doing file manipulations and more time doing research.

For the task in hand, you could easily click and drag all the files from all the directories into a new directory, and then (more tediously, but still) rename each file to have a .txt extension. But two months later, when there were another 300 files to process, you’d have to do the same thing again, with nothing gained from having done it earlier. With the shell, we’ll learn how to do this in a few commands, and to save these commands in a script, which could be rerun months later and still work.
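To give a feel for where this is heading, here is one possible sketch of such a script. It is not necessarily the set of commands this lesson will build up; it uses find, basename and cp, and it runs on a tiny made-up tree (the rawdata directory, the alice and bob subdirectories and the file names are all hypothetical stand-ins for the real data).

```shell
# Build a small, made-up messy tree to practise on (hypothetical names).
work=$(mktemp -d)
mkdir -p "$work/rawdata/alice/data" "$work/rawdata/bob"
touch "$work/rawdata/alice/data/0213" \
      "$work/rawdata/bob/0214.txt" \
      "$work/rawdata/bob/NOTES"

# Gather every data file into one directory, adding .txt where it is
# missing and skipping the NOTES files.
mkdir "$work/cleandata"
find "$work/rawdata" -type f ! -name NOTES | while read -r path
do
    name=$(basename "$path" .txt)    # strip .txt if it is already there
    cp "$path" "$work/cleandata/$name.txt"
done

ls "$work/cleandata"
```

Note that this sketch assumes no two data files share a name; if they did, the later copy would overwrite the earlier one.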

Yes, there’s a learning curve in mastering the shell – but it’s an investment in future productivity.

Getting started: Hello World

So let’s start by opening a terminal. On a Mac, the Terminal application is under /Applications/Utilities; copy that to the dock and launch it. On Linux, there are a variety of ways to launch a terminal. On Windows, we recommend installing Cygwin, which provides a Linux-like environment on your Windows machine (we promise, it’s painless); then just click the Cygwin icon that is installed on your desktop.

Once you have a terminal open, you’re at the shell. The shell is a program which mediates between you, the user, and your computer; it gives you fairly low-level access to the resources of your computer: the file system, network access, and the like.

So let’s get started.  Try entering the command hello="world" (no spaces).   If you get an error like “command not found”, then your default shell is probably a shell called tcsh.   There’s no shame in that, but the usual shell people use nowadays is called bash, so try starting a bash shell by typing “bash” and pressing enter.   Then try the command again.

Once this works, try the following echo commands (on my machine, the prompt upon setup is segfault:~ ljdursi$; yours will likely be different):

segfault:~ ljdursi$ hello="world"

segfault:~ ljdursi$ echo Hello, world
Hello, world

segfault:~ ljdursi$ echo Hello, $hello
Hello, world

echo just echoes whatever arguments it is given. If one of the arguments is a variable that we’ve set (eg, $hello), the shell replaces it with the variable’s value before echo prints it.

Echo does nothing but echo what it is given; if you had a file named test.py in your directory, echo test.py does:

segfault:~ ljdursi$ echo test.py
test.py

nothing special; it just outputs the string test.py.
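A small sketch of how the shell treats $hello depending on quoting may help here. Quoting is not covered above, so take this as a side note: inside double quotes the variable is still replaced by its value, while inside single quotes it is left alone.

```shell
hello="world"

# Double quotes keep the text together but still expand $hello.
expanded=$(echo "Hello, $hello")

# Single quotes switch expansion off, so $hello is printed literally.
literal=$(echo 'Hello, $hello')

echo "$expanded"   # Hello, world
echo "$literal"    # Hello, $hello
```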

Getting started: Moving around the Filesystem

Now let’s learn how to start moving around amongst our files and directories.  This is easy to do in a GUI (just click on folders), and seemingly harder here, but you get very fast at it in the shell…

Let’s start poking around. Type pwd. This prints your current “working” directory: where you are in the file structure. Then type ls, which will list the files in that directory.

On my system, this looks like:

segfault:~ ljdursi$ pwd
/Users/ljdursi

segfault:~ ljdursi$ ls
Applications     Projects          gol.py
Classes          Public            intro-gpu
Codes            Shared            latex
Desktop          Sites             maxout.gnuplot
Documents        Software          octave
Downloads        Talks             papers
Dropbox          addresses.txt     personal  
Library          bin               svn
Movies           configurationdata tmp   
Music            debruijn.sc        

segfault:~ ljdursi$

(Note that my home directory is a little cluttered, which is a personal failing.)

Directories may be more familiar to you as folders; they’re often called that because they’re represented that way on your computer’s desktop.   Directories can contain files, and other directories.

When you launch a shell, it starts in your home directory; by convention, that will usually be /Users/[username] or /home/[username] or something similar. Notionally, it’s the top directory of all of your stuff (as opposed to system stuff). (Note for Cygwin users: Cygwin will put you in an empty home directory. To see something more exciting, type cd /cygdrive/c.)
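If you want to check where your own home directory is, the shell keeps its location in a variable called HOME, and a bare ~ is shorthand for the same place. A quick sketch (the output will differ from machine to machine):

```shell
# $HOME is a variable the shell sets to your home directory.
echo "$HOME"

# An unquoted ~ at the start of a word is shorthand for the same directory.
tilde=~
echo "$tilde"
```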

So that directory listing wasn’t all that useful; we’d like to know a little more information about those entries. We would like to know which entries are directories, which are plain files, etc. ls -F labels directories with ‘/’, executables with ‘*’, etc. So give that a try:

segfault:~ ljdursi$ ls -F
Applications/    Projects/         gol.py*
Classes/         Public/           intro-gpu/
Codes/           Shared/           latex/
Desktop/         Sites/            maxout.gnuplot
Documents/       Software/         octave/
Downloads/       Talks/            papers/
Dropbox/         addresses.txt     personal/ 
Library/         bin/              svn/
Movies/          configurationdata tmp/   
Music/           debruijn.sc        

segfault:~ ljdursi$
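If your own home directory doesn’t happen to contain one of each kind of entry, here is a sketch that builds a scratch directory with a directory, a plain file and an executable, so you can see all three markers. The names results, notes.txt and analyze.sh are made up for the example.

```shell
# Work in a throwaway directory.
demo=$(mktemp -d)
cd "$demo"

mkdir results              # a directory
touch notes.txt            # a plain file
touch analyze.sh
chmod +x analyze.sh        # mark this file as executable

# Prints, one entry per line here:
#   analyze.sh*
#   notes.txt
#   results/
listing=$(ls -F)
echo "$listing"
```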

So far, we’re just looking at the same files in different ways.  That’s sort of boring; let’s move around a bit.  Let’s move into one of the other directories we see.  The cd command, for change directory, does that for us; so type cd followed by one of the directory names, and then do an ls -F.   In my case, going into my Desktop directory:

segfault:~ ljdursi$ cd Desktop

segfault:Desktop ljdursi$ ls -F
40TB.key	cubicAdvection.png
Dursi-HPC.pages cubicAdvection.py
Dursi-HPC.pdf   cubicHeat.png
IntroGPGPU.key	cubicHeat.py
LFBP/
...

segfault:Desktop ljdursi$ cd

segfault:~ ljdursi$ pwd
/Users/ljdursi

cd without arguments takes you back to your home directory.
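Here is the same round trip as a sketch you can run anywhere, using a scratch directory instead of my Desktop. The directory name results is made up for the example.

```shell
# Make a throwaway directory with one subdirectory in it.
top=$(mktemp -d)
mkdir "$top/results"

cd "$top"          # move into the scratch directory
cd results         # then into the directory inside it
here=$(pwd)
echo "$here"       # a path ending in /results

cd                 # no argument: back to the home directory
pwd
```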

So far

So there are already a few things we can observe.

echo             prints output
pwd              prints current directory
cd [directory]   Change Directory
cd               Change Directory to home
ls               directory LiSting
ls -F            LiSting with Filetypes

The commands above are super short and terse; they are designed to be fast and easy to use, but they’re pretty cryptic to learn. This is frustrating when you are starting out, but once you’re used to them, they let you do things very quickly.

Another thing we can point out is that some commands (eg, ls) have options which modify their default behaviour. They typically take the form of one dash followed by a single letter (eg, -F), or two dashes and a complete word (eg, --help). How do we know what the options are?

Manual pages (man)

Most programs have a manual page describing their use and options. Manual pages are great for finding out more about a command you already use, but less good for learning what a command actually does in the first place.

If you type man [cmdname], you will be able to page through a manual page (press space, or enter, to move through, and q to quit) describing the use of the command in gory detail:

segfault:~ ljdursi$ man ls

LS(1)                     BSD General Commands Manual

NAME
     ls -- list directory contents

SYNOPSIS
     ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1]
        [file ...]

DESCRIPTION
     For each operand that names a file of a type
     other than directory, ls displays its name as
     well as any requested, associated informa-
     tion.  For each operand that names a file of
     type directory, ls displays the names of
...

Many programs have gazillions of options. Don’t be intimidated! No human being who has ever lived has known all the options to ls at the same time. Over time you’ll find a few options that are useful to you.
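For instance, here is a sketch using two ls options, -F from above and -a (which also lists entries whose names start with a dot), on a made-up scratch directory. It also shows that single-letter options can usually be bundled behind one dash.

```shell
# Work in a throwaway directory with a hidden file in it.
demo=$(mktemp -d)
cd "$demo"
mkdir results
touch .hidden notes.txt

# Two options given separately...
separate=$(ls -a -F)

# ...and the same two options bundled behind a single dash.
combined=$(ls -aF)

echo "$combined"
```

Both forms give the same listing, so which one you type is a matter of taste.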

 
