
Finding Things

January 28th, 2012
slide 01 Hello, and welcome to the sixth episode of the Software Carpentry lecture on the Unix shell. This short episode will show you how to find things in files, and how to find files themselves.
slide 02 We’re looking at how to interact with a computer using a command-line shell.
slide 03 Listening to how people talk about search, you can often guess their age.
slide 04 Just as young people use “Google” as a verb…
slide 05 …crusty old Unix programmers use the word “grep”.
slide 06 “grep” is a contraction of “global/regular expression/print”, which was a common sequence of operations in early Unix text editors.
slide 07 What the grep program does is find and print lines in files that match a pattern.
slide 08 Here’s the file we’ll use for our examples. It contains three computer haikus taken from a competition that Salon magazine ran in 1998.
slide 09 Let’s run the command grep not haiku.txt.
slide 10 Here, “not” is the pattern we’re searching for.
slide 11 It’s a pretty simple pattern: every alphanumeric character matches against itself.
slide 12 After the pattern comes the name or names of the files we’re searching in.
slide 13 As you can see, the output is the three lines in the file that contain the letters “not”.
slide 14 Let’s try a different pattern: “day”. This time, the output is lines containing the words “Yesterday” and “Today”, which both have the letters “day”.
slide 15 If we give grep the -w flag, it restricts matches to word boundaries, so that only lines with the word “day” will be printed, not lines with “Today” or “daytime”. In this case, there aren’t any, so grep’s output is empty.
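The three searches so far can be tried with a sketch like the one below. The contents of haiku.txt are not reproduced in this transcript, so the file created here is a stand-in that is assumed to match the one on the slides; the scratch directory is likewise only for illustration.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)"

# Stand-in for haiku.txt (contents are an assumption, not taken from the transcript).
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Lines containing the letters "not".
grep not haiku.txt

# Lines containing the letters "day" ("Yesterday" and "Today" both qualify).
grep day haiku.txt

# Whole-word matches only; nothing matches, so grep prints nothing
# and the fallback message is shown instead.
grep -w day haiku.txt || echo "(no whole-word matches for day)"
```

grep exits with a non-zero status when nothing matches, which is why the last command has a fallback.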
slide 16 Another useful option is -n, which numbers the lines that match.
slide 17 Here, we can see that lines 5, 9, and 10 in the file contain the word “it” (or a word that contains “it”).
slide 18 As with other Unix commands, we can combine flags…
slide 19 …to get only whole-word matches, with line numbers.
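As a rough illustration of -n on its own and combined with -w, here is a sketch that again uses a stand-in haiku.txt (its contents are an assumption, chosen to agree with the line numbers quoted above).

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)"

# Stand-in for haiku.txt (contents are an assumption).
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# Number every matching line; "With" on line 5 counts because it contains "it".
grep -n it haiku.txt

# Combine flags: whole-word matches only, still with line numbers.
grep -n -w it haiku.txt
```

The two flags can also be written together as -nw.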
slide 20 Here’s another example.
slide 21 -i makes matching case-insensitive…
slide 22 …while -v inverts the match, so that it only prints lines that don’t match the pattern.
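The transcript does not say which pattern the slide uses, so the sketch below picks “the” purely for illustration, and runs it against a stand-in haiku.txt whose contents are an assumption.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)"

# Stand-in for haiku.txt (contents are an assumption).
cat > haiku.txt <<'EOF'
The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.
EOF

# -i: match "the", "The", "THE", and so on (this also hits "Thesis").
grep -i the haiku.txt

# -i -v: print only the lines that do NOT contain "the" in any case.
# Blank lines are printed too, since they do not contain the pattern.
grep -i -v the haiku.txt
```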
slide 23 grep has lots and lots of options.
slide 24 To find out what they are, we can type man grep.
slide 25 man is the Unix “manual” command; it prints a description of a command and its options, and (if you’re lucky) provides a few examples of how to use it.
slide 26 grep’s real power doesn’t come from its options, though; it comes from the fact that its patterns can be regular expressions.
slide 27 (That’s what the “re” in “grep” stands for.)
slide 28 Regular expressions are complex enough that we’ve devoted an entire lecture to them; if you want to do complex searches, please take a few minutes to watch its first few episodes.
slide 29 One caution: grep’s regular expressions use a slightly different syntax than what’s used in most programming languages.
slide 30 However, the basic ideas and rules are exactly the same.
slide 31 While grep finds lines in files, the find command finds files themselves.
slide 32 Again, it has a lot of options—too many to cover here.
slide 33 To show how its basic features work, we’ll use this directory tree. Under Vlad’s home directory is one file, notes.txt, and three subdirectories: thesis (which is sadly empty), data (which contains two files first.txt and second.txt), and a tools directory that contains the programs format and stats, and an empty subdirectory called old.
slide 34 Here’s a textual representation of that same tree, created using the Unix tree command.
slide 35 As with ls -F, trailing /’s show directories…
slide 36 …and trailing *’s show files we could run as programs.
slide 37 For our first command, let’s run find . -type d.
slide 38 Here, . is the root directory of our search: find will only look in it, and the things it contains.
slide 39 -type d means “things that are directories”.
slide 40 Sure enough, find’s output is the names of the five directories in our little tree (including ., the current working directory).
slide 41 If we change -type d to -type f, we get a listing of all the files instead. find automatically goes into subdirectories, their subdirectories, and so on to find everything that matches the pattern we’ve given it.
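To try this without Vlad’s actual home directory, the sketch below rebuilds the tree described above in a scratch directory. The file contents are placeholders invented for illustration; only the names come from the lecture.

```shell
# Rebuild the example tree in a scratch directory.
cd "$(mktemp -d)"
mkdir -p thesis data tools/old
echo "some notes" > notes.txt
echo "first" > data/first.txt
echo "second" > data/second.txt

# Placeholder programs (contents are hypothetical), marked as runnable.
printf '#!/bin/sh\necho format\n' > tools/format
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/format tools/stats

# Directories only: ., ./data, ./thesis, ./tools, and ./tools/old.
find . -type d

# Regular files only, at any depth.
find . -type f
```

find does not promise any particular output order, so pipe the result through sort if the order matters.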
slide 42 If we don’t want to go that deep, we can use -maxdepth to restrict the depth of search. Here, -maxdepth 1 tells find to only look at this level, so the only file it finds is ./notes.txt.
slide 43 The opposite of -maxdepth is -mindepth, which tells find to only report things that are at or below a certain depth. -mindepth 2 therefore finds all the files that are two or more levels below us.
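Here is a small sketch of both depth options, run against a rebuilt copy of the example tree (file contents are placeholders). The depth option is written before -type because GNU find warns when it comes later; the exact commands on the slides may be ordered differently.

```shell
# Rebuild the example tree in a scratch directory.
cd "$(mktemp -d)"
mkdir -p thesis data tools/old
echo "some notes" > notes.txt
echo "first" > data/first.txt
echo "second" > data/second.txt
printf '#!/bin/sh\necho format\n' > tools/format
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/format tools/stats

# Only look at this level: the single file found is ./notes.txt.
find . -maxdepth 1 -type f

# Only report files two or more levels down: the files in data and tools.
find . -mindepth 2 -type f
```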
slide 44 And here’s another option: -empty. This restricts matching to empty files and directories, of which we have two.
slide 45 We can search by permissions, too. Here, for example, we can use -perm -u=x to find both files and directories for which the user has ‘x’ permission.
slide 46 Combine this with -type f to exclude directories, and voila: a list of runnable program files.
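The -empty and -perm tests can be sketched as follows, again on a rebuilt copy of the example tree. The sketch assumes a typical umask, so that newly created directories are searchable by their owner, and it gives every file some placeholder content so that only the two directories are empty.

```shell
# Rebuild the example tree in a scratch directory.
cd "$(mktemp -d)"
mkdir -p thesis data tools/old
echo "some notes" > notes.txt
echo "first" > data/first.txt
echo "second" > data/second.txt
printf '#!/bin/sh\necho format\n' > tools/format
printf '#!/bin/sh\necho stats\n' > tools/stats
chmod u+x tools/format tools/stats

# Empty files and directories: ./thesis and ./tools/old.
find . -empty

# Anything the owner has 'x' permission on; this includes directories.
find . -perm -u=x

# Add -type f to keep only runnable program files.
find . -perm -u=x -type f
```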
slide 47 Let’s try matching by name with find . -name *.txt. We expect it to find all the text files, but it only prints out ./notes.txt. What’s gone wrong?
slide 48 If you recall, the shell expands wildcard characters like * before commands run.
slide 49 Since *.txt in the current directory expands to notes.txt, the command we actually ran was find . -name notes.txt. find did what we asked; we just asked for the wrong thing.
slide 50 Let’s try again, but this time we’ll put *.txt in single quotes to prevent the shell from expanding the * wildcard.
slide 51 This way, find actually gets the pattern *.txt, not the expanded filename notes.txt.
slide 52 Sure enough, this time the output is the names of all three text files.
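The difference between the unquoted and quoted forms can be seen side by side in a sketch like this one, which uses a rebuilt copy of the example tree with placeholder file contents.

```shell
# Rebuild the text files of the example tree in a scratch directory.
cd "$(mktemp -d)"
mkdir -p thesis data tools/old
echo "some notes" > notes.txt
echo "first" > data/first.txt
echo "second" > data/second.txt

# Unquoted: the shell expands *.txt to notes.txt before find runs,
# so this is really "find . -name notes.txt".
find . -name *.txt

# Quoted: find receives the pattern itself and applies it at every level.
find . -name '*.txt'
```

Double quotes protect the * from the shell just as well as single quotes here.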
slide 53 As we said in previous episodes, the command line’s power lies in combining tools. We’ve seen how to do that with pipes; let’s look at another technique.
slide 54 As we just saw, find . -name '*.txt' gives us a list of all text files in or below the current directory.
slide 55 Here’s how to combine that with wc -l to count the lines in all those files.
slide 56 The trick here is to put the find command inside back quotes.
slide 57 This tells the shell to run find and then replace what’s in the back quotes with the command’s output.
slide 58 This is exactly what the shell does when it expands *, ?, and other built-in wildcards, but more flexible, since we can use any command we want as our own “wildcard”.
slide 59 So, when the shell executes this line, the first thing it does is run the command that’s inside the back quotes. Its output is the three filenames ./data/first.txt, ./data/second.txt, and ./notes.txt.
slide 60 The shell then replaces the back quotes with that output to construct the command wc -l ./data/first.txt ./data/second.txt ./notes.txt.
slide 61 And as you can see, that does what we originally wanted.
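A minimal sketch of the back-quote technique, using a rebuilt copy of the example tree. The line counts of the three files are invented for illustration, and the approach assumes that none of the filenames contain whitespace.

```shell
# Rebuild the text files of the example tree in a scratch directory.
cd "$(mktemp -d)"
mkdir -p data
printf 'one\ntwo\nthree\n' > notes.txt      # 3 lines (placeholder content)
printf 'one\ntwo\n' > data/first.txt        # 2 lines
printf 'one\n' > data/second.txt            # 1 line

# The shell runs find first, then substitutes its output into the wc command,
# so wc receives the three filenames as its arguments.
wc -l `find . -name '*.txt'`
```

Because wc is given more than one file, its last line reports the total, which is 6 for these placeholder files.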
slide 62 It’s very common to use find and grep together. The first finds files that match a pattern; the second looks for lines inside those files.
slide 63 Here, for example, we can find PDB files that contain iron atoms by looking for the string “FE” in all the .pdb files below the current directory.
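One way this combination might look is sketched below. The directory name, file names, and the records inside the .pdb files are all hypothetical stand-ins, not real PDB data, and the exact command on the slide may differ.

```shell
# Create two hypothetical .pdb files in a scratch directory.
cd "$(mktemp -d)"
mkdir -p molecules
printf 'ATOM 1 FE\nATOM 2 C\n' > molecules/heme.pdb    # contains "FE"
printf 'ATOM 1 O\nATOM 2 H\n' > molecules/water.pdb    # does not

# find lists the .pdb files; grep then searches inside each of them.
# With several files, grep prefixes each matching line with its filename.
grep FE `find . -name '*.pdb'`

# -l prints only the names of the files that contain a match.
grep -l FE `find . -name '*.pdb'`
```

As with the wc example, this assumes filenames without whitespace.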
slide 64 So far, we have focused exclusively on finding things in text files. What if your data isn’t text?
slide 65 What if we have images, databases, spreadsheets, or some other format? There are basically three options.
slide 66 The first is to extend tools like grep to handle those formats.
slide 67 This hasn’t happened, and probably won’t, because there are too many formats to support.
slide 68 The second option is to convert the data to text, or extract the text-y bits from the data. This is probably the most common approach, since it only requires people to build one tool per data format (to extract information).
slide 69 On the positive side, this makes simple things easy to do.
slide 70 On the negative side, complex things are usually impossible. For example, it’s easy enough to write a program that will extract X and Y dimensions from image files for grep to play with, but how would you write something to find values in a spreadsheet whose cells contained formulas?
slide 71 The third choice is to recognize that the shell and text processing have their limits, and to use a programming language such as Python instead.
slide 72 When the time comes to do this, don’t be too hard on the shell: many modern programming languages, Python included, have borrowed a lot of ideas from it.
slide 73 And imitation is, after all, the sincerest form of praise.

  1. @IsaacG2
    September 2nd, 2010 at 17:23 | #1

    1. Rather than using command substitution (the back quotes), which breaks on filenames with whitespace, use find -exec; it’s the “proper” approach.

    2. The ‘file’ command “knows” many many file formats. Useful in some scripts.

  2. klahnb
    January 18th, 2011 at 08:04 | #2

    -nice introduction to some shell commands which, quite frequently, are found to be very useful. Some things come to mind:

    1. If you are looking for a command, but don’t know what it might be called, you can use “man -k _keyword_” to search for a keyword in the command descriptions. This is identical to the “apropos _keyword_” command. Caveat: it can be difficult to choose a keyword that is unique enough to not get too many results; and the command descriptions vary in usefulness. Use “man man” to see other options.

    2. The ‘strings’ command can be used to find string encodings in binary formats (.doc, etc.). If you just want to search for a word in a binary file, it’s worth a try to run it on the file and pipe its output to grep. But if you have python installed, you may as well use its capabilities.

    3. It may (or may not) be worth noting: Rather than using back quotes, the newer syntax for command substitution (at least in bash) is:
    $(_a_command_)
    I *believe* part of the reasoning is that it’s less ambiguous when nesting substitutions, and clearer when read. To nest with back quotes, one must escape the inner back quotes with a backslash, and also remember when a backslash is taken literally and when it is interpreted as an escape character.
    That said, I use back quotes a lot; they are slightly faster to type, and I’m normally not nesting at the prompt in a one-liner. The GNU bash manual doesn’t go so far as to describe the older form as “deprecated”, so . . .

    P.S.
    A similar-looking form is:
    $((_an_expression_))
    -useful in scripts, but also in the interactive shell.
    echo $((_your_expression_)) . . .
    . . . at the command prompt to do a quick math calculation (without entering bc, etc.); or see if your conditional expression is returning what you thought it should. For example:
    echo $((8745 + 98733)) (returns “107478”)
    echo $((2 + 2 == 5)) (returns “0” for “False”)
    Actually, “printf” is more robust and featureful than “echo”. (Classic Shell Scripting; Beebe and Robbins; O’Reilly)

  3. Terri Yu
    January 19th, 2011 at 01:53 | #3

    Unfortunately, there is no tree command in Cygwin.

  4. wenwei
    March 29th, 2011 at 14:38 | #4

    When I type tree in the Ubuntu shell, it asks me to install the tree program with the command: sudo apt -get install tree, which resulted in “sudo: apt: command not found”. How can I get the tree program to run in the shell? Thanks

  5. March 31st, 2011 at 00:51 | #5

    @wenwei – There is no space after apt. It should be sudo apt-get install tree which will work fine.
