8 Class 14. Loops and scripts

Practice with Loops

This exercise uses data found within the data-shell/molecules directory. ls gives the following output:

cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb

What is the output of running the following loop in the molecules directory?

$ for filename in c*

> do

>    ls $filename

> done

How would the output differ from using this command instead?

$ for filename in *c*

> do

>    ls $filename

> done

* matches zero or more characters, so c* matches any file name that begins with the letter c, while *c* matches any file name that contains a c anywhere: zero or more characters before the c and zero or more characters after it.
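To try the two patterns without touching the real data, the sketch below recreates the six file names as empty files in a temporary scratch directory and records what each loop visits. The scratch directory and the variable names first and second are assumptions made for illustration; they are not part of the lesson data.

```shell
# Build a throwaway directory holding empty files with the lesson's names.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

# c* : names that start with c
first=""
for filename in c*
do
    first="$first $filename"
done

# *c* : names that contain a c anywhere
second=""
for filename in *c*
do
    second="$second $filename"
done

echo "c*  matched:$first"
echo "*c* matched:$second"
```

Run this way, the first pattern should pick up only cubane.pdb, and the second should pick up cubane.pdb and octane.pdb.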

Let’s continue with our example in the data-shell/creatures directory. Here’s a slightly more complicated loop:

$ for filename in *.dat

> do

>     echo $filename

>     head -n 100 $filename | tail -n 20

> done

 

The shell starts by expanding *.dat to create the list of files it will process. The loop body then executes two commands for each of those files. The first, echo, just prints its command-line arguments to standard output.

In this case, since the shell expands $filename to be the name of a file, echo $filename just prints the name of the file. Note that we can’t write this as:

$ for filename in *.dat

> do

>     $filename

>     head -n 100 $filename | tail -n 20

> done

because then the first time through the loop, when $filename expanded to basilisk.dat, the shell would try to run basilisk.dat as a program.

Finally, the head and tail combination selects lines 81-100 from whatever file is being processed (assuming the file has at least 100 lines).
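One way to convince yourself of the line range is to run the same pipeline on a file whose contents are its own line numbers. This is a minimal sketch: the 120-line stand-in file built with seq is an assumption for illustration, not one of the creatures data files.

```shell
# Generate a 120-line stand-in file where line N contains the number N.
sample=$(mktemp)
seq 1 120 > "$sample"

# Keep the first 100 lines, then the last 20 of those: lines 81 to 100.
selected=$(head -n 100 "$sample" | tail -n 20)

first_line=$(echo "$selected" | head -n 1)
last_line=$(echo "$selected" | tail -n 1)
echo "first selected line: $first_line"
echo "last selected line: $last_line"

rm -f "$sample"
```

The first selected line is 81 and the last is 100, matching the description above.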

Prefixing a command with echo makes it possible to see each command as it would be executed, without actually running it. This is a good debugging technique because it shows which input files are going into your loop and gives you confidence that the loop does what you expect.

Nelle’s Pipeline: Processing Files

Nelle is now ready to process her data files using goostats — a shell script written by her supervisor. This calculates some statistics from a protein sample file, and takes two arguments:

  1. an input file (containing the raw data)
  2. an output file (to store the calculated statistics)

Her first step is to make sure that she can select the right input files: remember, these are ones whose names end in ‘A’ or ‘B’, rather than ‘Z’. Starting from her home directory, Nelle types:

$ cd north-pacific-gyre/2012-07-03

$ for datafile in NENE*[AB].txt

> do

>     echo $datafile

> done

Her next step is to run the program. She tries it out with one of her files:

The goostats program requires 1) an input file and 2) an output file name.

Her next step is to decide what to call the OUTPUT files that the goostats analysis program will create. Prefixing each input file’s name with “stats” seems simple, so she modifies her loop to do that:

$ for datafile in NENE*[AB].txt

> do

>     echo $datafile stats-$datafile

> done

NENE01729A.txt stats-NENE01729A.txt

NENE01729B.txt stats-NENE01729B.txt

NENE01736A.txt stats-NENE01736A.txt

She hasn’t actually run goostats yet, but now she’s sure she can select the right files and generate the right output filenames.
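The selection step can be rehearsed in a scratch directory. The sketch below uses three empty stand-in files; the Z file name is hypothetical and is included only to show that the [AB] part of the pattern skips it.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
# Stand-in files: two wanted files (A, B) and one hypothetical Z file to skip.
touch NENE01729A.txt NENE01729B.txt NENE01751Z.txt

pairs=""
for datafile in NENE*[AB].txt
do
    # Print the input name and the output name goostats would be given.
    echo "$datafile stats-$datafile"
    pairs="$pairs$datafile:stats-$datafile "
done
```

Only the A and B files appear in the output, each followed by its stats- name; the Z file is never visited.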

Typing in commands over and over again is becoming tedious, though, and Nelle is worried about making mistakes, so instead of re-entering her loop, she presses the up arrow. In response, the shell redisplays the whole loop on one line (using semicolons to separate the pieces):

$ for datafile in NENE*[AB].txt; do echo $datafile stats-$datafile; done

Now she is ready to figure out how to run her program using the bash command.

If she were running just one file, she could enter each argument manually:

$ bash goostats NENE01729B.txt stats-NENE01729B.txt

 

But Nelle wants to run her program on ALL of her files at once.

Using the left arrow key, Nelle backs up and changes the command echo to bash goostats:

$ for datafile in NENE*[AB].txt; do bash goostats $datafile stats-$datafile; done

 

When she presses Enter, the shell runs the modified command. However, nothing appears to happen — there is no output. After a moment, Nelle realizes that since her script doesn’t print anything to the screen any longer, she has no idea whether it is running, much less how quickly. She kills the running command by typing Ctrl-C, uses up-arrow to repeat the command, and edits it to read:

$ for datafile in NENE*[AB].txt; do echo $datafile; bash goostats $datafile stats-$datafile; done

 

Beginning and End

We can move to the beginning of a line in the shell by typing Ctrl-a and to the end using Ctrl-e.

When she runs her program now, it produces one line of output every five seconds or so:

NENE01729A.txt

NENE01729B.txt

1518 times 5 seconds, divided by 60, tells her that her script will take about two hours to run. As a final check, she opens another terminal window, goes into north-pacific-gyre/2012-07-03, and uses cat stats-NENE01729B.txt to examine one of the output files. It looks good, so she decides to get some coffee and catch up on her reading.
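The estimate can be checked with the shell's built-in integer arithmetic. The variable names below are illustrative; the two numbers come from the text.

```shell
files=1518            # number of data files, from the text
seconds_per_file=5    # rough time per file, from the text

total_seconds=$(( files * seconds_per_file ))
total_minutes=$(( total_seconds / 60 ))   # integer division drops the remainder
echo "$total_seconds seconds is about $total_minutes minutes"
```

This gives 7590 seconds, or 126 whole minutes, which is a little over two hours.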

Doing a Dry Run

A loop is a way to do many things at once — or to make many mistakes at once if it does the wrong thing. One way to check what a loop would do is to echo the commands it would run instead of actually running them.
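As a minimal sketch of a dry run, the loop below puts echo in front of the goostats command, so the command lines are printed rather than executed. It runs in a scratch directory with two empty stand-in files (an assumption for illustration), and goostats itself is never called.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch NENE01729A.txt NENE01729B.txt   # empty stand-in files

# Dry run: echo prints each command line instead of executing it.
planned=$(for datafile in NENE*[AB].txt
do
    echo bash goostats "$datafile" "stats-$datafile"
done)
echo "$planned"

# Nothing was executed, so no stats- files should exist yet.
created=0
for f in stats-*
do
    if [ -e "$f" ]; then created=$(( created + 1 )); fi
done
echo "stats files created: $created"
```

Once the printed commands look right, removing the echo turns the dry run into the real run.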

Key Points

  • A for loop repeats commands once for every thing in a list.
  • Every for loop needs a variable to refer to the thing it is currently operating on.
  • Use $name to expand a variable (i.e., get its value). ${name} can also be used.
  • Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as they complicate variable expansion.
  • Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.
  • Use the up-arrow key to scroll up through previous commands to edit and repeat them.
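To illustrate why spaces in filenames complicate variable expansion, the sketch below counts how many arguments a command receives when a variable is expanded with and without double quotes, and shows the ${name} form. The file name and the count_args helper are hypothetical, and quoting is shown here only as one common way to cope with such names, not as part of the lesson.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch "red dragon.dat"    # hypothetical file name containing a space

count_args() { echo "$#"; }   # helper: prints how many arguments it received

for filename in *.dat
do
    unquoted=$(count_args $filename)      # split at the space into two words
    quoted=$(count_args "$filename")      # stays a single argument
    braces="${filename}_backup"           # braces mark where the name ends
done
echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
echo "$braces"
```

Without quotes the single file name arrives as two arguments, which is why a command such as head would look for two files that do not exist. The braces are needed in the last assignment because $filename_backup would be read as a different, unset variable.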

 

 

License

BIOL446/BIOL546 Bioinformatics Coding Guides Copyright © by emilymeredith is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
