9 Class 15. Writing scripts

Writing and running scripts:

Content from The Software Carpentries, CC-BY.

We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.

Let’s start by going back to molecules/ and creating a new file, middle.sh which will become our shell script:

$ cd molecules

$ nano middle.sh

 

The command nano middle.sh opens the file middle.sh within the text editor “nano” (which runs within the shell). If the file does not exist, it will be created. We can use the text editor to directly edit the file – we’ll simply insert the following line:

head -n 15 octane.pdb | tail -n 5

This is a variation on a pipe we constructed earlier: it selects lines 11-15 of the file octane.pdb. Remember, we are not running it as a command just yet: we are putting the commands in a file.

Then we save the file (Ctrl-O in nano), and exit the text editor (Ctrl-X in nano). Check that the directory molecules now contains a file called middle.sh.

Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called bash, so we run the following command:

$ bash middle.sh

ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00

ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00

ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00

ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00

ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00

 

Sure enough, our script’s output is exactly what we would get if we ran that pipeline directly.

Text vs. Whatever

We usually call programs like Microsoft Word  “text editors”, but we need to be a bit more careful when it comes to programming. By default, Microsoft Word uses .docx files to store not only text, but also formatting information about fonts, headings, and so on. This extra information isn’t stored as characters, and doesn’t mean anything to tools like head: they expect input files to contain nothing but the letters, digits, and punctuation on a standard computer keyboard. When editing programs, therefore, you must either use a plain text editor, or be careful to save files as plain text.

What if we want to select lines from an arbitrary file? We could edit middle.sh each time to change the filename, but that would probably take longer than just retyping the command. Instead, let’s edit middle.sh and make it more versatile:

$ nano middle.sh

Now, within “nano”, replace the text octane.pdb with the special variable called $1:

head -n 15 “$1” | tail -n 5

 

Inside a shell script, $1 means “the first filename (or other argument) on the command line”. We can now run our script like this:

$ bash middle.sh octane.pdb

ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00

ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00

ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00

ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00

ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00

 

or on a different file like this:

$ bash middle.sh pentane.pdb

 

Double-Quotes Around Arguments

For the same reason that we put the loop variable inside double-quotes, in case the filename happens to contain any spaces, we surround $1 with double-quotes.

We still need to edit middle.sh each time we want to adjust the range of lines, though. Let’s fix that by using the special variables $2 and $3 for the number of lines to be passed to head and tail respectively:

$ nano middle.sh

head -n “$2” “$1” | tail -n “$3”

We can now run:

$ bash middle.sh pentane.pdb 15 5

ATOM      9  H           1       1.324   0.350  -1.332  1.00  0.00

ATOM     10  H           1       1.271   1.378   0.122  1.00  0.00

ATOM     11  H           1      -0.074  -0.384   1.288  1.00  0.00

ATOM     12  H           1      -0.048  -1.362  -0.205  1.00  0.00

ATOM     13  H           1      -1.183   0.500  -1.412  1.00  0.00

 

By changing the arguments to our command we can change our script’s behaviour:

$ bash middle.sh pentane.pdb 20 5

ATOM     14  H           1      -1.259   1.420   0.112  1.00  0.00

ATOM     15  H           1      -2.608  -0.407   1.130  1.00  0.00

ATOM     16  H           1      -2.540  -1.303  -0.404  1.00  0.00

ATOM     17  H           1      -3.393   0.254  -0.321  1.00  0.00

TER      18              1

 

 

This works, but it may take the next person who reads middle.sh a moment to figure out what it does. We can improve our script by adding some comments at the top:

$ nano middle.sh

# Select lines from the middle of a file.

# Usage: bash middle.sh filename end_line num_lines

head -n “$2” “$1” | tail -n “$3”

 

A comment starts with a # character and runs to the end of the line. The computer ignores comments, but they’re invaluable for helping people (including your future self) understand and use scripts. The only caveat is that each time you modify the script, you should check that the comment is still accurate: an explanation that sends the reader in the wrong direction is worse than none at all.

What if we want to process many files in a single pipeline? For example, if we want to sort our .pdb files by length, we would type:

$ wc -l *.pdb | sort -n

 

because wc -l lists the number of lines in the files (recall that wc stands for ‘word count’, adding the -l flag means ‘count lines’ instead) and sort -n sorts things numerically. We could put this in a file, but then it would only ever sort a list of .pdb files in the current directory. If we want to be able to get a sorted list of other kinds of files, we need a way to get all those names into the script. We can’t use $1, $2, and so on because we don’t know how many files there are. Instead, we use the special variable $@, which means, “All of the command-line arguments to the shell script.” We also should put $@ inside double-quotes to handle the case of arguments containing spaces (“$@” is equivalent to “$1″”$2” …) Here’s an example:

$ nano sorted.sh

# Sort filenames by their length.

# Usage: bash sorted.sh one_or_more_filenames

wc -l “$@” | sort -n

$ bash sorted.sh *.pdb ../creatures/*.dat

9 methane.pdb

12 ethane.pdb

15 propane.pdb

20 cubane.pdb

21 pentane.pdb

30 octane.pdb

163 ../creatures/basilisk.dat

163 ../creatures/unicorn.dat

 

Nelle’s Pipeline: Creating a Script

Nelle’s supervisor insisted that all her analytics must be reproducible. The easiest way to capture all the steps is in a script. She runs the editor and writes the following:

# Calculate stats for data files.

for datafile in “$@”

do

    echo $datafile

bash goostats $datafile stats-$datafile

done

 

She saves this in a file called do-stats.sh so that she can now re-do the first stage of her analysis by typing:

$ bash do-stats.sh NENE*[AB].txt

 

She can also do this:

$ bash do-stats.sh NENE*[AB].txt | wc -l

 

so that the output is just the number of files processed rather than the names of the files that were processed.

One thing to note about Nelle’s script is that it lets the person running it decide what files to process. She could have written it as:

# Calculate stats for Site A and Site B data files.

for datafile in NENE*[AB].txt

do

    echo $datafile

bash goostats $datafile stats-$datafile

done

 

The advantage is that this always selects the right files: she doesn’t have to remember to exclude the ‘Z’ files. The disadvantage is that it always selects just those files — she can’t run it on all files (including the ‘Z’ files), or on the ‘G’ or ‘H’ files her colleagues in Antarctica are producing, without editing the script. If she wanted to be more adventurous, she could modify her script to check for command-line arguments, and use NENE*[AB].txt if none were provided. Of course, this introduces another tradeoff between flexibility and complexity.

License

Icon for the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

BIOL446/BIOL546 Bioinformatics Coding Guides Copyright © by emilymeredith is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.

Share This Book