5 Class 6. Unix sorting, redirection, filtering with head/tail

Command line practice: sorting, redirecting, head/tail

Content modified from The Software Carpentries

For these tutorials, make sure you have downloaded and unzipped the datashell folder, available on Canvas or here.

Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways. We’ll start with a directory called molecules that contains six files describing some simple organic molecules.

What are these files? Where did they come from?

The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.

$ ls molecules

cubane.pdb    ethane.pdb    methane.pdb

octane.pdb    pentane.pdb   propane.pdb

 

Let’s go into that directory with cd and run the command wc *.pdb. wc is the “word count” command: it counts the number of lines, words, and characters in files (from left to right, in that order).

The * in *.pdb matches zero or more characters, so the shell turns *.pdb into a list of all .pdb files in the current directory:

$ cd molecules

$ wc *.pdb

20  156  1158  cubane.pdb

12  84   622   ethane.pdb

9  57   422   methane.pdb

30  246  1828  octane.pdb

21  165  1226  pentane.pdb

15  111  825   propane.pdb

107  819  6081  total
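To see what the shell does with a wildcard before a command ever runs, we can hand the pattern to echo, which simply prints its arguments. This is a minimal sketch: the three file names are hypothetical, and mktemp and touch (neither covered in this tutorial) are used only to set up an empty scratch directory so none of your own files are affected.

```shell
cd "$(mktemp -d)"                        # work in a fresh, empty scratch directory
touch cubane.pdb ethane.pdb notes.txt    # create three empty, hypothetical files

echo *.pdb       # the shell expands the pattern first: cubane.pdb ethane.pdb
echo *.txt       # only one file matches: notes.txt
echo *ane.pdb    # * can match any run of characters: cubane.pdb ethane.pdb
```

In each case the command receives the expanded list of file names, never the * itself.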

 

If we run wc -l instead of just wc, the output shows only the number of lines per file:

$ wc -l *.pdb

20  cubane.pdb

12  ethane.pdb

9  methane.pdb

30  octane.pdb

21  pentane.pdb

15  propane.pdb

107  total

 

Remember that we can also use -w to get only the number of words, or -c to get only the number of characters.
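The three flags can be compared side by side on a tiny file whose counts are easy to check by hand. This sketch assumes a hypothetical file, sample.txt, created with printf (not covered in this tutorial) inside a scratch directory. Note that -c strictly counts bytes, which is the same as characters for plain ASCII text like this.

```shell
cd "$(mktemp -d)"                          # fresh scratch directory
printf 'one two\nthree\n' > sample.txt     # hypothetical file: 2 lines, 3 words, 14 bytes

wc -l sample.txt    # reports 2 (lines)
wc -w sample.txt    # reports 3 (words)
wc -c sample.txt    # reports 14 (bytes; newlines are counted too)
```

The exact spacing in front of each number may differ between systems.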

Which of these files is shortest? It’s an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:

$ wc -l *.pdb > lengths.txt

 

The greater than symbol, >, tells the shell to redirect the command’s output to a file instead of printing it to the screen. (This is why there is no screen output: everything that wc would have printed has gone into the file lengths.txt instead.) The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead to data loss and thus requires some caution. ls lengths.txt confirms that the file exists:

$ ls lengths.txt

lengths.txt
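The silent overwrite is easy to demonstrate on a throwaway file. The sketch below uses echo, which prints its arguments, and a hypothetical file called notes.txt. It also shows one optional safeguard that is not part of this tutorial: the shell's noclobber option, which makes > refuse to replace an existing file.

```shell
cd "$(mktemp -d)"            # fresh scratch directory
echo first > notes.txt       # creates notes.txt
echo second > notes.txt      # silently replaces the earlier content
cat notes.txt                # shows only: second

set -o noclobber             # optional: make > refuse to overwrite existing files
echo third > notes.txt || echo "refused to overwrite notes.txt"
cat notes.txt                # still shows: second
set +o noclobber             # restore the default behavior
```

Whether to turn noclobber on is a matter of preference; the default behavior is the one described above.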

 

We can now send the content of lengths.txt to the screen using cat lengths.txt. cat stands for “concatenate”: it prints the contents of files one after another. There’s only one file in this case, so cat just shows us what it contains:

$ cat lengths.txt

20  cubane.pdb

12  ethane.pdb

9  methane.pdb

30  octane.pdb

21  pentane.pdb

15  propane.pdb

107  total
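The name "concatenate" makes more sense when cat is given more than one file. As a small illustration with two hypothetical one-line files (created here with echo, which prints its arguments):

```shell
cd "$(mktemp -d)"                        # fresh scratch directory
echo alpha > first.txt
echo beta > second.txt

cat first.txt second.txt                 # prints alpha, then beta
cat first.txt second.txt > both.txt      # same output, redirected into a new file
cat both.txt                             # both.txt now holds alpha and beta
```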

 

What Does sort -n do?

Create a text file called numbers.txt and enter the following numbers.

$ nano numbers.txt

10

2

19

22

6

 

If we run sort on this file by typing

$ sort numbers.txt

the output is:

10

19

2

22

6

 

This is because sort orders lines alphabetically by default, comparing them character by character, so 10 and 19 come before 2. The -n flag specifies that the sort is numerical instead. Sorting does not change the file; instead, sort sends the sorted result to the screen:

$ sort -n lengths.txt

9  methane.pdb

12  ethane.pdb

15  propane.pdb

20  cubane.pdb

21  pentane.pdb

30  octane.pdb

107  total
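To confirm both points, that -n changes the ordering and that sort leaves its input untouched, we can rerun the numbers example in a scratch directory. This sketch recreates numbers.txt with printf (not covered in this tutorial) rather than nano; alpha.txt and numeric.txt are hypothetical output names.

```shell
cd "$(mktemp -d)"                              # fresh scratch directory
printf '10\n2\n19\n22\n6\n' > numbers.txt      # the same five numbers used earlier

sort numbers.txt > alpha.txt                   # default: compares character by character
sort -n numbers.txt > numeric.txt              # -n: compares numeric values

cat alpha.txt      # 10, 19, 2, 22, 6 (one per line)
cat numeric.txt    # 2, 6, 10, 19, 22 (one per line)
cat numbers.txt    # unchanged: 10, 2, 19, 22, 6
```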

Filtering a text file with the head and tail commands

We can put the sorted list of lines in another temporary file called sorted-lengths.txt by putting > sorted-lengths.txt after the command, just as we used > lengths.txt to put the output of wc into lengths.txt. Once we’ve done that, we can run another command called head to get the first few lines in sorted-lengths.txt:

$ sort -n lengths.txt > sorted-lengths.txt

$ head -n 1 sorted-lengths.txt

9  methane.pdb

 

Using -n 1 with head tells it that we only want the first line of the file; -n 20 would get the first 20, and so on. Since sorted-lengths.txt contains the lengths of our files ordered from least to greatest, the output of head must be the file with the fewest lines.
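The three steps can be tried end to end without the molecules directory. The sketch below assumes three hypothetical .dat files of 3, 2, and 1 lines, created with printf (not covered in this tutorial) in a scratch directory.

```shell
cd "$(mktemp -d)"                       # fresh scratch directory
printf 'a\nb\nc\n' > long.dat           # 3 lines
printf 'a\nb\n' > medium.dat            # 2 lines
printf 'a\n' > short.dat                # 1 line

wc -l *.dat > lengths.txt                   # step 1: count lines in every .dat file
sort -n lengths.txt > sorted-lengths.txt    # step 2: order the counts numerically
head -n 1 sorted-lengths.txt                # step 3: the first line names short.dat
```

Because the intermediate files end in .txt, the *.dat pattern never picks them up on a second run.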

It’s a very bad idea to try redirecting the output of a command that operates on a file to the same file. For example:

$ sort -n lengths.txt > lengths.txt

 

Doing something like this may give you incorrect results and/or delete the contents of lengths.txt.
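The reason is that the shell empties the output file before the command starts, so sort may find nothing left to read. One common way around this, sketched here with a hypothetical stand-in for lengths.txt and a hypothetical temporary name lengths.tmp, is to write to a different file first and then rename it with mv.

```shell
cd "$(mktemp -d)"                       # fresh scratch directory
printf '10\n2\n6\n' > lengths.txt       # hypothetical stand-in for the real lengths.txt

sort -n lengths.txt > lengths.tmp       # write the sorted result to a different file
mv lengths.tmp lengths.txt              # then replace the original with it
cat lengths.txt                         # 2, 6, 10 (one per line)
```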

tail follows the same format as the head command: use -n to specify the number of lines to show. For example:

$ tail -n 1 sorted-lengths.txt

This command would show you the last line of the sorted-lengths.txt file.
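In our sorted file the last line is the total that wc adds, so it can be useful to ask tail for two lines instead of one. This sketch recreates the sorted line counts shown earlier with printf (not covered in this tutorial), with the spacing simplified.

```shell
cd "$(mktemp -d)"                # fresh scratch directory
# Recreate the sorted line counts from earlier (spacing simplified).
printf '9 methane.pdb\n12 ethane.pdb\n15 propane.pdb\n20 cubane.pdb\n21 pentane.pdb\n30 octane.pdb\n107 total\n' > sorted-lengths.txt

tail -n 1 sorted-lengths.txt     # 107 total
tail -n 2 sorted-lengths.txt     # 30 octane.pdb, then 107 total
```

The second command shows that octane.pdb is the longest file, just above the total.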

What Does >> Mean?

We have seen the use of >, but there is a similar operator >> which works slightly differently. By using the echo command to print strings, test the commands below to reveal the difference between the two operators:

$ echo hello > testfile01.txt

$ echo hello >> testfile02.txt

Hint: Try executing each command twice in a row and then examining the output files.

Using the >> operator appends data to the end of the file, while using > overwrites the contents of the file if it already exists.
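Running each command twice, as the hint suggests, and then counting lines makes the difference visible. A minimal version in a scratch directory (created with mktemp, which is not covered in this tutorial):

```shell
cd "$(mktemp -d)"                # fresh scratch directory
echo hello > testfile01.txt
echo hello > testfile01.txt      # the second run overwrites the first
echo hello >> testfile02.txt
echo hello >> testfile02.txt     # the second run adds another line

wc -l testfile01.txt             # reports 1
wc -l testfile02.txt             # reports 2
```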

Command review from this tutorial:

  • cat displays the contents of its inputs.
  • head displays the first 10 lines of its input. You can adjust the number of lines to display with the -n option.
  • tail displays the last 10 lines of its input. You can adjust the number of lines to display with the -n option.
  • sort sorts its inputs.
  • wc counts lines, words, and characters in its inputs.
  • command > file redirects a command’s output to a file (overwriting any existing content).
  • command >> file appends a command’s output to the end of a file.

License


BIOL446/BIOL546 Bioinformatics Coding Guides Copyright © by emilymeredith is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
