5 Class 6. Unix sorting, redirection, filtering with head/tail
Command line practice: sorting, redirecting, head/tail
Content modified from The Software Carpentries
Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways. We’ll start with a directory called molecules that contains six files describing some simple organic molecules.
What are these files? Where did they come from?
The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.
$ ls molecules
cubane.pdb ethane.pdb methane.pdb
octane.pdb pentane.pdb propane.pdb
Let’s go into that directory with cd and run the command wc *.pdb. wc is the “word count” command: it counts the number of lines, words, and characters in files (from left to right, in that order).
The * in *.pdb matches zero or more characters, so the shell turns *.pdb into a list of all .pdb files in the current directory:
$ cd molecules
$ wc *.pdb
20 156 1158 cubane.pdb
12 84 622 ethane.pdb
9 57 422 methane.pdb
30 246 1828 octane.pdb
21 165 1226 pentane.pdb
15 111 825 propane.pdb
107 819 6081 total
If we run wc -l instead of just wc, the output shows only the number of lines per file:
$ wc -l *.pdb
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
Remember that we can also use -w to get only the number of words, or -c to get only the number of characters.
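As a quick check of what each flag reports, here is a minimal sketch on a tiny file. The file name sample.txt and its contents are made up for illustration, and the first line only sets the file up (the > it uses is explained in the next step). The exact spacing of wc output can differ between systems, so the comments give only the counts.

```shell
# Create a two-line sample file (hypothetical name and contents).
printf 'methane has five atoms\nethane has eight atoms\n' > sample.txt

wc -l sample.txt   # reports 2 lines
wc -w sample.txt   # reports 8 words
wc -c sample.txt   # reports 46 characters (including the two newlines)
```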
Which of these files is shortest? It’s an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:
$ wc -l *.pdb > lengths.txt
The greater than symbol, >, tells the shell to redirect the command’s output to a file instead of printing it to the screen. (This is why there is no screen output: everything that wc would have printed has gone into the file lengths.txt instead.) The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead to data loss and thus requires some caution. ls lengths.txt confirms that the file exists:
$ ls lengths.txt
lengths.txt
We can now send the content of lengths.txt to the screen using cat lengths.txt. cat stands for “concatenate”: it prints the contents of files one after another. There’s only one file in this case, so cat just shows us what it contains:
$ cat lengths.txt
20 cubane.pdb
12 ethane.pdb
9 methane.pdb
30 octane.pdb
21 pentane.pdb
15 propane.pdb
107 total
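To see the "one after another" behaviour, cat can be given more than one file. The sketch below uses two made-up files, part1.txt and part2.txt, which are not part of the molecules data.

```shell
# Two small sample files (hypothetical names and contents).
echo "first file" > part1.txt
echo "second file" > part2.txt

# cat prints each file in the order given, one after another.
cat part1.txt part2.txt
# first file
# second file

# Redirecting that output joins the two files into a third one.
cat part1.txt part2.txt > combined.txt
```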
What Does sort -n do?
Create a text file called numbers.txt and enter the following numbers, one per line.
$ nano numbers.txt
10
2
19
22
6
If we run sort on this file by typing
$ sort numbers.txt
the output is:
10
19
2
22
6
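Without -n, sort compares the lines as text, character by character, so 10 and 19 come before 2 because the character 1 sorts before the character 2. The -n flag tells sort to compare the lines as numbers instead. Applied to lengths.txt, it orders the files by line count. This does not change the file; instead, it sends the sorted result to the screen: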
$ sort -n lengths.txt
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
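For comparison, here is a small sketch of both sort modes on the five numbers from the numbers.txt exercise. The block recreates the file with printf so it can be run on its own; the output file name numbers-sorted.txt is made up for illustration.

```shell
# Recreate the five-line numbers file from the exercise.
printf '10\n2\n19\n22\n6\n' > numbers.txt

# Text sort: lines are compared character by character.
sort numbers.txt
# 10
# 19
# 2
# 22
# 6

# Numerical sort, saved to a second file and then displayed.
sort -n numbers.txt > numbers-sorted.txt
cat numbers-sorted.txt
# 2
# 6
# 10
# 19
# 22
```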
Filtering a text file with the head and tail commands
We can put the sorted list of lines in another temporary file called sorted-lengths.txt by putting > sorted-lengths.txt after the command, just as we used > lengths.txt to put the output of wc into lengths.txt. Once we’ve done that, we can run another command called head to get the first few lines in sorted-lengths.txt:
$ sort -n lengths.txt > sorted-lengths.txt
$ head -n 1 sorted-lengths.txt
9 methane.pdb
Using -n 1 with head tells it that we only want the first line of the file; -n 20 would get the first 20, and so on. Since sorted-lengths.txt contains the lengths of our files ordered from least to greatest, the output of head must be the file with the fewest lines.
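As a sketch of how the number after -n changes the result, the block below asks for the three shortest files instead of one. It builds its own copy of the sorted list under the made-up name sorted-demo.txt, using the line counts shown earlier, so that it does not touch your real files.

```shell
# Stand-in for sorted-lengths.txt (hypothetical file name, counts from above).
printf '9 methane.pdb\n12 ethane.pdb\n15 propane.pdb\n20 cubane.pdb\n21 pentane.pdb\n30 octane.pdb\n107 total\n' > sorted-demo.txt

# Show the first three lines: the three files with the fewest lines.
head -n 3 sorted-demo.txt
# 9 methane.pdb
# 12 ethane.pdb
# 15 propane.pdb
```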
It’s a very bad idea to try redirecting the output of a command that operates on a file to the same file. For example:
$ sort -n lengths.txt > lengths.txt
Doing something like this may give you incorrect results and/or delete the contents of lengths.txt.
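The reason is that the shell empties the target file before the command gets a chance to read it. One common workaround, sketched here with made-up file names (lengths-demo.txt and lengths-demo.tmp), is to write the result to a second file and then rename it with mv:

```shell
# Small unsorted sample file (hypothetical name, counts taken from above).
printf '20 cubane.pdb\n12 ethane.pdb\n9 methane.pdb\n' > lengths-demo.txt

# Step 1: send the sorted output to a separate temporary file.
sort -n lengths-demo.txt > lengths-demo.tmp

# Step 2: once sort has finished, replace the original with the sorted copy.
mv lengths-demo.tmp lengths-demo.txt

cat lengths-demo.txt
# 9 methane.pdb
# 12 ethane.pdb
# 20 cubane.pdb
```

sort also has an -o option for naming an output file, which may be the same as the input file, but the two-step version works with any command.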
The tail command follows the same format as head: use -n to specify the number of lines to show, counted from the end of the file. For example:
$ tail -n 1 sorted-lengths.txt
This command would show you the last line of the sorted-lengths.txt file.
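One detail worth noticing: because wc adds a total line, and the total is the largest number, the last line of the sorted file is the total rather than the longest file. The sketch below shows this on a stand-in file with the made-up name sorted-demo.txt, built from the line counts shown earlier.

```shell
# Stand-in for sorted-lengths.txt (hypothetical file name, counts from above).
printf '9 methane.pdb\n12 ethane.pdb\n15 propane.pdb\n20 cubane.pdb\n21 pentane.pdb\n30 octane.pdb\n107 total\n' > sorted-demo.txt

# The very last line is the summary line that wc added.
tail -n 1 sorted-demo.txt
# 107 total

# The last two lines include the longest file just above the total.
tail -n 2 sorted-demo.txt
# 30 octane.pdb
# 107 total
```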
What Does >> Mean?
We have seen the use of >, but there is a similar operator >> which works slightly differently. By using the echo command to print strings, test the commands below to reveal the difference between the two operators:
$ echo hello > testfile01.txt
$ echo hello >> testfile02.txt
Hint: Try executing each command twice in a row and then examining the output files.
The >> operator appends data to the end of the file, while > overwrites the contents of the file if it already exists. Both create the file if it does not exist.
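The hint can be worked through as the following sketch, which runs each command twice and then inspects the files. The rm -f line is an addition, included only so the result is the same each time the block is run.

```shell
# Start from a clean slate so repeated runs give the same result.
rm -f testfile01.txt testfile02.txt

# > overwrites: after two runs the file still holds a single line.
echo hello > testfile01.txt
echo hello > testfile01.txt
cat testfile01.txt
# hello

# >> appends: after two runs the file holds two lines.
echo hello >> testfile02.txt
echo hello >> testfile02.txt
cat testfile02.txt
# hello
# hello
```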
Command review from this tutorial:
cat
displays the contents of its inputs.
head
displays the first 10 lines of its input. You can adjust the # of lines to display with the -n option.
tail
displays the last 10 lines of its input. You can adjust the # of lines to display with the -n option.
sort
sorts its inputs.
wc
counts lines, words, and characters in its inputs.
command > file
redirects a command’s output to a file (overwriting any existing content).