11 Class 19. R data analysis

Data analysis in R tutorial

*Exercise adapted from Software Carpentry lessons (CC-BY)

Looking at Metadata

We are studying a population of Escherichia coli (designated Ara-3) that was propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate, which E. coli cannot metabolize under the aerobic conditions of the experiment. Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared at around 31,000 generations. The metadata file describes the Ara-3 clones; its columns are:

  • sample – clone name
  • generation – generation when the sample was frozen
  • clade – clade based on a parsimony-based tree
  • strain – ancestral strain
  • cit – citrate-using mutant status
  • run – Sequence Read Archive sample ID
  • genome_size – genome size in Mbp (made-up data for this lesson)

In RStudio, click Import Dataset –> From Text (base) in the top-right corner and import the metadata table I gave you in today’s folder.

Let’s check the top (the first 6 lines) of this data.frame using the function head():

head(metadata)

We’ve just done two very useful things: (1) we’ve read our data into R, so now we can work with it in R, and (2) we’ve created a data frame (with the read.csv command), the standard way R works with data.

What are data frames?

A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The str() function is useful to inspect the data types of the columns.

A data.frame can be created by the functions read.csv() or read.table() to import spreadsheets from your hard drive (or the web).
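For example, you could also import the file directly from the console instead of using the menu. This is only a sketch: the file name Ecoli_metadata.csv is an assumption, so adjust it to match the file in today’s folder.

metadata <- read.csv("Ecoli_metadata.csv")   # hypothetical file name; edit to match your download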

By default, data.frame converts (= coerces) columns that contain characters (i.e., text) into the factor data type. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, read.csv() and read.table() have an argument called stringsAsFactors which can be set to FALSE:
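For example, reusing the hypothetical file name from above:

metadata <- read.csv("Ecoli_metadata.csv", stringsAsFactors = FALSE)   # keep text columns as characters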

Let’s now check the structure of this data.frame in more detail with the function str():

str(metadata)

Inspecting data.frame objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.

  • Size:
    • dim() – returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow() – returns the number of rows
    • ncol() – returns the number of columns
  • Content:
    • head() – shows the first 6 rows
    • tail() – shows the last 6 rows
  • Names:
    • names() – returns the column names (synonym of colnames() for data.frame objects)
    • rownames() – returns the row names
  • Summary:
    • str() – structure of the object and information about the class, length and content of each column
    • summary() – summary statistics for each column

Note: most of these functions are “generic”; they can be used on other types of objects besides data.frame.
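For instance, applying a few of these to the metadata we just imported (a quick sketch; your output depends on the file you loaded):

dim(metadata)      # number of rows and number of columns
nrow(metadata)     # number of rows only
names(metadata)    # the column names
summary(metadata)  # summary statistics for each column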

Factors

Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

In the data frame we just imported, let’s do

str(metadata)

## 'data.frame':    30 obs. of  7 variables:
##  $ sample     : Factor w/ 30 levels "CZB152","CZB154",..: 7 6 18 19 20 21 22 23 24 25 ...
##  $ generation : int  0 2000 5000 10000 15000 20000 20000 20000 25000 25000 ...
##  $ clade      : Factor w/ 7 levels "(C1,C2)","C1",..: NA 7 7 6 6 1 1 1 2 4 ...
##  $ strain     : Factor w/ 1 level "REL606": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cit        : Factor w/ 3 levels "minus","plus",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ run        : Factor w/ 30 levels "","SRR097977",..: 1 5 22 23 24 25 26 27 28 29 ...
##  $ genome_size: num  4.62 4.63 4.6 4.59 4.66 4.63 4.62 4.61 4.65 4.59 ...

We can see the names of all the columns, and we see that some say things like Factor w/ 30 levels.

When we read in a file, any column that contains text is automatically assumed to be a factor. Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order.

For instance, we see that cit is a Factor w/ 3 levels, minus, plus and unknown.
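We can inspect this directly (a sketch, assuming the metadata was imported with the default factor behavior):

levels(metadata$cit)   # the pre-defined set of values, sorted alphabetically
table(metadata$cit)    # how many samples fall into each level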

Indexing and sequences

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets, just as we do in math.

R indexes start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
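As a quick illustration with a made-up vector (the values below are just for demonstration):

generations <- c(0, 2000, 5000, 10000)   # a small example vector
generations[1]      # the first element: 0
generations[2:4]    # the second through fourth elements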

Our metadata data frame has rows and columns (it has 2 dimensions); if we want to extract some specific data from it, we need to specify the “coordinates” we want. Row numbers come first, followed by column numbers (i.e., [row, column]).

metadata[1, 2]   # first element in the 2nd column of the data frame

metadata[1, 6]   # first element in the 6th column

metadata[1:3, 7] # first three elements in the 7th column

metadata[3, ]    # the 3rd element for all columns

metadata[, 7]    # the entire 7th column

head_meta <- metadata[1:6, ] # metadata[1:6, ] is equivalent to head(metadata)

For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Is genome_size in column 5 or 7? Oh, right… it is in column 7.) In some cases, the column position of a variable can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and understand.

You can do operations on a particular column by selecting it using the $ sign. In this case, the entire column is a vector. You can use names(metadata) or colnames(metadata) to remind yourself of the column names. For instance, to extract all the strain information from our datasets, type:

metadata$strain

In some cases, you may want to select more than one column. You can do this using the square brackets. Suppose we wanted strain and clade information:

metadata[, c("strain", "clade")]

You can even access columns by column name and select specific rows of interest. For example, if we wanted the strain and clade of just rows 4 through 7, we could do:

metadata[4:7, c("strain", "clade")]

Using the dplyr package for data analysis

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated operations. dplyr is a package for making data manipulation easier! Packages in R are basically sets of additional functions that let you do more stuff in R. The functions we’ve been using, like str(), come built into R; packages give you access to more functions.

You already installed the tidyverse package, which contains dplyr. Now we just need to load it to be able to use it.

library("tidyverse")          ## load

You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.
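For reference, the two steps look like this (you have already done the install step, so only the library() call is needed now):

install.packages("tidyverse")   # once per computer
library(tidyverse)              # every new R session in which you want to use it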

What is dplyr?

The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.

This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.
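As a very small illustration of that idea (a sketch, assuming the DBI, RSQLite, and dbplyr packages are installed; here the “database” is just a throwaway in-memory SQLite copy of our metadata):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # connect to an in-memory database
copy_to(con, metadata, "metadata")                # pretend this table lives in the database
metadata_db <- tbl(con, "metadata")               # a reference to the remote table, not the data itself
collect(filter(metadata_db, cit == "plus"))       # run the query in the database, return only the result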

Selecting columns and filtering rows

We’re going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarize(). To select columns of a data frame, use select(). The first argument to this function is the data frame (metadata), and the subsequent arguments are the columns to keep.

select(metadata, sample, clade, cit, genome_size)

To choose rows, use filter():

filter(metadata, cit == "plus")

Pipes

But what if you wanted to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes. With intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e., one function inside of another). This is handy, but can be difficult to read if too many functions are nested, because the expression has to be read from the inside out. The last option, pipes, is a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set. Pipes in R look like %>% and are made available via the magrittr package, which is installed as part of dplyr.

metadata %>% filter(cit == "plus") %>% select(sample, generation, clade)

In the above we use the pipe to send the metadata data set first through filter(), to keep rows where cit is equal to "plus", and then through select(), to keep only the sample, generation, and clade columns. When the data frame is being passed to the filter() and select() functions through a pipe, we don’t need to include it as an argument to these functions anymore.

If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:

meta_citplus <- metadata %>% filter(cit == "plus") %>% select(sample, generation, clade)
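The remaining dplyr verbs from the list above (mutate(), group_by(), and summarize()) chain together in the same way. Here is a minimal sketch using the metadata columns described earlier; the new column name genome_size_bp is made up for illustration.

metadata %>%
  mutate(genome_size_bp = genome_size * 1e6) %>%             # add a new column, converting Mbp to bp
  group_by(cit) %>%                                          # group rows by citrate-using status
  summarize(mean_size = mean(genome_size, na.rm = TRUE))     # mean genome size within each group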

License


BIOL446/BIOL546 Bioinformatics Coding Guides Copyright © by emilymeredith is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
