12 Class 20. Data analysis in dplyr: mutate, summarize, group_by

Last time you learned how to use the dplyr package to select data, filter data, and use pipes to link these commands together!

Today we’ll learn how to create new columns by performing calculations on existing data using an incredibly useful dplyr command called mutate(). We’ll also summarize grouped data with group_by() and summarize(), and sort rows with arrange().

Content below adapted from the Software Carpentries.

Mutate to add new columns of calculations

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use mutate().

To create a new column of genome size in bp:

metadata %>%
  mutate(genome_bp = genome_size * 1e6)

If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data (pipes work with non-dplyr functions too, as long as the dplyr or magrittr packages are loaded).

metadata %>%
  mutate(genome_bp = genome_size * 1e6) %>%
  head()

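As a self-contained sketch of the same idea, the snippet below builds a small stand-in table so it can be run without the course data. The name toy_metadata and its values are invented for illustration and are not part of the metadata table. It also shows that mutate() returns a new table rather than changing the original, so the result has to be assigned if you want to keep it.

```r
library(dplyr)

# Hypothetical stand-in for the metadata table (values invented for illustration)
toy_metadata <- data.frame(
  sample      = c("A", "B", "C"),
  genome_size = c(4.62, 4.63, 4.70)   # in Mbp
)

# mutate() adds genome_bp and keeps the existing columns
toy_result <- toy_metadata %>%
  mutate(genome_bp = genome_size * 1e6)

# Show only the first two rows
head(toy_result, 2)

# The original table is unchanged: it still has only two columns
names(toy_metadata)
```
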
Write the command to create a new column called genome_ratio in the metadata table that calculates the ratio of each sample’s genome size to that of the parental E. coli in row 1.

metadata %>%
  mutate(genome_ratio = genome_size / 4.62)

OR

metadata %>%
  mutate(genome_ratio = genome_size / metadata[1, 7])

You can run logical (TRUE/FALSE) tests on your data and store the results in new columns. These columns can then be used to classify and group our samples.

Let’s categorize the samples into big vs. small genomes, using the median genome size as the cutoff:

metadata %>%
  mutate(genome_big = genome_size > median(genome_size))

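If you would rather have readable labels than TRUE/FALSE, one possible approach is to follow the logical test with dplyr’s if_else(). This is a sketch on an invented table (toy_metadata and the labels "big" and "small" are assumptions for illustration), not a step from the original lesson.

```r
library(dplyr)

# Hypothetical stand-in for the metadata table (values invented for illustration)
toy_metadata <- data.frame(
  sample      = c("A", "B", "C", "D"),
  genome_size = c(4.60, 4.62, 4.70, 4.75)
)

toy_classified <- toy_metadata %>%
  mutate(
    # TRUE when the genome is larger than the median of all genome sizes
    genome_big   = genome_size > median(genome_size),
    # Columns created earlier in the same mutate() call can be reused
    genome_class = if_else(genome_big, "big", "small")
  )

toy_classified
```
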
Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function, which splits the data into groups. When the data is grouped in this way, summarize() can be used to collapse each group into a single-row summary. summarize() does this by applying an aggregating or summary function to each group. For example, if we wanted to group by citrate-using mutant status and find the number of rows of data for each status, we would do:

metadata %>%
  group_by(cit) %>%
  summarize(n())

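To make the three steps of split-apply-combine concrete, the sketch below performs them by hand in base R on an invented table and then repeats the count with group_by() and summarize(). The table toy_metadata and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for the metadata table (values invented for illustration)
toy_metadata <- data.frame(
  cit         = c("plus", "minus", "plus", "minus", "plus"),
  genome_size = c(4.6, 4.7, 4.8, 4.5, 4.7)
)

# Split: one vector of genome sizes per cit value
pieces <- split(toy_metadata$genome_size, toy_metadata$cit)

# Apply: count the values in each piece
counts <- sapply(pieces, length)

# Combine: put the per-group results back into one table
by_hand <- data.frame(cit = names(counts), n = as.integer(counts))

# The same count from group_by() + summarize()
with_dplyr <- toy_metadata %>%
  group_by(cit) %>%
  summarize(n = n())

by_hand
with_dplyr
```

Both tables should report the same number of rows per cit value; the dplyr version simply does the splitting and recombining for you.
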
Here the summary function used was n() to find the count for each group. We can also apply many other functions to individual columns to get other summary statistics. For example, base R provides built-in functions like mean(), median(), min(), and max(). By default, these summary functions return NA when the vector they operate on contains missing data. It’s a way to make sure that users know they have missing data and make a conscious decision about how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm = TRUE (rm stands for remove).

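A quick way to see this behaviour is to try it on a short vector. The values below are invented for illustration.

```r
# Invented values with one missing entry
sizes <- c(4.5, NA, 5.5)

# The missing value propagates, so the result is NA
mean(sizes)

# With na.rm = TRUE the NA is dropped before averaging, giving 5
mean(sizes, na.rm = TRUE)

# Counting the missing values helps you decide how to handle them
sum(is.na(sizes))
```
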
So to view mean genome_size by mutant status:

metadata %>%
  group_by(cit) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE))

You can group by multiple columns too:

metadata %>%
  group_by(cit, clade) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE))

It looks like the clade is missing for one of these clones. We can discard those rows using filter():

metadata %>%
  group_by(cit, clade) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE)) %>%
  filter(!is.na(clade))

All of a sudden this isn’t running off the screen anymore. That’s because dplyr has changed our data.frame to a tbl_df (also called a tibble). This is a data structure that’s very similar to a data frame; for our purposes the only difference is that it won’t automatically print tons of data that runs off the screen.

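You can check which kind of table you have with class(). The sketch below uses an invented table (toy_metadata is an assumption for illustration) to show the class before and after group_by() and summarize(), and how as.data.frame() converts back if you prefer the base R printing behaviour.

```r
library(dplyr)

# Hypothetical stand-in for the metadata table (values invented for illustration)
toy_metadata <- data.frame(
  cit         = c("plus", "minus", "plus"),
  genome_size = c(4.6, 4.7, 4.8)
)

# A plain data frame
class(toy_metadata)

toy_summary <- toy_metadata %>%
  group_by(cit) %>%
  summarize(mean_size = mean(genome_size))

# The summarized result is a tbl_df
class(toy_summary)

# Convert back to a plain data frame if needed
class(as.data.frame(toy_summary))
```
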
You can also summarize multiple variables at the same time:

metadata %>%
  group_by(cit, clade) %>%
  summarize(mean_size = mean(genome_size, na.rm = TRUE),
            min_generation = min(generation))

Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

arrange(metadata, genome_size)

Use desc() to re-order by a column in descending order:

arrange(metadata, desc(generation))

Arrange by genome_size and break ties with a second column, cit:

arrange(metadata, genome_size, cit)

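To see tie-breaking in action, here is a sketch on an invented table in which two pairs of samples share the same genome size. The table toy_metadata and its values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the metadata table (values invented for illustration)
toy_metadata <- data.frame(
  sample      = c("A", "B", "C", "D"),
  genome_size = c(4.7, 4.6, 4.7, 4.6),
  cit         = c("plus", "unknown", "minus", "minus")
)

# Sort by genome_size; rows with equal genome_size are ordered by cit
sorted <- arrange(toy_metadata, genome_size, cit)
sorted$sample   # "D" "B" "C" "A"

# Largest genomes first, ties still broken by cit in ascending order
sorted_desc <- arrange(toy_metadata, desc(genome_size), cit)
sorted_desc$sample   # "C" "A" "D" "B"
```
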
License

BIOL446/BIOL546 Bioinformatics Coding Guides Copyright © by emilymeredith is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
