13 Class 23. Plotting in R

Basic plots in R

The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers”, and the best way to develop insight is often to visualize data.

When we are working with large sets of numbers, it can be useful to display that information graphically. R has a number of built-in tools for basic graph types such as histograms, scatterplots, bar charts, boxplots, and many more.

Scatterplot

Let’s start with a scatterplot. A scatterplot provides a graphical view of the relationship between two sets of numbers. Let’s see whether there is a relationship between generation and genome_size.

plot(metadata$generation, metadata$genome_size)

Each point represents a clone: the value on the x-axis is its generation, and the value on the y-axis is its genome size. You can customize many features of any plot (fonts, colors, axes, titles) through graphical options. For example, we can change the shape of the data points using pch.

plot(metadata$generation, metadata$genome_size, pch=8)

We can add a title to the plot by assigning a string to main:

plot(metadata$generation, metadata$genome_size, pch=8, main="Genome size versus generation")
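Several graphical options can be combined in a single call. The sketch below assumes a small made-up data frame, toy_metadata, standing in for metadata (its values are invented for illustration); xlab, ylab, and col are standard base graphics arguments.

```r
# Hypothetical stand-in for `metadata`; the values are invented for illustration.
toy_metadata <- data.frame(
  generation  = c(0, 2000, 5000, 10000, 20000, 30000),
  genome_size = c(4.62, 4.63, 4.60, 4.59, 4.61, 4.62)
)

# Combine point shape, colour, title and axis labels in one call.
plot(toy_metadata$generation, toy_metadata$genome_size,
     pch  = 8,                      # star-shaped points
     col  = "darkblue",             # point colour
     main = "Genome size versus generation",
     xlab = "Generation",
     ylab = "Genome size")
```

Without xlab and ylab, R labels the axes with the expressions passed to plot(), which is why setting them explicitly usually reads better.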

Histogram

To visualize the distribution of genome sizes, we can use a histogram, which is created with the hist function.

hist(metadata$genome_size)
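hist() also returns the binning it computed, which can help when deciding how many bins to use. A minimal sketch, using invented genome sizes in place of metadata$genome_size; breaks is a standard hist() argument, and a single number is treated only as a suggestion.

```r
# Invented genome sizes standing in for metadata$genome_size.
genome_size <- c(4.62, 4.63, 4.60, 4.59, 4.61, 4.62, 4.75, 4.74, 4.60, 4.61)

# `breaks` suggests a number of bins; R may adjust it to get "pretty" cut points.
h <- hist(genome_size, breaks = 5,
          main = "Distribution of genome size", xlab = "Genome size")

# The returned object holds the bin edges and the count in each bin.
h$breaks
h$counts
sum(h$counts) == length(genome_size)  # every value falls in exactly one bin
```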

Boxplot

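Using additional information from our metadata, we can compare values across the different citrate mutant statuses using a boxplot. A boxplot provides a graphical view of the median and quartiles of a data set, along with its minimum and maximum; by default, R draws values more than 1.5 times the interquartile range beyond the box as individual points rather than extending the whiskers to them.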

# Boxplot

boxplot(genome_size ~ cit, metadata)

 

Similar to the scatterplots above, we can pass in arguments to add in extras like plot title, axis labels and colors.

boxplot(genome_size ~ cit, metadata, col=c("pink", "purple", "darkgrey"),

        main="Genome size across different cell-types", ylab="Genome size")
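The numbers behind a boxplot can be inspected directly, because boxplot() invisibly returns the summary it drew. The sketch below uses an invented data frame, toy_metadata, in place of metadata; the cit labels "plus", "minus", and "unknown" are assumptions made for illustration.

```r
# Invented data standing in for `metadata`; `cit` mimics the citrate status column.
toy_metadata <- data.frame(
  cit = c("plus", "plus", "plus", "minus", "minus", "minus",
          "unknown", "unknown", "unknown"),
  genome_size = c(4.60, 4.62, 4.63, 4.59, 4.61, 4.75, 4.60, 4.61, 4.62)
)

# boxplot() draws the plot and invisibly returns the summary it used.
b <- boxplot(genome_size ~ cit, data = toy_metadata,
             main = "Genome size by citrate status", ylab = "Genome size")

b$names  # group labels, in plotting order
b$stats  # one column per group: lower whisker, lower hinge, median,
         # upper hinge, upper whisker
b$out    # values drawn as individual points beyond the whiskers
```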

GGPLOT2

Many R users have moved away from base graphic options and towards a plotting package called ggplot2 that adds a lot of functionality to basic plots. The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the basic plots you created, but using ggplot functions to get a feel for the syntax.

ggplot is best used on data in the data.frame form, so we will work with metadata for the following figures. Let’s start by loading the ggplot2 library.

library(ggplot2)

The ggplot() function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot and add them together using the + operator.

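We will start with just the ggplot() call. Recent versions of ggplot2 draw an empty panel at this point, while older versions stopped with an error about missing layers; either way, no data is plotted until you add layers.

ggplot(metadata) # no layers yet, so no data is drawn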

Geometric objects are the actual marks we put on a plot. Examples include:

  • points (geom_point, for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for boxplots!)

A plot needs at least one geom to display data; there is no upper limit. You can add a geom to a plot using the + operator. Each type of geom usually has a required set of aesthetics and usually accepts only a subset of all aesthetics; refer to the geom help pages to see which mappings each geom accepts. Aesthetic mappings are set with the aes() function. Aesthetic examples include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • linetype
  • size

To start, we will add positions for the x- and y-axes, since geom_point requires mappings for x and y; all others are optional.

ggplot(metadata) + geom_point(aes(x = sample, y= genome_size))

The labels on the x-axis are hard to read. To fix this, we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:

  • Axis labels
  • Plot background
  • Facet label background
  • Legend appearance

There are built-in themes we can use, or we can adjust specific elements. For our figure we will plot the x-axis labels at a 45-degree angle with a small horizontal shift to avoid overlap. We will also add some additional aesthetics by mapping them to other variables in our data frame. For example, the color of the points will reflect the number of generations and the shape will reflect citrate mutant status. The size of the points can be adjusted within geom_point but does not need to be included in aes(), since the value is not mapped to a variable.

#here is an example of how to adjust your graph

ggplot(metadata) +

geom_point(aes(x = sample, y= genome_size, color = generation, shape = cit), size = rel(3.0)) +

theme(axis.text.x = element_text(angle=45, hjust=1))
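The difference between mapping an aesthetic to a variable and setting it to a fixed value can be seen side by side. This is a minimal sketch that assumes ggplot2 is installed and uses an invented data frame, toy_metadata, in place of metadata.

```r
library(ggplot2)

# Invented data standing in for `metadata`.
toy_metadata <- data.frame(
  sample      = c("S1", "S2", "S3", "S4"),
  generation  = c(0, 5000, 10000, 20000),
  cit         = c("minus", "minus", "plus", "unknown"),
  genome_size = c(4.62, 4.60, 4.59, 4.75)
)

# Mapped: colour varies with a column, so it goes inside aes() and gets a legend.
p_mapped <- ggplot(toy_metadata) +
  geom_point(aes(x = sample, y = genome_size, color = cit), size = 3)

# Set: one fixed colour for every point goes outside aes(); no legend is drawn.
p_set <- ggplot(toy_metadata) +
  geom_point(aes(x = sample, y = genome_size), color = "purple", size = 3)

p_mapped
p_set
```

A common mistake is to write color = "purple" inside aes(): ggplot2 then treats "purple" as a data value and picks a colour from its default palette instead.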

How to build a histogram in ggplot2

ggplot(metadata) +

geom_histogram(aes(x = genome_size))

By default, geom_histogram() groups the data into 30 bins. Compare the plot above with one where you set the binwidth yourself.

ggplot(metadata) +

geom_histogram(aes(x = genome_size), binwidth=0.05)

Frequency distributions

Frequency polygons (geom_freqpoly()) display the counts with lines instead of bars. They are more suitable when you want to compare the distribution across the levels of a categorical variable. Let’s compare the genome sizes of cit+, cit-, and unknown cells:

ggplot(metadata) +

  geom_freqpoly(aes(x = genome_size,colour=cit), binwidth = 0.005)

Sometimes separating your data into multiple graphs makes comparisons easier. One of the most useful aspects of ggplot2 is how it allows you to separate data by a variable using the facet_wrap() function.

# let’s build a graph to evaluate genome sizes across cit+/cit- cells

ggplot(metadata) +

  facet_wrap(~cit) +

  geom_histogram(aes(x=genome_size, colour=cit, fill=cit)) +

  theme(axis.text.x = element_text(angle=45, hjust=1))

 

ggplot(metadata) +

  facet_wrap(~clade) +

  geom_point(aes(x= generation, y=genome_size, colour=cit, fill=cit))
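The panel layout produced by facet_wrap() can be controlled as well. The sketch below assumes ggplot2 is installed and uses invented values in place of metadata; ncol is a standard facet_wrap() argument.

```r
library(ggplot2)

# Invented data standing in for `metadata`.
toy_metadata <- data.frame(
  cit = rep(c("minus", "plus", "unknown"), each = 4),
  genome_size = c(4.59, 4.60, 4.61, 4.62,
                  4.60, 4.62, 4.63, 4.75,
                  4.59, 4.61, 4.62, 4.74)
)

# One panel per value of `cit`, stacked in a single column so that the
# shared x-axis lines up across panels.
p <- ggplot(toy_metadata) +
  geom_histogram(aes(x = genome_size, fill = cit), binwidth = 0.05) +
  facet_wrap(~cit, ncol = 1)

p
```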

Boxplot

Now that we have all the required information, let’s try plotting a boxplot similar to what we had done using the base plot functions. We can add some additional layers to include a plot title and change the axis labels. Explore the code below and notice the different layers that we have added to understand what each layer contributes to the final graphic.

ggplot(metadata) +

geom_boxplot(aes(x = cit, y = genome_size, fill = cit)) +

ggtitle('Boxplot of genome size by citrate mutant type') +

xlab('citrate mutant') +

ylab('genome size') +

theme(panel.grid.major = element_line(size = .5, color = "grey"),

axis.text.x = element_text(angle=45, hjust=1),

axis.title = element_text(size = rel(1.5)),

axis.text = element_text(size = rel(1.25)))

How to make a column graph in ggplot

ggplot(metadata) +

geom_col(aes(x= sample, y= genome_size, fill=cit)) +

theme(axis.text.x = element_text(angle=45, vjust = 0.5)) +

ggtitle('Citrate status vs genome size') +

xlab('citrate mutant status') +

ylab('genome size in Mb')

Writing figures to file

There are two ways to output figures and plots to a file (rather than simply displaying them on screen). The first (and easiest) is to export directly from the RStudio ‘Plots’ panel by clicking on Export once the image is plotted. This gives you the option of png or pdf and lets you select the directory in which to save the file.

The second option is to use R functions in the console, which gives you the flexibility to specify the size and resolution of the output image. Some of the more popular functions are pdf() and png(). First, initialize a plot that will be written directly to a file using pdf(), png(), etc. Within the function you need to specify a name for your image and, optionally, the width and height. Then create a plot using the usual functions in R. Finally, close the file using the dev.off() function. There are also bmp(), tiff(), and jpeg() functions, though the jpeg function has proven less stable than the others.

pdf("figure/boxplot.pdf")

ggplot(example_data) +

geom_boxplot(aes(x = cit, y = …)) +

ggtitle(…) +

xlab(…) +

ylab(…) +

theme(panel.grid.major = element_line(…),

axis.text.x = element_text(…),

axis.title = element_text(…),

axis.text = element_text(…))

dev.off()
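The open, draw, close sequence can be tried end to end with a small runnable sketch. It writes to a temporary file rather than the figure/ directory used above, and the plotted values are invented for illustration.

```r
# Write to a temporary file so the sketch runs anywhere; in practice use your own path.
out_file <- tempfile(fileext = ".pdf")

# 1. Open the device; for pdf(), width and height are given in inches.
pdf(out_file, width = 6, height = 4)

# 2. Draw the plot as usual (invented values). Inside a function, a loop, or a
#    script run with source(), a ggplot object must be wrapped in print() to
#    be drawn on the device.
plot(c(0, 5000, 10000, 20000), c(4.62, 4.60, 4.59, 4.75),
     pch = 8, xlab = "Generation", ylab = "Genome size")

# 3. Close the device; the file is not complete until this is called.
dev.off()

file.exists(out_file)
```

If a saved figure turns out empty or cannot be opened, a missing dev.off() is a likely cause, since the device only finishes writing the file when it is closed.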

All graphing content is from Software Carpentry, CC-BY.

License

BIOL446/BIOL546 Bioinformatics Coding Guides Copyright © by emilymeredith is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.