Chapter 2: Getting started with ggplot2

2.1 Introduction

2.2 Fuel Economy Data

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
mpg <- mpg

2.3 Key Components

Data: mpg

Aesthetic Mapping: engine size (displ) to x, highway fuel economy (hwy) to y

Layer: points

ggplot(mpg, aes(displ, hwy)) +
  geom_point()
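
The three components listed above can be spelled out with named arguments. This is a minimal sketch that should be equivalent to the shorter call; the object name p is only for illustration.

```r
library(ggplot2)

# Data: the mpg data frame that ships with ggplot2
# Aesthetic mapping: displ -> x, hwy -> y
# Layer: one point per observation
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# The plot is only drawn when the object is printed
print(p)
```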

2.4 Color, Size, Shape and Other Aesthetic Attributes

  • Color and shape work well with categorical variables
  • Size works well for continuous variables
  • If there is too much data for one plot, faceting can help
  • “When using aesthetics in a plot, less is usually more”
  • “Instead of trying to make one very complex plot that shows everything at once, see if you can create a series of simple plots that tell a story, leading the reader from ignorance to knowledge”
ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

ggplot(mpg, aes(displ, hwy, shape = drv)) +
  geom_point()

ggplot(mpg, aes(displ, hwy, size = cyl)) +
  geom_point()

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = "blue"))

ggplot(mpg, aes(displ, hwy)) +
  geom_point(color = "blue")
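
The last two plots above differ only in where the color is given. As a sketch of what seems to be going on: inside aes(), "blue" is treated as data and passed through the color scale, while outside aes() it is applied directly to the points. The names p_mapped and p_set are illustrative, and ggplot_build() is used here only to inspect the colors ggplot2 computed.

```r
library(ggplot2)

# "blue" inside aes(): treated as a one-level variable and scaled
p_mapped <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = "blue"))

# "blue" outside aes(): set as a constant point color
p_set <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(color = "blue")

# Colors actually assigned to the points in each plot
mapped_cols <- unique(ggplot_build(p_mapped)$data[[1]]$colour)
set_cols <- unique(ggplot_build(p_set)$data[[1]]$colour)
mapped_cols  # chosen by the default color scale, not "blue"
set_cols     # the constant that was set
```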

2.5 Faceting

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class)

2.6 Plot Geoms

  • geom_smooth()
    • fits a smoother to the data
  • geom_boxplot()
    • draws a box-and-whisker plot
  • geom_histogram() & geom_freqpoly()
    • show the distribution of continuous variables
  • geom_bar()
    • shows the distribution of categorical variables
  • geom_path() & geom_line()
    • draw lines between data points, usually over time

2.6.1 Smoothers

ggplot(mpg, aes(displ, hwy)) +
  geom_smooth() 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth() 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
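
The message printed above reports the defaults that geom_smooth() picked. As a hedged sketch, the smoother can also be chosen explicitly through the method, formula, and span arguments. The span value below is an arbitrary illustration, not a recommendation.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Loess with a smaller span follows the data more closely (wigglier curve)
p_loess <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.3)

# A straight-line fit from a linear model
p_lm <- base + geom_smooth(method = "lm", formula = y ~ x)

print(p_loess)
print(p_lm)
```

Passing formula explicitly should also silence the informational message.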

2.6.2 Boxplots and Jittered Points

ggplot(mpg, aes(drv, hwy)) + 
  geom_point()

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter()

ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot()

ggplot(mpg, aes(drv, hwy)) + 
  geom_violin()

2.6.3 Histograms and Frequency Polygons

  • These show the distribution of a single numeric variable
  • They provide more information about the distribution of a single group than a boxplot does
  • Pick a different bin width with the binwidth argument

ggplot(mpg, aes(hwy)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(hwy)) +
  geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(displ, color = drv)) +
  geom_freqpoly(binwidth = 0.5)

ggplot(mpg, aes(displ, fill = drv)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~drv, ncol = 1)
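
To see what binwidth changes, the sketch below builds the same histogram with two bin widths and counts the resulting bins. The widths 2.5 and 1 are illustrative choices, and ggplot_build() is used only to look at the binned data.

```r
library(ggplot2)

p_wide <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2.5)
p_narrow <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 1)

# Each row of the built layer data is one bin
bins_wide <- ggplot_build(p_wide)$data[[1]]
bins_narrow <- ggplot_build(p_narrow)$data[[1]]

nrow(bins_wide)    # fewer, wider bins
nrow(bins_narrow)  # more, narrower bins
```

Both versions count every observation exactly once; only the grouping into bins differs.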

2.6.4 Bar Charts

  • There are two types of bar charts
  • The first, below, expects unsummarised data; each observation contributes one unit to the height of its bar
  • The second expects presummarised data
  • NOTE: For presummarised data, tell geom_bar() not to run its default stat, which bins and counts the data
ggplot(mpg, aes(manufacturer)) + 
  geom_bar()

drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2,9.7,5.1)
)
ggplot(drugs, aes(drug, effect)) +
  geom_bar(stat = "identity")

  • geom_point() is often a better choice here because it takes less space and doesn’t require the y axis to include 0
ggplot(drugs, aes(drug, effect)) +
  geom_point()
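
For presummarised data, geom_col() is an alternative to geom_bar(stat = "identity"). The sketch below rebuilds the drugs data from above and checks that both forms produce the same bar heights. The names p_col and p_identity are illustrative.

```r
library(ggplot2)

drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2, 9.7, 5.1)
)

# geom_col() uses the y values as bar heights, with no counting step
p_col <- ggplot(drugs, aes(drug, effect)) +
  geom_col()
p_identity <- ggplot(drugs, aes(drug, effect)) +
  geom_bar(stat = "identity")

# Bar heights computed for each version
heights_col <- ggplot_build(p_col)$data[[1]]$y
heights_identity <- ggplot_build(p_identity)$data[[1]]$y
```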

2.6.5 Time Series with Line and Path Plots

  • Line plots:
    • join points from left to right
    • usually have time on the x axis
  • Path plots:
    • join points in the order they appear in the dataset
    • encode time in the order in which the points are connected

economics <- economics
ggplot(economics, aes(date, unemploy / pop)) + 
  geom_line()

ggplot(economics, aes(date, uempmed)) + 
  geom_line()

Plot the unemployment rate against the length of unemployment, joining the individual observations with a path

ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path() +
  geom_point()

Color the points by year to show the direction of time

year <- function(x) as.POSIXlt(x)$year + 1900
ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path(color = "grey50") +
  geom_point(aes(color = year(date)))
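
The difference between geom_line() and geom_path() is easiest to see on data that is not sorted by x. The sketch below uses a hypothetical three-row data frame (toy, not part of the original notes) and inspects the order in which each geom will connect the points.

```r
library(ggplot2)

# Hypothetical toy data: rows are deliberately not sorted by x
toy <- data.frame(x = c(1, 3, 2), y = c(1, 2, 3))

p_line <- ggplot(toy, aes(x, y)) +
  geom_line()
p_path <- ggplot(toy, aes(x, y)) +
  geom_path()

# Order in which the points are connected
line_order <- ggplot_build(p_line)$data[[1]]$x  # sorted by x
path_order <- ggplot_build(p_path)$data[[1]]$x  # row order of the data
```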

2.7 Modifying the Axes

Modifying the labels

xlab("x label")

ylab("y label")

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 1/3) 

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 1/3) + 
  xlab("city driving (mpg)") +
  ylab("highway driving (mpg)")

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 1/3) + 
  xlab(NULL) +
  ylab(NULL)

Modifying the limits of the axes

xlim(lower, upper)

ylim(lower, upper)

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter(width = 0.25) 

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter(width = 0.25) +
  xlim("f", "r") +
  ylim(20, 30)
## Warning: Removed 137 rows containing missing values (geom_point).

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter(width = 0.25, na.rm = TRUE) +
  ylim(NA, 30)

Values outside the limits are set to NA, which is what triggers the warning above. You can suppress it with na.rm = TRUE.
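
To check which observations the limits discard, a minimal sketch follows. It uses geom_point() instead of geom_jitter() so the result is deterministic, and compares the y values that survive in the built plot with a direct count on the data. The variable names are illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(drv, hwy)) +
  geom_point(na.rm = TRUE) +
  ylim(20, 30)

# y values that remain after the limits are applied
built_y <- ggplot_build(p)$data[[1]]$y
kept <- sum(!is.na(built_y))

# Observations whose hwy lies inside the limits
inside <- sum(mpg$hwy >= 20 & mpg$hwy <= 30)
```

The two counts should agree, which suggests the limits remove data rather than just zooming the view.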

2.8 Outputs

  • To manipulate the plot, save it as an object
  • Then render it with print()
  • NOTE: Inside a loop, print(p) must be called explicitly
  • Save it to disk with ggsave()
  • Summarize its structure with summary()
  • Save a cached copy with saveRDS() and read it back with readRDS()
p <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) + 
  geom_point()

print(p)

ggsave('plot.png', plot = p, width = 5, height = 5)
summary(p)
## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy,
##   fl, class [234x11]
## mapping:  x = ~displ, y = ~hwy, colour = ~factor(cyl)
## faceting: <ggproto object: Class FacetNull, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetNull, Facet, gg>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
saveRDS(p, 'plot.rds')
q <- readRDS('plot.rds')
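
The note about loops can be sketched as follows: one plot per drive type is written to a multi-page PDF, and each plot has to be printed explicitly. The temporary file path and the per-drv split are assumptions made for the example.

```r
library(ggplot2)

out <- tempfile(fileext = ".pdf")  # hypothetical output location
pdf(out, width = 5, height = 5)

for (d in unique(mpg$drv)) {
  p <- ggplot(mpg[mpg$drv == d, ], aes(displ, hwy)) +
    geom_point() +
    ggtitle(paste("drv =", d))
  print(p)  # without print(), nothing is drawn inside the loop
}

invisible(dev.off())
file.exists(out)
```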

2.9 Quick Plots

  • qplot() lets you define a plot in a single call, picking a geom by default if you don’t supply one
qplot(displ, hwy, data = mpg)

qplot(displ, data = mpg)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • If you give qplot() x and y variables, you get a scatter plot
  • If you give just an x, it will create a histogram or bar chart
  • qplot() assumes by default that every aesthetic value is a variable to be scaled
  • To set an aesthetic to a constant, use I():
qplot(displ, hwy, data = mpg, color = "blue")

qplot(displ, hwy, data = mpg, color = I("blue"))
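
For comparison, the qplot() calls above can be written with ggplot(). This is a rough translation rather than a guaranteed one-to-one equivalent, and the object names are illustrative. To the best of my knowledge, qplot() has been deprecated since ggplot2 3.4.0, so the longer form may be the safer habit.

```r
library(ggplot2)

# qplot(displ, hwy, data = mpg)
p_scatter <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# qplot(displ, data = mpg)
p_hist <- ggplot(mpg, aes(displ)) +
  geom_histogram(bins = 30)

# qplot(displ, hwy, data = mpg, color = I("blue"))
p_blue <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(color = "blue")
```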

WHEEE!!! THE END!!!!!!