Table of contents

  1. magrittr: Pipes in R
  2. dplyr: Easy data frame manipulation
  3. tidyr: From wide to long data (replacing reshape2)
  4. Reading and storing big data
  5. Other packages from the Hadleyverse
  6. Sources and more blog posts

magrittr: Pipes in R

magrittr introduces the pipe operator %>%, which resembles the UNIX-style pipe operator |, e.g. in grep ERROR logfile.txt | head.

With it, you get rid of the annoying inside-out wrapping of functions, e.g.

library(babynames)  # provides the babynames data set
library(dplyr)
library(ggplot2)    # provides qplot() and ggtitle()
 
mydata <- summarize(group_by(filter(babynames, substr(babynames$name, 1, 3) == "Ste"), year, sex), total = sum(n))
 
qplot(year, total, color = sex, data = mydata, geom = "line") + ggtitle('Names starting with "Ste"')

becomes (with library(magrittr) loaded; equals() and add() are magrittr's aliases for == and +)

babynames %>%
    filter(name %>% substr(1, 3) %>% equals("Ste")) %>%
    group_by(year, sex) %>%
    summarize(total = sum(n)) %>%
    qplot(year, total, color = sex, data = ., geom = "line") %>%
    add(ggtitle('Names starting with "Ste"')) %>%
    print

The current development version on GitHub introduces several other operators (a short sketch follows this list):

  • %T>% is a tee operator, comparable to UNIX's tee command. It returns the left-hand side after applying the right-hand side.
  • %$% exposes the data frame on the left to the expressions on the right (so you can omit the dataset$ in front of 1000 variables)
  • %<>% works like %>%, but instead of returning the result of the whole chain, it assigns the result back to the symbol on the left-hand side.
  • %,% could later be the same thing for functionals, i.e. to build functions out of pipe commands
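
A minimal sketch of the first three operators (they have since made it into the released magrittr package); the example data is just for illustration:

library(magrittr)

# %T>% returns the left-hand side after running the right-hand side,
# useful for side effects such as plotting or printing:
rnorm(100) %T>%
  hist() %>%    # hist() is called only for its side effect
  mean()        # the original vector continues down the pipe

# %$% exposes the columns of the left-hand data frame by name:
mtcars %$% cor(mpg, wt)

# %<>% pipes and then overwrites the left-hand symbol with the result:
x <- c(3, 1, 2)
x %<>% sort()   # x is now c(1, 2, 3)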

dplyr: Easy data frame manipulation

dplyr is a faster, more consistent successor to plyr, but it focuses only on data frames (it can handle data.tables too). plyr included functions like ddply, daply, etc.

The most important functions here are:

  • group_by
  • summarise
  • mutate
  • filter
  • select
  • arrange

All functions behave similarly (first argument is data frame, result is data frame), so the magrittr pipe is perfect for chaining these commands.

Further functions not covered here: the joins, e.g. left_join(), which is Hadley's version of merge().
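
For example, a join on the nycflights13 data (the column name carrier is an assumption based on that package):

library(nycflights13)  # provides the flights and airlines tables
library(dplyr)

left_join(flights, airlines, by = "carrier")  # adds the airline name to every flight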

filter()

Select a subset of the rows (the flights data set used in the following examples comes from the nycflights13 package). These two lines are equivalent:

filter(flights, month==1, day==1)
flights[flights$month==1 & flights$day==1, ]

It works similarly to subset(), but the conditions are joined by & automatically.

slice()

Select rows by position:

slice(flights, 1:10)

arrange()

Reorder rows instead of selecting them:

arrange(flights, year, month, day)

Use desc(year) to sort descending.
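
For example, to show the longest departure delays first:

arrange(flights, desc(dep_delay))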

select() and rename()

Select columns. Awesome: Specify ranges and/or exclusions by name, not number:

select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))

See ?select for details. You can use helpers like starts_with(), matches() or contains().
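
For example:

select(flights, starts_with("dep"))   # e.g. dep_time, dep_delay
select(flights, contains("delay"))    # e.g. dep_delay, arr_delay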

Rename columns by using named arguments:

select(flights, tail_num=tailnum)

This drops all other columns. If you want to keep them, use rename:

rename(flights, tail_num=tailnum)

distinct()

Extract unique rows only. Similar to, but faster than, base::unique().

distinct(select(flights, tailnum))  # which tailnums appear in this column?

mutate() and transmute()

mutate is similar to base::transform(). It allows you to add new columns to a data frame:

mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

If you want to drop the old variables, use transmute() instead.
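
For example:

transmute(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)   # the result contains only gain and gain_per_hour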

summarise()

Collapses a data frame into a single row:

summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE))

Use any of R’s aggregation functions: min, max, mean, sum, sd, etc. Additionally, dplyr gives you n() for counting, n_distinct() for counting uniques, and first(x), last(x), and nth(x, n).
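
For example:

summarise(flights,
  n_flights = n(),                  # number of rows
  n_planes  = n_distinct(tailnum),  # number of distinct tail numbers
  first_dep = first(dep_time)       # first departure time in the data
)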

sample_n() and sample_frac()

Downsample a data frame to n observations or a specific fraction.

sample_n(flights, 10)
sample_frac(flights, 0.01)

You can use replace = TRUE for bootstrap samples, and the weight argument for weighted sampling.
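
For example (weighting by distance is just an illustration):

sample_n(flights, 10, replace = TRUE)      # sample with replacement, e.g. for bootstrapping
sample_n(flights, 10, weight = distance)   # weighted sampling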

group_by()

This makes the above verbs very powerful. group_by() returns the same data.frame, but with group attributes. The other functions (most notably summarise()) now work separately on each subgroup:

iris %>%
  sample_n(20) %>%
  group_by(Species) %>%
  summarise(m.p.l=mean(Petal.Length))
## Source: local data frame [3 x 2]
## 
##      Species m.p.l
##       (fctr) (dbl)
## 1     setosa  1.32
## 2 versicolor  4.20
## 3  virginica  5.60

The verbs are affected by grouping as follows:

  • grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
  • grouped arrange() orders first by grouping variables
  • mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x), and are described in detail in vignette("window-functions"); see the sketch after this list.
  • sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
  • slice() extracts rows within each group.
  • summarise() is easy to understand and very useful, and is described in more detail above.
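
For example, a grouped filter() with a window function, keeping the two most delayed arrivals per destination (a sketch along the lines of the dplyr vignettes):

flights %>%
  group_by(dest) %>%
  filter(min_rank(desc(arr_delay)) <= 2)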

tidyr: From wide to long data (replacing reshape2)

A newer, better version of reshape2. Integrates with dplyr.

You use gather() instead of melt(), and spread() instead of cast().

You also have separate() and unite() for splitting/combining columns if you have or want things like “male.control” and “female.treatment”.

An example for gather() (I mostly only use this function):

library(tidyr)

gather(iris, key = "property", value = "inch", -Species) %>%
  head()
##   Species     property inch
## 1  setosa Sepal.Length  5.1
## 2  setosa Sepal.Length  4.9
## 3  setosa Sepal.Length  4.7
## 4  setosa Sepal.Length  4.6
## 5  setosa Sepal.Length  5.0
## 6  setosa Sepal.Length  5.4

So you provide (or pipe in) the data frame; with key you specify the name of the new column that holds the former column names (here, property); with value you specify the name of the new column that holds the measured values (here, inch); afterwards, you supply (unquoted) a comma-separated list of all columns to be gathered, or a list of the ID columns to keep, each prepended with a minus sign (here, -Species).
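
spread() goes the other way: it spreads a key column and a value column out into one column per key. A small made-up example (spreading the gathered iris data back would additionally require a row identifier):

library(tidyr)

df <- data.frame(name     = c("a", "a", "b", "b"),
                 property = c("height", "weight", "height", "weight"),
                 value    = c(1.80, 75, 1.60, 60))
spread(df, key = property, value = value)   # one row per name, columns height and weight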

Reading and storing big data

  • The data.table package implements a class that inherits from data.frame (i.e. it stays compatible) and speeds up many operations while reducing memory usage and the amount of implicit copying.
  • Use fread() from the data.table package instead of read.csv(), so that CSV import takes only about 2% of the time (see the sketch after this list).
  • library(readr) is a Hadley package that provides simplified functions for reading data as well, e.g. read_csv().
  • Use the rhdf5 package to store big data sets in HDF5, a compressed but easily sliceable format. This lets you extract only a rectangular slice from your data set if the whole thing doesn't fit into memory or would take too long to load.
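
A minimal sketch of the two readers (the file name is hypothetical):

library(data.table)
dt <- fread("measurements.csv")     # returns a data.table

library(readr)
df <- read_csv("measurements.csv")  # returns a data frame (tbl_df)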

Other packages from the Hadleyverse

  • lubridate is a package for working with dates
  • stringr offers simple string manipulation
  • testthat and assertthat for nice testing and assertions
  • devtools to facilitate code development
  • ggvis lets you create graphics ggplot2-style, but interactively playable in RStudio or the web, shiny-style.

I have never read the documentation of these packages in depth, but just use them on a case-by-case basis. It's helpful to keep in mind that they exist, though.

Sources and more blog posts