Data analysis Hadley Wickham style

magrittr: Pipes in R
dplyr: Easy data frame manipulation
tidyr: From wide to long data (replacing reshape2)
Reading and storing big data
Other packages from the Hadleyverse
Sources and more blog posts

magrittr: Pipes in R

magrittr introduces the pipe operator %>%, which resembles the UNIX-style pipe operator |, e.g. in grep ERROR logfile.txt | head.

With it, you get rid of the annoying inside-out wrapping of functions, e.g.

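library(babynames)
library(dplyr)
library(ggplot2)   # qplot() and ggtitle()
library(magrittr)  # equals() and add(), used in the piped version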
 
mydata <- summarize(group_by(filter(babynames, substr(babynames$name, 1, 3)=="Ste"), year, sex), total=sum(n))
 
qplot(year, total, color=sex, data=mydata, geom="line") + ggtitle('Names starting with "Ste"')

becomes

babynames %>%
    filter(name %>% substr(1, 3) %>% equals("Ste")) %>%
    group_by(year, sex) %>%
    summarize(total = sum(n)) %>%
    qplot(year, total, color = sex, data = ., geom = "line") %>%
    add(ggtitle('Names starting with "Ste"')) %>%
    print

The current development version on GitHub introduces several other operators:

%T>% is a tee operator. Compare to UNIX’s tee command. It evaluates the right-hand side for its side effect, then returns the left-hand side.
%$% exposes the data frame on the left to the expressions on the right (so you can omit the dataset$ in front of 1000 variables).
%<>% works like %>%, but instead of returning the result of the whole chain, it overwrites the original symbol with that result.
%,% could later be the same thing for functionals, i.e. to build functions out of pipe commands
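A minimal sketch of the first three operators, assuming a magrittr version that already ships them. It uses the built-in mtcars data; the variable names are made up for the example.

```r
library(magrittr)

# %T>% (tee): print() runs for its side effect, the original vector moves on
tee_result <- c(1, 4, 9) %T>% print() %>% sqrt()

# %$% (exposition): columns of mtcars are visible by bare name on the right
cor_mpg_wt <- mtcars %$% cor(mpg, wt)

# %<>% (assignment pipe): pipe x through sqrt() and store the result back in x
x <- c(16, 25, 36)
x %<>% sqrt()
```

After the last line x holds the square roots, so no explicit x <- ... is needed.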

dplyr: Easy data frame manipulation

dplyr is a faster, more consistent version of plyr, but focuses only on data frames (it can handle data.tables too). plyr included functions like ddply, daply, etc.

The most important functions here are:

group_by
summarise
mutate
filter
select
arrange

All functions behave similarly (first argument is data frame, result is data frame), so the magrittr pipe is perfect for chaining these commands.

Further functions not mentioned here: joins, e.g. left_join, which is a Hadley version of merge().
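A quick sketch of a join, since it is not covered further. The two toy tables (people, scores) are invented for this example.

```r
library(dplyr)

# Hypothetical toy tables, made up for illustration
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
scores <- data.frame(id = c(1, 3), score = c(90, 75))

# Keep every row of people; ids without a match get NA in score
joined <- left_join(people, scores, by = "id")
```

Roughly the same result comes out of merge(people, scores, by = "id", all.x = TRUE), though the row order can differ.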

filter()

Select a subset of the rows. These two lines are equivalent:

filter(flights, month==1, day==1)
flights[flights$month==1 & flights$day==1, ]

It works similarly to subset(), but the arguments are joined by & automatically.

slice()

Select rows by position:

slice(flights, 1:10)

arrange()

Reorder rows instead of selecting them:

arrange(flights, year, month, day)

Use desc(year) to sort descending.
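A small sketch mixing ascending and descending keys. It uses the built-in mtcars data rather than flights, so it runs without extra data packages.

```r
library(dplyr)

# Sort by cylinders ascending, then by mpg descending within each cylinder count
sorted <- arrange(mtcars, cyl, desc(mpg))
head(sorted[, c("cyl", "mpg")])
```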

select() and rename()

Select columns. Awesome: Specify ranges and/or exclusions by name, not number:

select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))

See ?select for details. You can use helpers like starts_with(), matches() or contains().
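The helpers in action, as a sketch on the built-in iris data (chosen here only because its column names share prefixes and suffixes):

```r
library(dplyr)

sepal_cols <- select(iris, starts_with("Sepal"))   # Sepal.Length, Sepal.Width
width_cols <- select(iris, contains("Width"))      # Sepal.Width, Petal.Width
petal_len  <- select(iris, matches("^Petal\\.L"))  # regular expression match
```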

Rename columns by using named arguments:

select(flights, tail_num=tailnum)

This drops all other columns. If you want to keep them, use rename:

rename(flights, tail_num=tailnum)

distinct()

Extract unique rows only. Similar to, but faster than, base::unique():

distinct(select(flights, tailnum))  # which tailnums appear in this column?

mutate() and transmute()

mutate is similar to base::transform(). It allows you to add new columns to a data frame:

mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

If you want to drop the old variables, use transmute() instead.
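To see the difference side by side, here is a sketch on the built-in mtcars data. The column name hp_per_cyl is made up for the example.

```r
library(dplyr)

# mutate() keeps all 11 original columns and appends the new one
kept <- mutate(mtcars, hp_per_cyl = hp / cyl)

# transmute() returns only the columns you define
only_new <- transmute(mtcars, hp_per_cyl = hp / cyl)
```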

summarise()

Collapses a data frame into a single row:

summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE))

Use any of R’s aggregation functions: min, max, mean, sum, sd, etc. Additionally, dplyr gives you n() for counting, n_distinct() for counting uniques, and first(x), last(x), and nth(x, n).
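A sketch of the dplyr-specific helpers, again on mtcars. The summary column names are invented for the example.

```r
library(dplyr)

stats <- summarise(mtcars,
  cars       = n(),              # number of rows
  cyl_values = n_distinct(cyl),  # number of different cylinder counts
  first_mpg  = first(mpg),       # mpg in the first row
  last_mpg   = last(mpg),        # mpg in the last row
  third_mpg  = nth(mpg, 3)       # mpg in the third row
)
```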

sample_n() and sample_frac()

Downsample a data frame to n observations or a specific fraction.

sample_n(flights, 10)
sample_frac(flights, 0.01)

Use replace=TRUE for bootstrap samples, and the weight argument for weighted sampling.
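Both options as a sketch on mtcars. Weighting by the wt column is an arbitrary choice for illustration, and the seed is only there to make the draw repeatable.

```r
library(dplyr)

set.seed(1)

# Bootstrap sample: same size as the data, drawn with replacement
boot <- sample_n(mtcars, nrow(mtcars), replace = TRUE)

# Weighted sample: heavier cars are more likely to be drawn
heavy <- sample_n(mtcars, 5, weight = wt)
```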

group_by()

This makes the above verbs very powerful. group_by() returns the same data.frame, but with group attributes. The other functions (most notably summarise()) now work separately on each subgroup:

iris %>%
  sample_n(20) %>%
  group_by(Species) %>%
  summarise(m.p.l=mean(Petal.Length))

## Source: local data frame [3 x 2]
## 
##      Species m.p.l
##       (fctr) (dbl)
## 1     setosa  1.32
## 2 versicolor  4.20
## 3  virginica  5.60

The verbs are affected by grouping as follows:

grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
grouped arrange() orders first by grouping variables
mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x), and are described in detail in vignette("window-functions").
sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
slice() extracts rows within each group.
summarise() is easy to understand and very useful: it computes one summary row per group, as in the iris example above.
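For the grouped filter() and mutate() case, a sketch on iris. The column name rank_in_species is made up, and min_rank() is just one of several ranking functions dplyr offers.

```r
library(dplyr)

# Grouped filter: keep the row(s) with the longest petal in each species
longest <- iris %>%
  group_by(Species) %>%
  filter(Petal.Length == max(Petal.Length))

# Grouped mutate: rank petals within each species, longest first
ranked <- iris %>%
  group_by(Species) %>%
  mutate(rank_in_species = min_rank(desc(Petal.Length))) %>%
  ungroup()
```

Without the group_by() line, both calls would compare against the overall maximum instead of the per-species one.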

tidyr: From wide to long data (replacing reshape2)

A newer, better version of reshape2. Integrates with dplyr.

You use gather() instead of melt(), and spread() instead of dcast().

Also you have separate() and unite() for splitting/combining columns if you have or want values like “male.control” and “female.treatment”.
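A sketch of the round trip. The toy data frame and the new column names sex and arm are invented for the example.

```r
library(tidyr)

# Hypothetical toy data, made up for illustration
df <- data.frame(
  group = c("male.control", "female.treatment"),
  value = c(1.5, 2.5)
)

# Split one column into two at the dot (sep is a regular expression)
split_df <- separate(df, group, into = c("sex", "arm"), sep = "\\.")

# Paste the two columns back together into one
rejoined <- unite(split_df, "group", sex, arm, sep = ".")
```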

An example for gather() (I mostly only use this function):

library(tidyr)

# iris measurements are in centimetres, so the value column is named "cm"
gather(iris, key="property", value="cm", -Species) %>%
  head()

##   Species     property  cm
## 1  setosa Sepal.Length 5.1
## 2  setosa Sepal.Length 4.9
## 3  setosa Sepal.Length 4.7
## 4  setosa Sepal.Length 4.6
## 5  setosa Sepal.Length 5.0
## 6  setosa Sepal.Length 5.4

So you provide (or pipe in) the data frame. With key you name the new column that will hold the former column names; with value you name the new column that will hold the measurements. After that you list, unquoted and comma-separated, either all the measured columns to gather or all the ID columns to keep, the latter each prefixed with a minus sign.
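The two ways of specifying the columns should give the same long table. A sketch on iris, where the value column is called cm here because iris is measured in centimetres:

```r
library(tidyr)

# Variant 1: name the ID column to keep, with a minus sign
long_a <- gather(iris, key = "property", value = "cm", -Species)

# Variant 2: name the measured columns to gather
long_b <- gather(iris, key = "property", value = "cm",
                 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
```

With only one ID column the minus variant is shorter; with many ID columns and few measurements it is the other way round.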

Reading and storing big data

The data.table package implements a child class (i.e. it’s compatible) of data.frame that speeds up many operations on it and reduces file size and the amount of implicit copying.
Use fread from the data.table package instead of read.csv; CSV import can then take as little as about 2% of the time.
library(readr) is a Hadley package that provides simplified functions for reading data as well, e.g. read_csv().
Use the rhdf5 package to store big data sets in HDF5, a compressed but easily sliceable format. This allows you to extract only a rectangular slice from your data set if the whole thing doesn’t fit into memory or would take too long to load.
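A sketch comparing the three CSV readers on a small temporary file, assuming data.table and readr are installed. The file is written on the fly from iris, so no timing claims can be read off it; it only shows that the calls are drop-in replacements.

```r
library(data.table)
library(readr)

# Write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

base_df <- read.csv(csv_path)   # base R: plain data.frame
dt      <- fread(csv_path)      # data.table, which is also a data.frame
tb      <- read_csv(csv_path)   # readr: tibble
```

To get a feeling for the speed difference on your own data, wrap each call in system.time().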

Other packages from the Hadleyverse

lubridate is a package for working with dates
stringr offers simple string manipulation
testthat and assertthat for nice testing and assertions
devtools to facilitate package development
ggvis lets you create graphics ggplot2-style, but interactively playable in RStudio or the web, shiny-style.

I never read into the documentation of these packages, but just use them on a case-by-case basis. It’s helpful to keep in mind they exist, though.

Sources and more blog posts

Source for the magrittr part
Source for the dplyr part (but also vignette("introduction", package="dplyr"))
Source for the tidyr part
http://adolfoalvarez.cl/the-hitchhikers-guide-to-the-hadleyverse/
http://barryrowlingson.github.io/hadleyverse/#1