Data analysis Hadley Wickham style
Table of contents
- magrittr: Pipes in R
- dplyr: Easy data frame manipulation
- tidyr: From wide to long data (replacing reshape2)
- Reading and storing big data
- Other packages from the Hadleyverse
- Sources and more blog posts
magrittr: Pipes in R
Magrittr introduces the pipe operator %>%, which resembles the UNIX pipe operator |, e.g. in grep ERROR logfile.txt | head.
With it, you get rid of the annoying inside-out wrapping of functions, e.g.
library(babynames)
library(dplyr)
library(ggplot2)  # provides qplot() and ggtitle()
mydata <- summarize(group_by(filter(babynames, substr(name, 1, 3) == "Ste"), year, sex), total = sum(n))
qplot(year, total, color = sex, data = mydata, geom = "line") + ggtitle('Names starting with "Ste"')
becomes
# equals() and add() are magrittr aliases for == and +, so library(magrittr) is needed
babynames %>%
  filter(name %>% substr(1, 3) %>% equals("Ste")) %>%
  group_by(year, sex) %>%
  summarize(total = sum(n)) %>%
  qplot(year, total, color = sex, data = ., geom = "line") %>%
  add(ggtitle('Names starting with "Ste"')) %>%
  print
The current development version on GitHub introduces several other operators:
- %T>% is a tee operator (compare UNIX’s tee command). It returns the left-hand side after applying the right-hand side.
- %$% exposes the data frame on the left to the expressions on the right (so you can omit the dataset$ in front of 1000 variables).
- %<>% works like %>%, but afterwards does not return the result of the whole chain; instead it overwrites the original symbol.
- %,% could later be the same thing for functionals, i.e. to build functions out of pipe commands.
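To make the first three operators concrete, here is a minimal sketch. It assumes an installed magrittr that exports %T>%, %$% and %<>%; the vector x and the built-in mtcars data are illustrative choices, not taken from the original post.

```r
library(magrittr)

# %T>% (tee): run str() for its printing side effect, then pass the
# left-hand side (not str()'s NULL result) on to sum()
total <- c(1, 2, 3) %T>% str %>% sum

# %$% (exposition): refer to columns without the mtcars$ prefix
r <- mtcars %$% cor(mpg, wt)

# %<>% (compound assignment): pipe x through sqrt() and overwrite x
x <- c(1, 4, 9)
x %<>% sqrt
```

With a plain %>% in place of %T>%, sum() would receive the NULL returned by str() instead of the numbers.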
dplyr: Easy data frame manipulation
dplyr is a faster, more consistent version of plyr, but focuses only on data frames (it can handle data.tables too). plyr included functions like ddply, daply, etc.
The most important functions here are:
- group_by
- summarise
- mutate
- filter
- select
- arrange
All functions behave similarly (first argument is data frame, result is data frame), so the magrittr pipe is perfect for chaining these commands.
Further functions not mentioned here: joins, e.g. left_join, which is a Hadley version for merge().
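As a rough illustration of how left_join() relates to merge(), the following sketch joins two tiny tables. The data frames people and scores are made up for this example.

```r
library(dplyr)

# Two small illustrative tables sharing the key column "id"
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cyd"))
scores <- data.frame(id = c(1, 3), score = c(90, 75))

# left_join() keeps every row of the left table; unmatched rows get NA
joined <- left_join(people, scores, by = "id")

# Roughly the same result with base R; all.x = TRUE makes it a left join
merged <- merge(people, scores, by = "id", all.x = TRUE)
```

Unlike merge(), left_join() does not re-sort the rows by the key; it preserves the row order of the left table.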
filter()
Select a subset of the rows. These two lines are equivalent (flights is the data set from the nycflights13 package):
filter(flights, month == 1, day == 1)
flights[flights$month == 1 & flights$day == 1, ]
It works similarly to subset(), but the conditions are joined by & automatically.
slice()
Select rows by position:
slice(flights, 1:10)
arrange()
Reorder rows instead of selecting them:
arrange(flights, year, month, day)
Use desc(year) to sort descending.
select() and rename()
Select columns. Awesome: Specify ranges and/or exclusions by name, not number:
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
See ?select for details. You can use helpers like starts_with(), matches() or contains().
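A small sketch of these helpers follows. It uses the built-in iris data rather than flights so that it runs without extra data packages; that substitution is an assumption made for the example.

```r
library(dplyr)

# iris columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
petal_cols  <- select(iris, starts_with("Petal"))  # names beginning with "Petal"
width_cols  <- select(iris, contains("Width"))     # names containing "Width"
length_cols <- select(iris, matches("Length$"))    # regex: names ending in "Length"
no_species  <- select(iris, -Species)              # everything except Species
```

The helpers only pick columns; the original column order is kept in the result.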
Rename columns by using named arguments:
select(flights, tail_num = tailnum)
This drops all other columns. If you want to keep them, use rename():
rename(flights, tail_num = tailnum)
distinct()
Extract unique rows only. Similar to base::unique(), but faster.
distinct(select(flights, tailnum)) # which tailnums appear in this column?
mutate() and transmute()
mutate is similar to base::transform(). It allows you to add new columns to a data frame:
mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
If you want to drop the old variables, use transmute() instead.
summarise()
Collapses a data frame into a single row:
summarise(flights,
  delay = mean(dep_delay, na.rm = TRUE))
Use any of R’s aggregation functions: min, max, mean, sum, sd, etc. Additionally, dplyr gives you n() for counting, n_distinct() for counting uniques, and first(x), last(x), and nth(x, n).
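The dplyr-specific helpers can be sketched like this. The example uses the built-in mtcars data, and the result column names are arbitrary choices for illustration.

```r
library(dplyr)

stats <- summarise(mtcars,
  cars       = n(),              # number of rows
  cyl_levels = n_distinct(cyl),  # number of distinct cylinder counts
  first_mpg  = first(mpg),       # value in the first row
  third_mpg  = nth(mpg, 3),      # value in the third row
  last_mpg   = last(mpg)         # value in the last row
)
```

The result is a one-row data frame with one column per summary.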
sample_n() and sample_frac()
Downsample a data frame to n observations or a specific fraction.
sample_n(flights, 10)
sample_frac(flights, 0.01)
Use replace = TRUE for bootstrap samples, and the weight argument for weighted sampling.
group_by()
This makes the above verbs very powerful. group_by() returns the same data.frame, but with group attributes. The other functions (most notably summarise()) now work separately on each subgroup:
iris %>%
  sample_n(20) %>%
  group_by(Species) %>%
  summarise(m.p.l = mean(Petal.Length))
## Source: local data frame [3 x 2]
##
##      Species m.p.l
##       (fctr) (dbl)
## 1     setosa  1.32
## 2 versicolor  4.20
## 3  virginica  5.60
The verbs are affected by grouping as follows:
- grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
- grouped arrange() orders first by the grouping variables (newer dplyr versions do this only with .by_group = TRUE).
- mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x), and are described in detail in vignette("window-functions").
- sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
- slice() extracts rows within each group.
- summarise() is easy to understand and very useful: it returns one row per group.
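The grouped behaviour of filter(), mutate() and slice() can be sketched as follows. The example uses the built-in mtcars data; the names by_cyl, best, ranked, first_two and mpg_rank are illustrative.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Grouped filter(): keep the most economical car(s) within each cyl group
best <- filter(by_cyl, mpg == max(mpg))

# Grouped mutate(): rank cars by mpg within their own cyl group
ranked <- mutate(by_cyl, mpg_rank = min_rank(desc(mpg)))

# Grouped slice(): the first two rows of every group
first_two <- slice(by_cyl, 1:2)
```

Without the group_by() step, the same calls would compare against the overall maximum, rank across all cars, and return only the first two rows of the whole table.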
tidyr: From wide to long data (replacing reshape2)
A newer, better version of reshape2. Integrates with dplyr.
You use gather() instead of melt(), and spread() instead of cast().
Also you have separate() and unite() for splitting/combining column names if you have or want things like “male.control” and “female.treatment”.
An example for gather() (I mostly only use this function):
gather(iris, key = "property", value = "cm", -Species) %>%
  head()
##   Species     property  cm
## 1  setosa Sepal.Length 5.1
## 2  setosa Sepal.Length 4.9
## 3  setosa Sepal.Length 4.7
## 4  setosa Sepal.Length 4.6
## 5  setosa Sepal.Length 5.0
## 6  setosa Sepal.Length 5.4
So you provide (or pipe in) the data frame; with key you specify the name of the new column that holds the former column names; with value you specify the name of the new column that holds the measured values (centimeters in the case of iris); afterwards, you supply (unquoted) a comma-separated list of all measured variables, or a list of all ID variables, each prepended with a minus sign.
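For completeness, here is a sketch of how separate(), unite() and spread() fit together with gather(). The helper column id and the names part, dimension and cm are assumptions introduced for this example; the id column is needed so that spread() can tell the rows apart again.

```r
library(dplyr)
library(tidyr)

long <- iris %>%
  mutate(id = row_number()) %>%                     # row id, needed to spread back later
  gather(key = "property", value = "cm", -Species, -id) %>%
  separate(property, into = c("part", "dimension"), sep = "\\.")

wide <- long %>%
  unite(property, part, dimension, sep = ".") %>%   # rebuild e.g. "Sepal.Length"
  spread(key = property, value = cm) %>%
  arrange(id)
```

Here long has one row per flower and measurement, with "Sepal.Length" split into part and dimension, while wide is back to one row per flower.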
Reading and storing big data
- The data.table package implements a child class (i.e. it’s compatible) of data.frame that speeds up many operations on it and reduces file size and the amount of implicit copying.
- Use fread from the data.table package instead of read.csv; CSV import can then take as little as about 2% of the time.
- library(readr) is a Hadley package that provides simplified functions for reading data as well, e.g. read_csv().
- Use the rhdf5 package to apply HDF5 to store big data sets in a compressed but easily sliceable format. This allows you to extract only a rectangular slice from your data set, if the whole thing doesn’t fit into memory or would take too long to load.
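A minimal sketch of swapping read.csv for fread, assuming the data.table package is installed. The temporary CSV file is created only for this example, and no timing is measured here.

```r
library(data.table)

# Write a small illustrative CSV to a temporary file
csv_file <- tempfile(fileext = ".csv")
write.csv(iris, csv_file, row.names = FALSE)

df <- read.csv(csv_file)  # base R reader, returns a data.frame
dt <- fread(csv_file)     # data.table reader, returns a data.table

# A data.table is also a data.frame, so data frame code keeps working
inherits(dt, "data.frame")
```

On a file this small the speed difference is negligible; it only becomes noticeable on large files.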
Other packages from the Hadleyverse
- lubridate is a package for working with dates
- stringr offers simple string manipulation
- testthat and assertthat for nice testing and assertions
- devtools to facilitate code development
- ggvis lets you create graphics ggplot2-style, but interactively playable in RStudio or on the web, shiny-style.
I never read into the documentation of these packages, but just use them on a case-by-case basis. It’s helpful to keep in mind they exist, though.
Sources and more blog posts
- Source for the magrittr part
- Source for the dplyr part (but also vignette("introduction", package="dplyr"))
- Source for the tidyr part
- http://adolfoalvarez.cl/the-hitchhikers-guide-to-the-hadleyverse/
- http://barryrowlingson.github.io/hadleyverse/#1


