Data analysis Hadley Wickham style
Table of contents
- magrittr: Pipes in R
- dplyr: Easy data frame manipulation
- tidyr: From wide to long data (replacing reshape2)
- Reading and storing big data
- Other packages from the Hadleyverse
- Sources and more blog posts
magrittr: Pipes in R
magrittr introduces the pipe operator %>%, which resembles the UNIX pipe operator |, e.g. in grep ERROR logfile.txt | head.
With it, you get rid of the annoying inside-out wrapping of functions: a nested call such as f(g(h(x))) becomes a left-to-right chain, x %>% h %>% g %>% f.
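To make the nested-versus-piped contrast concrete, here is a minimal sketch. The vector x and the particular functions are my own illustration, not the example from the original post.

```r
library(magrittr)

# Illustrative input (an assumption, not from the original post).
x <- c(1, 4, 9, 16)

# Nested form: you have to read it from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Piped form: the same steps, read from left to right.
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value.
identical(nested, piped)
```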
The current development version on GitHub introduces several other operators:
- %T>% is a tee operator. Compare to UNIX’s tee command. It returns the left-hand side after applying the right-hand side.
- %$% exposes the data frame on the left to the expressions on the right (so you can omit the dataset$ in front of 1000 variables).
- %<>% works like %>%, but afterwards does not return the result of the whole chain. Instead it overwrites the original symbol.
- %,% could later be the same thing for functionals, i.e. to build functions out of pipe commands.
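A short sketch of the first three operators, assuming a magrittr version that exports them. The data (a small vector and the built-in mtcars) is my own choice. %,% is left out because it is only described above as a possible later addition.

```r
library(magrittr)

# %T>% (tee): run print() for its side effect, then pass the
# left-hand side (not print's return value) on to sum().
total <- c(1, 2, 3) %T>% print() %>% sum()

# %$% (exposition): use column names directly instead of mtcars$mpg etc.
r <- mtcars %$% cor(mpg, wt)

# %<>% (compound assignment): pipe v through sqrt() and overwrite v.
v <- c(4, 9, 16)
v %<>% sqrt()
```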
dplyr: Easy data frame manipulation
dplyr is a faster, more consistent version of plyr, but focuses only on data frames (it can handle data.tables too). plyr included functions like ddply, daply, etc.
The most important functions here are:
group_by
summarise
mutate
filter
select
arrange
All functions behave similarly (first argument is data frame, result is data frame), so the magrittr pipe is perfect for chaining these commands.
Further functions not mentioned here: joins, e.g. left_join(), which is the Hadley version of merge().
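For orientation, a minimal sketch of left_join() next to its merge() counterpart. The two tiny data frames (people, scores) are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical example tables.
people <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
scores <- data.frame(id = c(1, 3), score = c(10, 30))

# Keep every row of people; rows without a match get NA for score.
j <- left_join(people, scores, by = "id")

# Base R counterpart: all.x = TRUE makes merge() a left join.
m <- merge(people, scores, by = "id", all.x = TRUE)
```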
filter()
Select a subset of the rows. It works similarly to subset(), but conditions passed as separate arguments are joined by & automatically, so passing two conditions is equivalent to combining them with & yourself.
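A sketch of that equivalence, using the built-in mtcars data set as a stand-in (the original post's data is not shown here):

```r
library(dplyr)

# Separate arguments are combined with & ...
a <- filter(mtcars, cyl == 4, mpg > 30)

# ... so this is the same query written out explicitly.
b <- filter(mtcars, cyl == 4 & mpg > 30)

# Base R counterpart for comparison.
s <- subset(mtcars, cyl == 4 & mpg > 30)
```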
slice()
Select rows by position:
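For instance, again with mtcars as an assumed example data set:

```r
library(dplyr)

# Rows 1 to 3.
first_three <- slice(mtcars, 1:3)

# n() is the number of rows, so this picks the last one.
last_row <- slice(mtcars, n())
```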
arrange()
Reorder rows instead of selecting them:
Use desc(year) to sort descending.
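A small sketch of both directions. mtcars has no year column, so mpg and cyl stand in for it here (my substitution).

```r
library(dplyr)

# Sort by cyl, then by mpg within equal cyl values (ascending).
asc <- arrange(mtcars, cyl, mpg)

# Sort by mpg, highest first.
dsc <- arrange(mtcars, desc(mpg))
```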
select() and rename()
Select columns. Awesome: you can specify ranges and/or exclusions by name, not number.
See ?select for details. You can use helpers like starts_with(), matches() or contains().
Rename columns by passing named arguments to select(). This drops all other columns. If you want to keep them, use rename() instead.
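The variants above, sketched on the built-in iris data set (an assumed example; the column names are iris's own):

```r
library(dplyr)

# Range of columns by name.
r1 <- select(iris, Sepal.Length:Petal.Length)

# Everything except one column.
r2 <- select(iris, -Species)

# Helper: all columns whose name starts with "Petal".
r3 <- select(iris, starts_with("Petal"))

# Renaming inside select() keeps only the named column ...
r4 <- select(iris, species = Species)

# ... while rename() keeps all the others as well.
r5 <- rename(iris, species = Species)
```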
distinct()
Extract unique rows only. Similar to base::unique(), but faster.
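As a quick sketch (mtcars assumed as example data), the unique combinations of two columns with dplyr and with base R:

```r
library(dplyr)

# Unique (cyl, gear) combinations; only these two columns are returned.
d <- distinct(mtcars, cyl, gear)

# Base R counterpart.
u <- unique(mtcars[, c("cyl", "gear")])
```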
mutate() and transmute()
mutate() is similar to base::transform(). It allows you to add new columns to a data frame.
If you want to drop the old variables, use transmute() instead.
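A sketch of the difference, with mtcars as assumed example data and wt_kg as a made-up column name (wt is given in units of 1000 lb):

```r
library(dplyr)

# mutate(): all old columns plus the new one.
with_new <- mutate(mtcars, wt_kg = wt * 453.6)

# transmute(): only the newly computed column.
only_new <- transmute(mtcars, wt_kg = wt * 453.6)
```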
summarise()
Collapses a data frame into a single row.
Use any of R’s aggregation functions: min, max, mean, sum, sd, etc. Additionally, dplyr gives you n() for counting, n_distinct() for counting uniques, and first(x), last(x), and nth(x, n).
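A sketch that uses several of these helpers at once. The data set (mtcars) and the output column names are my own choices.

```r
library(dplyr)

s <- summarise(mtcars,
  n_cars   = n(),             # number of rows
  n_cyl    = n_distinct(cyl), # number of distinct cyl values
  mean_mpg = mean(mpg),
  first_hp = first(hp),
  last_hp  = last(hp),
  third_hp = nth(hp, 3)
)
```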
sample_n() and sample_frac()
Downsample a data frame to n observations or to a specific fraction of its rows.
You can use replace=TRUE for bootstrap samples, and the weight argument for weighted sampling.
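A sketch of the sampling variants on mtcars (assumed example data). The seed is only there to make the draw repeatable.

```r
library(dplyr)

set.seed(1)

# Exactly 10 rows.
s10 <- sample_n(mtcars, 10)

# Half of the rows.
half <- sample_frac(mtcars, 0.5)

# Bootstrap sample: same size as the original, drawn with replacement.
boot <- sample_n(mtcars, nrow(mtcars), replace = TRUE)

# Weighted sample: rows with higher hp are more likely to be drawn.
w <- sample_n(mtcars, 5, weight = hp)
```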
group_by()
This makes the above verbs very powerful. group_by() returns the same data.frame, but with group attributes. The other functions (most notably summarise()) now work separately on each subgroup.
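The typical pattern, sketched with mtcars grouped by cyl (data set and column names are assumptions for illustration):

```r
library(dplyr)

# One summary row per value of cyl instead of one row overall.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg))
```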
The verbs are affected by grouping as follows:
- grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
- grouped arrange() orders first by the grouping variables (in newer dplyr versions only if you set .by_group = TRUE).
- mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x), and are described in detail in vignette("window-functions").
- sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
- slice() extracts rows within each group.
- summarise() is easy to understand and very useful, and is described in more detail above.
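To illustrate the window-function case from the list, here is a sketch of grouped filter() and mutate() on mtcars (an assumed example; mpg_rank is a made-up column name):

```r
library(dplyr)

# Grouped filter(): keep the row(s) with the lowest mpg within each cyl group.
worst <- mtcars %>%
  group_by(cyl) %>%
  filter(min(mpg) == mpg)

# Grouped mutate(): rank mpg within each cyl group, then drop the grouping.
ranked <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_rank = rank(mpg)) %>%
  ungroup()
```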
tidyr: From wide to long data (replacing reshape2)
A newer, better version of reshape2 that integrates with dplyr.
You use gather() instead of melt(), and spread() instead of dcast().
Also, you have separate() and unite() for splitting/combining columns if you have or want values like “male.control” and “female.treatment”.
I mostly only use gather(). You provide (or pipe in) the data frame; with key you specify the column name of the new ID variable; with value you specify the column name of the measured values; afterwards, you supply (unquoted) a comma-separated list of all measured variables, or a list of all ID variables, each prepended with a minus sign.
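A minimal sketch of both ways to name the columns. The data frame wide and the names variable and measurement are hypothetical, and it assumes a tidyr version that still ships gather() and spread().

```r
library(tidyr)

# Hypothetical wide data: one row per id, one column per measured variable.
wide <- data.frame(id = 1:2, height = c(180, 165), weight = c(80, 60))

# Variant 1: list the measured variables.
long1 <- gather(wide, key = "variable", value = "measurement", height, weight)

# Variant 2: list the ID variables with a minus sign.
long2 <- gather(wide, key = "variable", value = "measurement", -id)

# spread() goes back from long to wide.
wide_again <- spread(long1, variable, measurement)
```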
Reading and storing big data
- The data.table package implements a child class of data.frame (i.e. it’s compatible) that speeds up many operations and reduces memory use and the amount of implicit copying.
- Use fread from the data.table package instead of read.csv, so CSV import can take as little as 2% of the time.
- library(readr) is a Hadley package that provides simplified functions for reading data as well, e.g. read_csv().
- Use the rhdf5 package to store big data sets with HDF5, a compressed but easily sliceable format. This allows you to extract only a rectangular slice from your data set, if the whole thing doesn’t fit into memory or would take too long to load.
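A sketch of the three CSV readers side by side, using a temporary file written from mtcars (my own test data). It makes no timing claims; it only shows that the calls are interchangeable and that a data.table is still a data.frame.

```r
library(data.table)
library(readr)

# Write a small CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)

base_df <- read.csv(path)                  # base R
dt      <- fread(path)                     # data.table
tb      <- suppressMessages(read_csv(path)) # readr (messages about column types silenced)

# A data.table inherits from data.frame.
inherits(dt, "data.frame")
```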
Other packages from the Hadleyverse
- lubridate is a package for working with dates
- stringr offers simple string manipulation
- testthat and assertthat for nice testing and assertions
- devtools to facilitate code development
- ggvis lets you create graphics ggplot2-style, but interactively playable in RStudio or on the web, shiny-style.
I never read into the documentation of these packages, but just use them on a case-by-case basis. It’s helpful to keep in mind they exist, though.
Sources and more blog posts
- Source for the magrittr part
- Source for the dplyr part (but also vignette("introduction", package="dplyr"))
- Source for the tidyr part
- http://adolfoalvarez.cl/the-hitchhikers-guide-to-the-hadleyverse/
- http://barryrowlingson.github.io/hadleyverse/#1