Many R functions are pipe-friendly: they take the data as the first argument and transform it in some way. This arrangement lets pipes streamline a sequence of operations: a value is passed as the first argument of a function, gets transformed, and the result is passed as the first argument of the next function. A chain of commands connected in this way is called a pipeline.
Here is an example of reorganizing code built from elementary functions into a pipeline.
Suppose the original code is
summary(sample(diff(log(rnorm(100,mean = 10))),
size = 10000,replace = TRUE))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.3013000 -0.0830800 -0.0116400 -0.0004926 0.0723500 0.2795000
Note that rnorm(), log(), diff(), sample(), and summary() all take the data as the first argument. We can use %>>% to rewrite the code so that the process of data transformation is straightforward.
library(pipeR)
set.seed(123)
rnorm(100, mean = 10) %>>%
log %>>%
diff %>>%
sample(size = 10000, replace = TRUE) %>>%
summary
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.309500 -0.083720 -0.012360 -0.001854 0.071440 0.358400
The syntax of first-argument piping is as follows: whenever a function name or call is supplied on the right-hand side of %>>%, the left-hand side value is passed as the first unnamed argument of that function.
Syntax | Evaluate as |
---|---|
x %>>% f | f(x) |
x %>>% f(...) | f(x, ...) |
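The two forms in the table can be checked directly. The following is a minimal sketch, assuming pipeR is installed; the vector x is made up for illustration.

```r
library(pipeR)

# Illustrative input; any numeric vector would do.
x <- c(3.14159, 2.71828)

# x %>>% f is evaluated as f(x)
by_name <- x %>>% sqrt
identical(by_name, sqrt(x))

# x %>>% f(...) is evaluated as f(x, ...)
by_call <- x %>>% round(digits = 2)
identical(by_call, round(x, digits = 2))
```

Both comparisons should return TRUE, since each pipeline is just another way of writing the ordinary call.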
You can write everything on one line, but that is rarely elegant. It is better to spend a few more lines in exchange for readability.
Note that whenever we want to continue building the pipeline with the previous result, we end that line with %>>%. If a line does not end with %>>%, the pipeline ends there.
Some more examples with graphics functions:
mtcars$mpg %>>%
plot
mtcars$mpg %>>%
plot(col="red")
Sometimes the value on the left is needed in multiple places. In this case you can use . to represent it anywhere in the function call.
Plot mtcars$mpg with a title indicating the number of points.
mtcars$mpg %>>%
plot(col="red", main=sprintf("number of points: %d",length(.)))
Take a sample of half the lowercase letters.
letters %>>%
sample(size = length(.)/2)
# [1] "p" "j" "o" "x" "q" "e" "z" "u" "i" "l" "f" "a" "m"
There are situations where one calls a function in a namespace with ::. In this case, the call must end with parentheses, with or without arguments.
mtcars$mpg %>>%
stats::median()
mtcars$mpg %>>%
graphics::plot(col = "red")
The same rule also applies when piping a value to a function in a list.
functions <- list(average = function(x) mean(x))
mtcars$mpg %>>% functions$average()
mtcars$mpg %>>% functions[["average"]]()
In both cases above, () is necessary so that R treats the expression before () as the function to call; otherwise $ and [[ themselves would be regarded as the function we want to pipe the value to.
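One way to see why the parentheses matter is to inspect the unevaluated expressions in base R. This sketch does not use pipeR at all; it only shows that, to the parser, stats::median and functions$average are themselves calls to :: and $, which is presumably why a pipe operator cannot tell them apart from an ordinary function call unless () is added.

```r
# Without parentheses, each expression is a call whose function is `::` or `$`.
bare_ns   <- quote(stats::median)
bare_list <- quote(functions$average)
ns_head   <- as.character(bare_ns[[1]])    # "::"
list_head <- as.character(bare_list[[1]])  # "$"

# With parentheses, the function position of the outer call holds the
# whole stats::median expression, so it is clear what should be called.
with_parens <- quote(stats::median())
same_target <- identical(with_parens[[1]], bare_ns)
```

Nothing is evaluated here: quote() only captures the expressions, so the list named functions does not even need to exist.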
Notice that %>>% not only works between function calls but can also be nested inside a function call. For example,
mtcars %>>%
subset(mpg <= quantile(mpg,0.95), c(mpg, wt)) %>>%
summary
# mpg wt
# Min. :10.40 Min. :1.513
# 1st Qu.:15.28 1st Qu.:2.772
# Median :18.95 Median :3.438
# Mean :19.22 Mean :3.297
# 3rd Qu.:21.48 3rd Qu.:3.690
# Max. :30.40 Max. :5.424
can be written as
mtcars %>>%
subset(mpg <= mpg %>>% quantile(0.95), c(mpg, wt)) %>>%
summary
# mpg wt
# Min. :10.40 Min. :1.513
# 1st Qu.:15.28 1st Qu.:2.772
# Median :18.95 Median :3.438
# Mean :19.22 Mean :3.297
# 3rd Qu.:21.48 3rd Qu.:3.690
# Max. :30.40 Max. :5.424
One important thing to notice here is that pipeR does not support lazy evaluation of the left value: the left value is evaluated immediately, so the function on the right never sees the unevaluated expression. One example that might be expected to work but actually does not is
10000 %>>%
replicate(rnorm(1000)) %>>%
system.time
# user system elapsed
# 0 0 0
This is not equivalent to
system.time(replicate(10000, rnorm(1000)))
# user system elapsed
# 1.15 0.03 1.19
even though both take almost the same time to compute. system.time() starts a timer when it begins evaluating its argument. In a pipeline, however, the value on the left of %>>% is always evaluated before being passed as the first argument of the function. That is why system.time() reports zero seconds: it only starts timing after the loop has finished! The same is true for other functions that compute on the language.
Therefore, you should always make sure that the left value is valid on its own before putting it before %>>%.
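The effect can be reproduced without pipeR. In the sketch below, slow_value() is a hypothetical stand-in for an expensive computation; system.time() only measures the work when it receives the unevaluated call, not a value that has already been computed.

```r
# Hypothetical slow computation, for illustration only.
slow_value <- function() {
  Sys.sleep(0.2)
  42
}

# system.time() evaluates the call itself, so the sleep is timed.
timed_call <- system.time(slow_value())

# Here the work is done before system.time() starts, which mirrors what
# happens to the left-hand side of %>>%: only a variable lookup is timed.
value <- slow_value()
timed_value <- system.time(value)

timed_call[["elapsed"]] > timed_value[["elapsed"]]
```

The exact timings depend on the machine, but the first measurement should include the 0.2 second sleep while the second should be close to zero.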
In some other cases, the function is not very friendly to pipeline operation, that is, it does not take the data you transform through a pipeline as the first argument. One example is the linear model function lm(). This function takes formula first and then data.
If you directly run
mtcars %>>%
lm(mpg ~ cyl + wt)
# Error in as.data.frame.default(data): cannot coerce class ""formula"" to a data.frame
it will fail because %>>% evaluates lm(mtcars, mpg ~ cyl + wt), which is not what the function expects. There are two ways to build a pipeline with this kind of function.
First, use a named argument to specify the formula.
mtcars %>>%
lm(formula = mpg ~ cyl + wt)
#
# Call:
# lm(formula = mpg ~ cyl + wt, data = .)
#
# Coefficients:
# (Intercept) cyl wt
# 39.686 -1.508 -3.191
This works because it is actually evaluated as
lm(mtcars, formula = mpg ~ cyl + wt)
and R's argument matching works as follows: since formula, the first argument in the definition of lm(), is already specified by name, the unnamed argument mtcars is matched to the next free argument, data, which is exactly what we want. Therefore, it works fine here.
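The matching can be verified in base R without any piping. This is a small sketch using match.call() to show how R assigns the arguments of the rewritten call, and comparing the fitted coefficients with those of the conventional call.

```r
# Conventional call, with data given by name.
fit_conventional <- lm(mpg ~ cyl + wt, data = mtcars)

# The form produced by first-argument piping: formula is matched by name,
# so the unnamed mtcars falls through to the next free argument, data.
fit_piped_form <- lm(mtcars, formula = mpg ~ cyl + wt)

# Ask R how it matches the arguments of the rewritten call.
matched <- match.call(lm, quote(lm(mtcars, formula = mpg ~ cyl + wt)))
matched$data

# Both fits should have the same coefficients.
all.equal(coef(fit_conventional), coef(fit_piped_form))
```

match.call() reports mtcars under data, which confirms that naming formula is enough to steer the unnamed value to the right place.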
However, this trick works well for some functions but not all. Suppose a function takes data as its third or fourth argument. In this case, you would have to specify all preceding arguments explicitly by name. If the data argument follows ..., the trick does not work at all.
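Both limitations follow from R's argument matching rules and can be illustrated in base R. The two functions below are hypothetical and exist only for this sketch; they report whether data received a value.

```r
# Hypothetical function that takes data as its third argument.
data_third <- function(a, b, data) {
  if (missing(data)) "data is missing" else "data was supplied"
}

# Naming every preceding argument lets the unnamed value reach data.
third_ok <- data_third(mtcars, a = 1, b = 2)

# Hypothetical function whose data argument comes after ...
data_after_dots <- function(formula, ..., data) {
  if (missing(data)) "data is missing" else "data was supplied"
}

# Naming formula does not help: the unnamed value is absorbed by ...
dots_unnamed <- data_after_dots(mtcars, formula = mpg ~ wt)

# Only naming data itself works, and first-argument piping passes
# the left value unnamed.
dots_named <- data_after_dots(formula = mpg ~ wt, data = mtcars)

c(third_ok, dots_unnamed, dots_named)
```

Arguments declared after ... can only be matched by their full name, so an unnamed value can never land in data for the second function.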
The second way is dot piping, which is designed for more flexible pipeline construction. It allows you to use . to represent the left-hand side value and put it anywhere you want in the next expression. The next page demonstrates its syntax and when it might be useful.