pipeR Tutorial

Pipe to first argument

Many R functions are pipe-friendly: they take some data as the first argument and transform it in a certain way. This arrangement allows operations to be streamlined by pipes: a data source is passed as the first argument of a function, gets transformed, and the result is passed as the first argument of the next function. In this way, a chain of commands is connected, and the chain is called a pipeline.

Here is an example of reorganizing nested calls to elementary functions into a pipeline.

Suppose the original code is

summary(sample(diff(log(rnorm(100, mean = 10))),
  size = 10000, replace = TRUE))
#       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
# -0.3013000 -0.0830800 -0.0116400 -0.0004926  0.0723500  0.2795000

Note that rnorm() produces the data, and log(), diff(), sample(), and summary() all take the data as the first argument. We can use %>>% to rewrite the code so that the steps of the data transformation read in the order in which they happen.

library(pipeR)
set.seed(123)
rnorm(100, mean = 10) %>>%
  log %>>%
  diff %>>%
  sample(size = 10000, replace = TRUE) %>>%
  summary
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
# -0.309500 -0.083720 -0.012360 -0.001854  0.071440  0.358400

The rule of first-argument piping is that whenever a function name or a function call is supplied on the right-hand side of %>>%, the left-hand side value is passed as the first unnamed argument of that function.

Syntax            Evaluated as
x %>>% f          f(x)
x %>>% f(...)     f(x, ...)
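
As a quick check of the two rules above, the following sketch compares each piped form with its ordinary call. It assumes pipeR is installed; the input vector is arbitrary example data.

```r
library(pipeR)

x <- c(1, 4, 9)  # arbitrary example data

# x %>>% f is evaluated as f(x)
piped_sqrt <- x %>>% sqrt
plain_sqrt <- sqrt(x)

# x %>>% f(...) is evaluated as f(x, ...)
piped_head <- x %>>% head(n = 2)
plain_head <- head(x, n = 2)

identical(piped_sqrt, plain_sqrt)
identical(piped_head, plain_head)
```

Both comparisons should report that the piped and the plain results are the same.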

Although you can write everything on one line, that is probably not very elegant. It is better to spend a few more lines and gain readability.

Note that whenever we want to continue building the pipeline with the previous result, we end that line with %>>%. If a line does not end with %>>%, the pipeline ends there.
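
This rule follows from how R parses source text, which a small base R sketch can show. parse() only checks syntax, so pipeR does not need to be loaded here; the two code strings are made-up examples.

```r
# A line ending with %>>% is incomplete, so the parser continues onto the next line
ok <- parse(text = "1:3 %>>%\n  sum")
length(ok)  # number of expressions found in the text

# A complete first line ends the expression; a following line that starts
# with %>>% is then a syntax error
bad <- tryCatch(parse(text = "1:3\n  %>>% sum"), error = function(e) e)
inherits(bad, "error")
```

The first string parses as a single expression spanning both lines, while the second cannot be parsed at all.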

Some more examples with graphics functions:

mtcars$mpg %>>%
  plot
mtcars$mpg %>>%
  plot(col="red")

Sometimes the value on the left is needed in more than one place. In this case you can use . to represent it anywhere in the function call.

Plot mtcars$mpg with a title indicating the number of points.

mtcars$mpg %>>%
  plot(col="red", main=sprintf("number of points: %d",length(.)))

Take a sample of half of the lowercase letters.

letters %>>%
  sample(size = length(.)/2)
#  [1] "p" "j" "o" "x" "q" "e" "z" "u" "i" "l" "f" "a" "m"

There are situations where one calls a function in a namespace with ::. In this case, the call must end with parentheses, with or without arguments.

mtcars$mpg %>>%
  stats::median()

mtcars$mpg %>>%
  graphics::plot(col = "red")

The same rule also applies when piping a value to a function in a list.

functions <- list(average = function(x) mean(x))
mtcars$mpg %>>% functions$average()
mtcars$mpg %>>% functions[["average"]]()

In both cases above, () is necessary so that R treats the expression before () as the function to call. Otherwise, $ and [[ themselves would be taken as the function to pipe the value to.
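
The reason can be seen by inspecting how R parses these expressions. The sketch below uses only base R's quote(); nothing is evaluated, so the functions list does not need to exist.

```r
# Without parentheses, the expression is itself a call whose function is `$`
bare <- quote(functions$average)
as.character(bare[[1]])

# The same holds for `::`
ns <- quote(stats::median)
as.character(ns[[1]])

# With parentheses, the function position holds the whole expression
# functions$average, which is what we want to be called
called <- quote(functions$average())
deparse(called[[1]])
```

In the first two cases the function position is occupied by the operator ($ or ::), which is why a pipe that looks at the right-hand side as a call needs the trailing ().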

Notice that %>>% not only works between function calls but can also be nested inside a function call. For example,

mtcars %>>%
  subset(mpg <= quantile(mpg,0.95), c(mpg, wt)) %>>%
  summary
#       mpg              wt       
#  Min.   :10.40   Min.   :1.513  
#  1st Qu.:15.28   1st Qu.:2.772  
#  Median :18.95   Median :3.438  
#  Mean   :19.22   Mean   :3.297  
#  3rd Qu.:21.48   3rd Qu.:3.690  
#  Max.   :30.40   Max.   :5.424

can be written as

mtcars %>>%
  subset(mpg <= mpg %>>% quantile(0.95), c(mpg, wt)) %>>%
  summary
#       mpg              wt       
#  Min.   :10.40   Min.   :1.513  
#  1st Qu.:15.28   1st Qu.:2.772  
#  Median :18.95   Median :3.438  
#  Mean   :19.22   Mean   :3.297  
#  3rd Qu.:21.48   3rd Qu.:3.690  
#  Max.   :30.40   Max.   :5.424

One important thing to notice here is that pipeR does not support lazy evaluation of the left value: the left value is evaluated immediately, so the function on the right never receives it as an unevaluated expression. One example that might be expected to work, but does not, is

10000 %>>% 
  replicate(rnorm(1000)) %>>%
  system.time
#    user  system elapsed 
#       0       0       0

This is not equivalent to

system.time(replicate(10000, rnorm(1000)))
#    user  system elapsed 
#    1.15    0.03    1.19

even though the two versions take almost the same time to compute. system.time() starts its timer and then evaluates the expression it is given. Here, however, the value on the left of %>>% is always evaluated before being passed as the first argument of the function. That is why system.time() reports zero seconds: it only starts timing after the loop has finished. The same is true for other functions that compute on the language, that is, on unevaluated expressions.

Therefore, always make sure that the left value is valid on its own before putting it in front of %>>%.
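
The effect can be reproduced without a pipe. In this sketch, slow_task() is a hypothetical stand-in for any expensive computation; computing its value before handing it to system.time() mimics what happens to the left value of %>>%.

```r
# Hypothetical stand-in for an expensive computation
slow_task <- function() {
  Sys.sleep(0.3)
  42
}

# Value computed first, then passed on: system.time() only times
# the lookup of an existing object
value <- slow_task()
elapsed_after <- system.time(value)[["elapsed"]]

# Expression passed unevaluated: system.time() times the computation itself
elapsed_inside <- system.time(slow_task())[["elapsed"]]

elapsed_after < elapsed_inside
```

One way to time a pipeline, then, is to wrap the whole pipeline expression in system.time() rather than piping into it.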

In some other cases, the function is not very friendly to pipeline operation, that is, it does not take the data you transform through a pipeline as the first argument. One example is the linear model function lm(). This function takes the formula first and then the data.

If you directly run

mtcars %>>%
  lm(mpg ~ cyl + wt)
# Error in as.data.frame.default(data): cannot coerce class ""formula"" to a data.frame

it will fail because %>>% evaluates lm(mtcars, mpg ~ cyl + wt), which is not what the function expects. There are two ways to build a pipeline with this kind of function.

First, use a named argument to specify the formula.

mtcars %>>%
  lm(formula = mpg ~ cyl + wt)
# 
# Call:
# lm(formula = mpg ~ cyl + wt, data = .)
# 
# Coefficients:
# (Intercept)          cyl           wt  
#      39.686       -1.508       -3.191

This works because it is actually evaluated as

lm(mtcars, formula = mpg ~ cyl + wt)

and R's argument matching rules decide that, since formula (the first argument in lm()'s definition) is specified by name, the unnamed argument mtcars fills the next unmatched argument, data, which is exactly what we want. Therefore, it works fine here.
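
This matching can be inspected with base R's match.call(), which reports how each supplied argument is bound to the function's formal arguments. The sketch below is only an illustration; it also checks that both call styles give the same coefficients.

```r
# Ask R how the arguments of the rewritten call are matched to lm()'s formals
matched <- match.call(lm, quote(lm(mtcars, formula = mpg ~ cyl + wt)))
matched$data  # the unnamed argument is bound to `data`

# Both call styles fit the same model
fit_piped_style <- lm(mtcars, formula = mpg ~ cyl + wt)
fit_usual_style <- lm(mpg ~ cyl + wt, data = mtcars)
all.equal(coef(fit_piped_style), coef(fit_usual_style))
```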

However, this trick is convenient for some functions but not for all. Suppose a function takes data as its third or fourth argument. In this case, you would have to specify all the preceding arguments by name. If the data argument follows ..., the trick would not work at all, because the unnamed value would be absorbed by ... instead.
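
Both limitations can be seen with toy functions. The names third_arg() and after_dots() below are hypothetical, defined only for this illustration.

```r
# Hypothetical function that takes data as its third argument
third_arg <- function(a, b, data) length(data)

# Every earlier argument must be named so that the unnamed value
# falls through to `data`
n_items <- third_arg(1:5, a = 0, b = 0)
n_items

# Hypothetical function whose data argument follows ...
after_dots <- function(..., data) missing(data)

# The unnamed value is absorbed by ..., so `data` is never matched
data_is_missing <- after_dots(1:5)
data_is_missing
```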

Dot piping is designed for more flexible pipeline construction. It allows you to use . to represent the left-hand side value and put it anywhere you want in the next expression. The next page demonstrates its syntax and when it might be useful.