The many flavors of apply

Ben Bolker

September 13, 2010

Licensed under the Creative Commons attribution-noncommercial license (http://creativecommons.org/licenses/by-nc/3.0/). Please share & remix noncommercially, mentioning its origin.

One of the more powerful capabilities of R is the “apply” family. These are functions whose purpose is to take an R function and some R object that represents “a set of things” and apply the function to each element in the set. You can often achieve the same results with a for loop, stepping through the elements of the set one by one, but the equivalent *apply commands are (1) more compact, making code easier to read [at least if you understand them!], (2) slightly more convenient — various bookkeeping such as figuring out the number of elements in the set and setting aside storage for the results gets done automatically, (3) more “idiomatic” in R (in case that matters to you), and (4) [sometimes] more efficient [although it is no longer always the case, as it was in early versions of S-PLUS, that for loops are much less efficient than the apply commands].

This general approach to programming (define a function, then apply it to a set of objects) is called (not too surprisingly) functional programming (http://en.wikipedia.org/wiki/Functional_programming). This style of programming started out in LISP, and is also very common in Mathematica (where it is represented by the Map function).

*applying is easiest when an existing function does what you want, but you can also define functions on the fly. For example, R doesn’t have a square() function. You could define it:

  > square <- function(x) {
     x^2
     }
  > sapply(1:5,square)
  [1]  1  4  9 16 25

but for this kind of short function you can just say

  > sapply(1:5,function(x) {x^2})
  [1]  1  4  9 16 25

(Mathematica has an even slicker way to do this.)

You can also omit the curly brackets when your function consists of a single statement. If it has more than one you can use semicolons to keep all the statements on the same line, for compactness; e.g.

  > sapply(1:5,function(x) {y <- x; y^2})
  [1]  1  4  9 16 25

(although in this case the extra statement is obviously pointless).
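As a minimal sketch of the brace-free form mentioned above (the variable name squares is only illustrative):

```r
# A single-expression anonymous function needs no curly brackets.
squares <- sapply(1:5, function(x) x^2)
print(squares)
```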

You’d also be surprised sometimes what can be used as a function:

  > sapply(1:5,"^",2)
  [1]  1  4  9 16 25

This example also represents a powerful and sometimes overlooked feature of *apply: extra arguments get passed through to the function you are applying. This is particularly handy when you want to apply the function to a vector but use the vector as something other than the first argument to the function. For example, suppose we wanted to run a linear regression on a series of different data sets. Rather than

  > datlist = list(dat1,dat2,dat3)
  > lapply(datlist, function(d) lm(y~x,data=d))

we could just say

  > datlist = list(dat1,dat2,dat3)
  > lapply(datlist, lm, formula=y~x)

R will fill in the formula argument and then use the elements of datlist for the next unfilled argument, which in this case is data.
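To make the comparison concrete, here is a small self-contained sketch. The data are simulated, and the helper make_dat and the column names x and y are assumptions made up for the illustration; they are not part of the original example.

```r
# Simulated data sets; make_dat() and the columns x, y are
# hypothetical names invented for this sketch.
set.seed(101)
make_dat <- function(n = 20) {
  x <- runif(n)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 0.1))
}
datlist <- list(dat1 = make_dat(), dat2 = make_dat(), dat3 = make_dat())

# Explicit anonymous function: each element is passed as `data`.
fits1 <- lapply(datlist, function(d) lm(y ~ x, data = d))

# Pass-through form: `formula` is matched by name, so each element
# of datlist lands in the next unmatched argument of lm(), i.e. `data`.
fits2 <- lapply(datlist, lm, formula = y ~ x)

# Collect the coefficients from both approaches for comparison.
coefs1 <- sapply(fits1, coef)
coefs2 <- sapply(fits2, coef)
print(coefs1)
```

Both forms should give the same fitted coefficients; the second simply saves writing the wrapper function.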

Note that applying can also be overdone: See section 4 of Patrick Burns’ “R Inferno” (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) (which is a pleasure to read in general).

Reproduced and slightly extended from that reference:

function             input                  output                     comment
-------------------  ---------------------  -------------------------  ---------
apply                matrix or array        vector or array or list
lapply               list or vector         list
sapply               list or vector         vector or matrix or list   simplify
tapply               data, categories       array or list              ragged
mapply               lists and/or vectors   vector or matrix or list   multiple
rapply               list                   vector or list             recursive
eapply               environment            list
dendrapply           dendrogram             dendrogram
zoo::rollapply       data                   similar to input
emdbook::apply2d     two vectors            matrix
multicore::mclapply  same as lapply         same as lapply             parallelize across cores (OK on Unix, experimental for Windows (pre-Vista only): see http://rforge.net/multicore)
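tapply does not get its own section below, so here is a brief sketch of what "ragged" means in the table: the groups may have different sizes. The data values are made up for the illustration.

```r
# Made-up data: values grouped by a categorical factor with
# unequal group sizes (a "ragged" array).
values <- c(1, 2, 3, 10, 20, 100)
groups <- factor(c("a", "a", "a", "b", "b", "c"))

# Apply mean() to the values within each category.
group_means <- tapply(values, groups, mean)
print(group_means)
```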

kernapply has the same pattern, but I don’t think it is really in the *apply family.

Also: simFrame::simApply, functions in Rmpi (mpi.parapply, mpi.iapply, mpi.apply), gridR::apply, RMySQL::dbApply, RPostgreSQL::dbApply, PerformanceAnalytics::apply.rolling, ff::ffapply, xts::{period.apply,apply.monthly}, etc. (these are the results of sos::findFn("apply")), as well as nlme::gapply.

1 apply

Apply fun to the “margins” of a matrix or array. “Margin” here means row, column, or other “slices” of a higher-dimensional array. The MARGIN argument is 1 for rows, 2 for columns, and n for another dimension of a higher-dimensional array. You can give more than one margin:

  > m = matrix(1:4,byrow=TRUE,ncol=2)
  > apply(m,c(1,2),function(x) x^2)

       [,1] [,2]
  [1,]    1    4
  [2,]    9   16

Of course, in this case we don’t do any better than just saying m^2. apply becomes more useful when you apply over more than one, but not all, of the dimensions of an array with more than two dimensions.
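A minimal sketch of applying over two of the three dimensions of an array (the array contents are arbitrary illustrative values):

```r
# A 2 x 3 x 4 array filled with 1..24 (illustrative data).
a <- array(1:24, dim = c(2, 3, 4))

# Sum over the third dimension, keeping dimensions 1 and 2:
# the function is called once per (row, column) pair,
# each time on a slice of length 4.
s <- apply(a, c(1, 2), sum)
print(dim(s))
```

The result has one entry per combination of the margins that were kept, so here it is a 2 x 3 matrix.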

colSums, rowSums, colMeans, rowMeans are special cases that are considerably faster than the equivalent apply commands. (I think there’s an equivalent for the median somewhere in a Bioconductor package.)

2 lapply

Apply a function to each element of a list (or vector); the result is always a list.
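A short sketch, with made-up list contents, showing that the result is a list with one element per input element:

```r
# A list whose elements have different lengths (made-up values).
x <- list(a = 1:3, b = c(2.5, 4), c = 10)

# lapply returns a list, with one result per element
# and the names of the input preserved.
res <- lapply(x, mean)
str(res)
```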

3 sapply

Apply a function to a list or a vector, and simplify the results (to a vector or matrix) if possible. (lapply also accepts a vector directly, coercing it with as.list, so the difference between the two is the simplification step.)
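The three possible shapes of the result can be sketched as follows (the functions applied are arbitrary examples):

```r
# Length-1 results are simplified to a vector ...
v <- sapply(1:3, function(i) i^2)

# ... equal-length results (longer than 1) to a matrix,
# with one column per input element ...
m <- sapply(1:3, function(i) c(i, i^2))

# ... and results of differing lengths are left as a list.
l <- sapply(1:3, function(i) seq_len(i))

str(v); str(m); str(l)
```

Because the shape of the result depends on what the function returns, code that needs a predictable type may be safer with lapply.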

4 mapply

Apply a function of multiple arguments to multiple lists. I sometimes use this as a shortcut where I should probably just give up and use a for loop.

  > mapply(function(dat,i) {
     plot(dat$x,dat$y,col=i)
     ## label with the name of the current list element
     ## (assumes datlist has names)
     text(1,2,names(datlist)[i])
     },
     datlist,seq_along(datlist))

It would be great to have a way within an *apply function to access the current value of the index (or the name of the current element), but I don’t know of one; passing the index as an extra argument, as above, is a workaround.

Additional arguments that are the same for every call have to be specified explicitly with MoreArgs. Depending on what you’re doing, you may want SIMPLIFY to be TRUE or FALSE.
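A minimal sketch of MoreArgs and SIMPLIFY; the helper pow_sum and its power argument are hypothetical names invented for the illustration.

```r
# rep() is called as rep(1, 3), rep(2, 2), rep(3, 1).
# The results differ in length, so mapply returns a list.
r1 <- mapply(rep, 1:3, 3:1)

# MoreArgs supplies arguments that are the same for every call
# (here `power`, an argument of the made-up helper pow_sum).
pow_sum <- function(x, y, power) (x + y)^power
r2 <- mapply(pow_sum, 1:3, 4:6, MoreArgs = list(power = 2))

# SIMPLIFY = FALSE keeps the result as a list instead of a vector.
r3 <- mapply(pow_sum, 1:3, 4:6, MoreArgs = list(power = 2),
             SIMPLIFY = FALSE)
```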

Related functions

function      purpose
------------  ---------
do.call       apply a function to a list of arguments
replicate     repeat an expression many times
outer         apply a function to all combinations of two vectors (function must be vectorized; otherwise see emdbook::apply2d)
Map           a wrapper for mapply that does not simplify its result: see ?funprog
Reduce        apply a function to successively combine elements
cumsum        (and cummax, cummin, cumprod): cumulative functions
plyr::ddply   (and friends) split an object, apply a function to chunks, then recombine the chunks (split/tapply/rbind on steroids)
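A few of the base R functions listed above can be sketched briefly (all input values are arbitrary examples):

```r
# do.call: call a function with its arguments supplied as a list.
rows <- list(1:3, 4:6)
m <- do.call(rbind, rows)            # same as rbind(1:3, 4:6)

# Reduce: combine elements successively, ((1 + 2) + 3) + 4.
total <- Reduce(`+`, 1:4)
running <- Reduce(`+`, 1:4, accumulate = TRUE)  # running totals

# outer: apply a vectorized function to all combinations of two vectors.
tab <- outer(1:3, 1:2, function(x, y) x * 10 + y)

# Map: like mapply, but the result is always a list.
pairs <- Map(function(x, y) x + y, 1:2, 3:4)
```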

For the truly clever: why does this work?

  > N <- 0; replicate(20,N <<- N-round(0.25*N)+10)

   [1] 10 18 24 28 31 33 35 36 37 38 38 38 38 38 38 38 38 38 38 38