dplyr, the foundational tidyverse package, makes a trade-off between being easy to code in interactively at the expense of being more difficult to create functions with. The source of the trade-off is in how dplyr
evaluates column names (specifically, allowing for unquoted column names as argument inputs). Tidy evaluation has been under major development the last couple of years in order to make programming with dplyr easier.
During this development, there have been a variety of proposed methods for programming with dplyr
. In this post, I will document the current ‘best-practices’ with dplyr
1.0.0. In the Older approaches section I provide analogous examples that someone (i.e. myself) might have used during this maturation period.
For a more full discussion on this topic see dplyr
’s documentation at programming with dplyr and the various links referenced there.
Function expecting one column
library(tidyverse)
Pretend we want to create a function that calculates the sum of a given variable in a dataframe:
sum_var <- function(df, var){
summarise(df, {{var}} := sum({{var}}))
}
To run this function:
sum_vars(mpg, cty)
If you wanted to edit the variable in place and avoid using the special assignment operator :=
, you could use the new (in dplyr
1.0.0) across()
function.
sum_vars <- function(df, vars){
summarise(df, across({{vars}}, sum))
}
Functions allowing multiple columns
Using the across()
approach also allows you to input more than one variable, e.g. a user could call the following to get summaries on both cty
and hwy
.
sum_vars(mpg, c(cty, hwy))
If you wanted to compute multiple column summaries with different functions and you wanted to glue the function name onto your outputted column names1, you could instead pass a named list of functions into the .fns
argument of across()
.
sum_vars <- function(df, vars){
summarise(df, across({{vars}}, list(sum = sum, mean = mean)))
}
You might want to create a function that can take in multiple sets of columns, e.g. the function below allows you to group_by()
one set of variables and summarise()
another set:
sum_group_vars <- function(df, group_vars, sum_vars){
df %>%
group_by(across({{group_vars}})) %>%
summarise(across({{sum_vars}}, list(sum = sum, mean = mean)))
}
How a user would run sum_group_vars()
:
sum_group_vars(mpg,
c(model, year),
c(hwy, cty))
If you’re feeling fancy, you could also make the input to .fns
an argument to sum_group_vars()
2.
Older approaches
Generally, I find the new across()
approaches introduced in dplyr
1.0.0 are easier and more consistent to use than the methods that preceded them. However the methods in this section still work and are supported. They are just no longer the ‘recommended’ or most ‘modern’ approach available for creating functions that pass column names into dplyr
verbs.
Prior to the introduction of the bracket-bracket, {{}}
, you would have used the enquo()
+ bang-bang approach3. The function below is equivalent to the sum_var()
function shown at the start of this post.
sum_var <- function(df, var){
var_quo <- enquo(var)
summarise(df, !!var_quo := sum(!!var_quo))
}
To modify variables in-place you would have used the *_at()
, *_if()
or *_all()
function variants (which are now superseded by across()
).
sum_vars <- function(df, vars){
summarise_at(df, {{vars}}, sum)
}
Similar to using across()
this method allows multiple variables being input. However what is weird about this function is that it requires the user wrapping the variable names in vars()
4. Hence to use the previously created function, a user would run:
sum_vars(mpg, vars(hwy, cty))
Alternatively, you could have the variable name inputs be character vectors by modifying the function like so:
sum_var <- function(df, vars){
summarise_at(df, vars(one_of(vars)), sum)
}
Which could be called by a user as:
sum_var(mpg, c("hwy", "cty"))
These *_at()
variants also support inputting a list of functions, e.g. the below function would output both the sums and means.
sum_var <- function(df, var){
summarise_at(df, vars(one_of(var)), list(sum = sum, mean = mean))
}
For multiple grouping variables and multiple variables to be summarised you could create:
groupsum <- function(df, group_vars, sum_vars){
df %>%
group_by_at(vars(one_of(group_vars))) %>%
summarise_at(vars(one_of(sum_vars)), list(sum = sum, mean = mean))
}
Which would be called by a user:
sum_var(mpg,
c("model", "year"),
c("hwy", "cty"))
There are a variety of similar spins you might take on handling tidy evaluation when creating these or similar types of functions.
One other older approach perhaps worth mentioning (presented here) is “passing the dots”. Here is an example for if we want to group_by()
multiple columns and then summarise()
on just one column:
sum_group_var <- function(df, sum_var, ...){
df %>%
group_by(...) %>%
summarise({{sum_var}} := sum({{sum_var}}))
}
The limitation with this approach is that only one set of your inputs can have more than one variable in it, i.e. wherever you pass the ...
in your function.
Appendix
Image shared on social media was created using xaringan
and flair
. See dplyr-1.0.0-example for details.
dplyr
1.0.0 also now has support for using the glue package syntax for modifying variable names.↩︎Doing this doesn’t require any tidy evaluation knowledge↩︎
There is also the
rlang::enquos()
and!!!
operator for when the input has length greater than one.↩︎A niche function specific to tidy evaluation (which users might not think of).↩︎