I often find myself needing to apply a small number of rule-based transformations to dataframes based on certain conditions, typically a fixed number of fields having certain values. The transformations can modify any number of columns, usually one to three. The number of rows involved in these transformations is small compared to the total number of rows in the data frames. Currently I am using ddply
but the performance is lacking because ddply
modifies all rows.
I am looking for a way to solve this problem in an elegant, generic manner taking advantage of the fact that only a small number of rows needs to be changed. Below is a simplified example of the types of transformations I'm dealing with.
df <- data.frame(Product=gl(4,10,labels=c("A","B", "C", "D")),
Year=sort(rep(2002:2011,4)),
Quarter=rep(c("Q1","Q2", "Q3", "Q4"), 10),
Sales=1:40)
> head(df)
Product Year Quarter Sales
1 A 2002 Q1 1
2 A 2002 Q2 2
3 A 2002 Q3 3
4 A 2002 Q4 4
5 A 2003 Q1 5
6 A 2003 Q2 6
>
transformations <- function(df) {
if (df$Year == 2002 && df$Product == 'A') {
df$Sales <- df$Sales + 3
} else if (df$Year == 2009 && df$Product == 'C') {
df$Sales <- df$Sales - 10
df$Product <- 'E'
}
df
}
library(plyr)
df <- ddply(df, .(Product, Year), transformations)
> head(df)
Product Year Quarter Sales
1 A 2002 Q1 4
2 A 2002 Q2 5
3 A 2002 Q3 6
4 A 2002 Q4 7
5 A 2003 Q1 5
6 A 2003 Q2 6
Insted of hard-coded conditionals I'm using a pairlist of conditional and transformation functions, e.g., the code below, but that's not a meaningful improvement.
transformation_rules <- list(
list(
condition = function(df) df$Year == 2002 && df$Product == 'A',
transformation = function(df) {
df$Sales <- df$Sales + 3
df
}
)
)
What are some better ways to approach this problem?