1

I often find myself needing to apply a small number of rule-based transformations to dataframes based on certain conditions, typically a fixed number of fields having certain values. The transformations can modify any number of columns, usually one to three. The number of rows involved in these transformations is small compared to the total number of rows in the data frames. Currently I am using ddply but the performance is lacking because ddply modifies all rows.

I am looking for a way to solve this problem in an elegant, generic manner taking advantage of the fact that only a small number of rows needs to be changed. Below is a simplified example of the types of transformations I'm dealing with.

df <- data.frame(Product=gl(4,10,labels=c("A","B", "C", "D")), 
                 Year=sort(rep(2002:2011,4)), 
                 Quarter=rep(c("Q1","Q2", "Q3", "Q4"), 10), 
                 Sales=1:40)           
> head(df)
  Product Year Quarter Sales
1       A 2002      Q1     1
2       A 2002      Q2     2
3       A 2002      Q3     3
4       A 2002      Q4     4
5       A 2003      Q1     5
6       A 2003      Q2     6
> 
transformations <- function(df) {
    if (df$Year == 2002 && df$Product == 'A') {
        df$Sales <- df$Sales + 3
    } else if (df$Year == 2009 && df$Product == 'C') {
        df$Sales <- df$Sales - 10
        df$Product <- 'E'
    }
    df
}

library(plyr)
df <- ddply(df, .(Product, Year), transformations)

> head(df)
  Product Year Quarter Sales
1       A 2002      Q1     4
2       A 2002      Q2     5
3       A 2002      Q3     6
4       A 2002      Q4     7
5       A 2003      Q1     5
6       A 2003      Q2     6

Insted of hard-coded conditionals I'm using a pairlist of conditional and transformation functions, e.g., the code below, but that's not a meaningful improvement.

transformation_rules <- list(
  list(
    condition = function(df) df$Year == 2002 && df$Product == 'A',
    transformation = function(df) {
      df$Sales <- df$Sales + 3
      df
    }
  )
)

What are some better ways to approach this problem?

Sim
  • 13,147
  • 9
  • 66
  • 95

1 Answers1

2

I don't think you need to use plyr for this problem at all. I think you can simply use ifelse() and take advantage of the fact that R is vectorized and get the same result.

Since your function directly modifies the Sales column, I made a copy of it like so before running plyr: df2 <- df. I also had my example make a new column Sales2 instead of overwriting the Sales column.

And then rewrote your function like this:

df2$Sales2 <- with(df2, ifelse(Year == 2002 & Product == "A", Sales + 3,
                        ifelse(Year == 2009 & Product == "C", Sales - 10, Sales)))

And testing whether or not the output is equal:

> all.equal(df$Sales, df2$Sales2)
[1] TRUE

And comparing the system timing between the two shows that the vectorized version avoiding ddply is much faster:

> system.time(df <- ddply(df, .(Product, Year), transformations))
   user  system elapsed 
  0.012   0.000   0.012 
> system.time(df2$Sales2 <- with(df2, ifelse(Year == 2002 & Product == "A", Sales + 3,
+                         ifelse(Year == 2009 & Product == "C", Sales - 10, Sales))))
   user  system elapsed 
      0       0       0 

So, unless I'm missing something - you can avoid plyr all together here and get some nice speed improvements. If ifelse() proves too slow, you could write up some boolean functions to go even faster, though I doubt that will be necessary.

Chase
  • 67,710
  • 18
  • 144
  • 161
  • Chase, the example that modifies the Sales column was just that--an example. In actuality, I need to modify several columns at times. I have updated the question to reflect this. Would you suggest duplicating the ifelse() conditionals? – Sim Jul 04 '12 at 14:58