Efficient method to filter and add based on certain conditions (3 conditions in this case)

Question

I have a data frame which looks like this

     a    b    c   d
     1    1    1   0
     1    1    1   200
     1    1    1   300
     1    1    2   0
     1    1    2   600
     1    2    3   0
     1    2    3   100
     1    2    3   200
     1    3    1   0

I would like to get a data frame which looks like this

     a    b    c   d
     1    1    1   250
     1    1    2   600
     1    2    3   150
     1    3    1   0

I am currently doing it {

  n=nrow(subset(Wallmart, a==i &    b==j & c==k  ))
  sum=subset(Wallmart, a==i &    b==j & c==k  )
  #sum
  sum1=append(sum1,sum(sum$d)/(n-1))

}

I would like to add up the 'd' column and take the average, dividing by the number of rows without counting the zeros. For example, the first row is (200+300)/2 = 250. Currently I am building a list that stores the 'd' column, but ideally I want it in the format above. For example, the first row would look like

     a    b    c   d
     1    1    1   250

This is a very inefficient way to do this work. The code takes a long time to run in a loop, so any help that makes it run faster is appreciated. The original data frame has about a million rows.

I don't see a loop. There seems to be something missing from your question. Anyway, never use `append` in a loop. — Roland, Apr 27 '14 at 08:53
Sorry, I edited the question, now it should be easy to understand. Thank You. — user2575429, Apr 27 '14 at 09:16

score 6 · Accepted Answer · edited Sep 27 '16 at 12:45

You may try aggregate:

aggregate(d ~ a + b + c, data = df, sum)
#   a b c   d
# 1 1 1 1 500
# 2 1 3 1   0
# 3 1 1 2 600
# 4 1 2 3 300

As noted by @Roland, for bigger data sets, you may try data.table or dplyr instead, e.g.:

library(dplyr)
df %>%
  group_by(a, b, c) %>%
  summarise(
    sum_d = sum(d))

# Source: local data frame [4 x 4]
# Groups: a, b
# 
#   a b c sum_d
# 1 1 1 1   500
# 2 1 1 2   600
# 3 1 2 3   300
# 4 1 3 1     0

Edit following updated question. If you want to calculate group-wise mean, excluding rows that are zero, you may try this:

aggregate(d ~ a + b + c, data = df, function(x) mean(x[x > 0]))
#   a b c   d
# 1 1 1 1 250
# 2 1 3 1 NaN
# 3 1 1 2 600
# 4 1 2 3 150

df %>%
  filter(d != 0) %>%
  group_by(a, b, c) %>%
  summarise(
    mean_d = mean(d))

#   a b c mean_d
# 1 1 1 1    250
# 2 1 1 2    600
# 3 1 2 3    150

However, because it seems that you wish to treat your zeros as missing values rather than numeric zeros, I think it would be better to convert them to NA when preparing your data set, before the calculations.

df$d[df$d == 0] <- NA
df %>%
  group_by(a, b, c) %>%
  summarise(
    mean_d = mean(d, na.rm = TRUE))

#   a b c mean_d
# 1 1 1 1    250
# 2 1 1 2    600
# 3 1 2 3    150
# 4 1 3 1    NaN
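If you prefer to stay in base R after converting the zeros to NA, note that the formula interface of aggregate drops rows with NA by default (na.action = na.omit), so the all-zero group would disappear. A minimal sketch, assuming the sample data from the question, that keeps the group by passing na.action = na.pass:

```r
# Sample data from the question
df <- data.frame(
  a = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
  b = c(1, 1, 1, 1, 1, 2, 2, 2, 3),
  c = c(1, 1, 1, 2, 2, 3, 3, 3, 1),
  d = c(0, 200, 300, 0, 600, 0, 100, 200, 0)
)

# Treat zeros as missing values
df$d[df$d == 0] <- NA

# Default na.action (na.omit) removes the NA rows before grouping,
# so the group a = 1, b = 3, c = 1 is lost
dropped <- aggregate(d ~ a + b + c, data = df, FUN = mean)

# na.pass keeps the NA rows; na.rm = TRUE then ignores them inside each group
kept <- aggregate(d ~ a + b + c, data = df,
                  FUN = function(x) mean(x, na.rm = TRUE),
                  na.action = na.pass)

print(dropped)
print(kept)
```

The all-NA group comes back as NaN in the second result, the same as in the dplyr version above.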

+1 But for a million observations data.table or dplyr might be preferable. — Roland, Apr 27 '14 at 09:01
@Roland, Thanks for your comment! I added a `dplyr` alternative. — Henrik, Apr 27 '14 at 09:06

David Arenburg · Answer 2 · 2014-06-10T08:50:17.603

3

This is the data.table solution per your last edit.

library(data.table)
DT <- setDT(df)[, if (any(d > 0)) mean(d[d > 0]) else 0, by = c("a","b","c")]
DT
#    a b c  V1
# 1: 1 1 1 250
# 2: 1 1 2 600
# 3: 1 2 3 150
# 4: 1 3 1   0

Edit #2:

@Arun's suggestion to speed it up:

setDT(df)[, mean(d[d > 0]), by = c("a","b","c")][is.nan(V1), V1 := 0]

Edit #3:

@eddi's suggestion:

setDT(df)[, sum(d) / pmax(1, sum(d > 0)), by = list(a, b, c)]
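To see why this expression gives the same result, here is a small base R sketch of the per-group arithmetic. The helper name nonzero_mean is only for illustration and is not part of the suggestion. It assumes d is never negative: zeros add nothing to the sum, and pmax(1, ...) keeps the denominator at 1 when a group has no positive values, so such a group yields 0 instead of NaN.

```r
# Hypothetical helper wrapping the expression used in j above
nonzero_mean <- function(d) sum(d) / pmax(1, sum(d > 0))

g1 <- nonzero_mean(c(0, 200, 300))  # 500 / 2
g2 <- nonzero_mean(c(0, 600))       # 600 / 1
g3 <- nonzero_mean(c(0, 100, 200))  # 300 / 2
g4 <- nonzero_mean(0)               # 0 / pmax(1, 0) = 0 / 1

print(c(g1, g2, g3, g4))

# With negative values the shortcut no longer matches mean(d[d > 0]):
# the negative value stays in the numerator but is not counted
neg <- c(-100, 200)
print(nonzero_mean(neg))   # 100 / 1
print(mean(neg[neg > 0]))  # 200
```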

edited Jun 10 '14 at 08:50

answered Apr 27 '14 at 09:50


Thank You David for suggesting the alternative method. NaN is not an issue I will fix it. – user2575429 Apr 27 '14 at 10:56

this is a little faster: `setDT(df)[, sum(d) / pmax(1, sum(d > 0)), by = list(a, b, c)]` – eddi Apr 28 '14 at 17:46

Shambho · Answer 3 · 2014-04-28T19:22:44.070

2

Here is another way:

Step 1: Set up the data table:

df <- read.table(text="     a    b    c   d
     1    1    1   0
     1    1    1   200
     1    1    1   300
     1    1    2   0
     1    1    2   600
     1    2    3   0
     1    2    3   100
     1    2    3   200
     1    3    1   0",header=T)
library(data.table)
setDT(df)
setkey(df,a,b,c)

Step 2: Do the computation:

df[,sum(d)/ifelse((cnt=length(which(d>0)))>0,cnt,1),by=key(df)]

Note that looping is not recommended here. The best strategy is to vectorize the solution, as in the example above.

Step 3: Let's test the timing:

> dt<-df
> for(i in 1:20) dt <- rbind(dt,dt)
> dim(dt)
[1] 9437184       4
> setkey(dt,a,b,c)
> dt[,sum(d)/ifelse((cnt=length(which(d>0)))>0,cnt,1),by=key(dt)]
   a b c  V1
1: 1 1 1 250
2: 1 1 2 600
3: 1 2 3 150
4: 1 3 1   0
> system.time(dt[,sum(d)/ifelse((cnt=length(which(d>0)))>0,cnt,1),by=key(dt)])
   user  system elapsed 
  0.495   0.090   0.609

So the computation for nearly 10M records is performed in about 0.6 sec elapsed (about 0.5 sec of user time)!
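The timing above leaves out the setkey call. A rough sketch of a measurement that counts keying and grouping together might look like the following. It assumes data.table is installed, uses a smaller table (15 doublings instead of 20) to keep it quick, and swaps in the simpler sum(d) / pmax(1, sum(d > 0)) expression; the column name mean_d is my own choice. Actual timings will depend on the machine, so none are shown.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt <- data.table(
  a = rep(1, 9),
  b = c(1, 1, 1, 1, 1, 2, 2, 2, 3),
  c = c(1, 1, 1, 2, 2, 3, 3, 3, 1),
  d = c(0, 200, 300, 0, 600, 0, 100, 200, 0)
)

# Double the table 15 times: 9 * 2^15 = 294,912 rows
for (i in 1:15) dt <- rbind(dt, dt)

# Time the keying and the grouped computation together
timing <- system.time({
  setkey(dt, a, b, c)
  res <- dt[, .(mean_d = sum(d) / pmax(1, sum(d > 0))), by = key(dt)]
})

print(timing)
print(res)
```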

Hope this helps!!

edited Apr 28 '14 at 19:22

answered Apr 28 '14 at 18:58


two comments - it's not fair to set the key and then leave that out of your timing (not a huge deal, since setting the key doesn't change the speed by too much, but still), and see my comment in the other `data.table` answer for a simpler way of doing what you did – eddi Apr 28 '14 at 22:14
Thanks @eddi. On the first point: I was trying to illustrate the speed of execution, and as setting the key didn't take much time, so I didn't include it. However, I noticed an important thing here, the 20 fold `rbind` on `dt` runs much faster in comparison to the 20 fold `rbind` on `df`. Any comments on that?? Second point is very well taken and really appreciated! – Shambho Apr 29 '14 at 01:38
Not sure what to comment except that the `data.table` `rbind` is just better :) It uses `rbindlist` internally, which is really fast. – eddi Apr 29 '14 at 18:15
