
I am computing a dplyr::summarize across a dataframe of sales data. I group by (S,D,Y), then within each group compute medians and means of X (sales) for weeks 5..43, then merge those summaries back into the parent df. X is never NA (i.e. there are no explicit NAs anywhere in df), but if there is no data (as in, no sales) for a given (S,D,Y) and set of weeks, there is simply no row with those values in df (take that to mean zero sales for that particular combination of parameters). In other words, I want to impute X=0 for any structurally missing rows, without having to melt/cast the original df, which would bloat it (something akin to cast(fill=..., add.missing=T) or caret::preProcess()).

Two questions about my code idiom:

  1. Is it better to subset within each summarize computation than to use dplyr::filter? filter physically drops rows, so I have to assign the result to df.tmp and then left-join it back to the original df (as below). On the other hand, big subsetting expressions repeated on every single line of the summarize computations make the code harder to read. Should I worry (or not) about caching the rows or logical indices of the subsetting operation, in the general case where I might be computing, say, n=20 new summary variables?

  2. Not all combinations of S,D,Y-groups and the filtered weeks have rows, so how do I get summarize to replace NA on any missing groups? Currently I do it as below.

Sorry, both the code and dataset are proprietary, but here's the code idiom; below it is code you should run first to generate sample data:

# Compute median, mean of X across wks 5..43, for that set of S,D,Y-values
# Issue a) filter() or repeatedly use subset() within each calculation?
df.tmp <- df %.% group_by(S,D,Y) %.% filter(Week>=5 & Week<=43) %.%
  summarize(ysd_med543_X  = median(X),
            ysd_mean543_X = mean(X)
           ) %.% ungroup()

# Issue b) how to replace NAs in groups where the group_by-and-filter gave empty output?
# can you merge this code with the summarize above?
df <- left_join(df, df.tmp, copy=F)
newcols <- match(c('ysd_mean543_X','ysd_med543_X'), names(df))
df[!complete.cases(df[,newcols]), newcols] <- c(0.0,0.0)
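(For reference, here is the same idiom as a self-contained sketch on toy stand-in data, using the current `%>%` operator in place of the old `%.%`, and a direct `is.na()` back-fill instead of the `complete.cases` step; the column names follow the question, the data values are made up.)

```r
library(dplyr)

# Toy stand-in for the proprietary df (same column names as above)
df <- data.frame(S = c(10, 11), D = c(20, 21), Y = c(2005, 2005),
                 Week = c(6, 50), X = c(100, 300))

# Summaries only exist for groups that have rows in weeks 5..43
df.tmp <- df %>%
  filter(Week >= 5 & Week <= 43) %>%
  group_by(S, D, Y) %>%
  summarise(ysd_med543_X  = median(X),
            ysd_mean543_X = mean(X), .groups = "drop")

df <- left_join(df, df.tmp, by = c("S", "D", "Y"))

# Back-fill the structurally missing groups with zero
df$ysd_med543_X[is.na(df$ysd_med543_X)]   <- 0
df$ysd_mean543_X[is.na(df$ysd_mean543_X)] <- 0
```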

and run this first to generate sample-data:

set.seed(1234)

rep_vector <- function(vv, n) {
  unlist(as.vector(lapply(vv, function(...) {rep(...,n)} )))
}

n=7
m=3
df = data.frame(S = rep_vector(10:12, n), D = 20:26,
                Y = rep_vector(2005:2007, n),
                Week = round(52*runif(m*n)),
                X = 4e4*runif(m*n) + 1e4 )

# Now drop some rows, to model structurally missing rows
I <- sort(sample(1:nrow(df),0.6*nrow(df)))
df = df[I,]

require(dplyr)
smci
  • "You're not going to get anything more". Sorry, but it shouldn't be hard to cook-up some random data. – Arun May 03 '14 at 00:02
  • @Arun: due to popular demand above is sample data... not that it was strictly needed... now may I expect some answers? – smci May 03 '14 at 02:59
  • I guess the root-cause is [dplyr summarise: Equivalent of “.drop=FALSE” to keep groups with zero length in output](http://stackoverflow.com/questions/22523131/dplyr-summarise-equivalent-of-drop-false-to-keep-groups-with-zero-length-in) ; https://github.com/hadley/dplyr/issues/341 . But I need a workaround, today. – smci May 03 '14 at 03:11
  • To the downvoters: if you really want to be annoying, the code is from a live data-science competition whose rules prevent sharing code. To be clear, I'm not sharing any code here, because this snippet is very heavily sanitized. This question is only about the general dplyr idiom. It's really aggravating me to have two people downvote, but can't be bothered doing anything useful once I do provide sample data. – smci May 03 '14 at 08:06

1 Answer


I don't think this has anything to do with the feature you've linked in the comments (because, IIUC, that feature concerns unused factor levels). Once you filter your data, IMO summarise should not (or rather, can't?) include those groups in the results (with the exception of factors). You should clarify this with the developers on their project page.


I'm by no means a dplyr expert, but I think, firstly, it'd be better to filter first and then do group_by + summarise. Otherwise you'll be filtering within each group, which is unnecessary. That is:

df.tmp <- df %.% filter(Week>=5 & Week<=43) %.% group_by(S,D,Y) %.% ...

This is just so that you're aware of it for any future cases.
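On a small toy data frame (not the OP's data; the column names follow the question, and the current `%>%` operator stands in for the old `%.%`), the reordered pipeline looks like this:

```r
library(dplyr)

# Toy stand-in data, same column names as in the question
toy <- data.frame(S = c(10, 10, 11), D = c(20, 20, 21),
                  Y = c(2005, 2005, 2005),
                  Week = c(6, 50, 10),
                  X = c(100, 900, 300))

# Filter once up front, then group and summarise;
# rows with Week outside 5..43 simply vanish from the result
res <- toy %>%
  filter(Week >= 5 & Week <= 43) %>%
  group_by(S, D, Y) %>%
  summarise(md_X = median(X), mn_X = mean(X), .groups = "drop")
```

Note that any group whose rows all fall outside the week window drops out of `res` entirely, which is exactly the structural-missing problem of issue b).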


IMO, it's better to use mutate here instead of summarise, as it'll remove the need for left_join, IIUC. That is:

df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
             md_X = median(X[Week >= 5 & Week <= 43]),
             mn_X = mean(X[Week >= 5 & Week <= 43]))

Here we still have the issue of replacing the NA/NaN. There's no easy/direct way to sub-assign here, so you'll have to use ifelse, once again IIUC. It would be a little nicer if mutate supported multi-statement expressions.

What I've in mind is something like:

df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
              { tmp = Week >= 5 & Week <= 43;
                md_X = ifelse(length(tmp), median(X[tmp]), 0), 
                md_Y = ifelse(length(tmp), mean(X[tmp]), 0)
              })   

So we'll probably have to work around it in this manner:

df.tmp = df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43)
df.tmp %.% mutate(md_X = ifelse(tmp[1L], median(X), 0), 
                  mn_X = ifelse(tmp[1L], mean(X), 0))

Or to put things together:

df %.% group_by(S,D,Y) %.% mutate(tmp = Week >= 5 & Week <= 43,
       md_X = ifelse(tmp[1L], median(X), 0),
       mn_X = ifelse(tmp[1L], mean(X), 0))

#     S  D    Y Week        X   tmp     md_X     mn_X
# 1  10 20 2005    6 22107.73  TRUE 22107.73 22107.73
# 2  10 23 2005   32 18751.98  TRUE 18751.98 18751.98
# 3  10 25 2005   33 31027.90  TRUE 31027.90 31027.90
# 4  10 26 2005    0 46586.33 FALSE     0.00     0.00
# 5  11 20 2006   12 43253.80  TRUE 43253.80 43253.80
# 6  11 22 2006   27 28243.66  TRUE 28243.66 28243.66
# 7  11 23 2006   36 20607.47  TRUE 20607.47 20607.47
# 8  11 24 2006   28 22186.89  TRUE 22186.89 22186.89
# 9  11 25 2006   15 30292.27  TRUE 30292.27 30292.27
# 10 12 20 2007   15 40386.83  TRUE 40386.83 40386.83
# 11 12 21 2007   44 18049.92 FALSE     0.00     0.00
# 12 12 26 2007   16 35856.24  TRUE 35856.24 35856.24

which doesn't require df.tmp.
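One caveat (my addition, not part of the original answer): `tmp[1L]` only inspects the group's first row, which is fine when every group has a single row, as in the sample data, but not when the weeks within a group straddle the 5..43 window. A variant using `any(tmp)` and subsetting with `X[tmp]`, sketched here on toy data with the current `%>%` syntax, handles that case as well:

```r
library(dplyr)

# Toy data in which group (10,20,2005) straddles the week window
toy <- data.frame(S = c(10, 10, 11), D = c(20, 20, 21),
                  Y = c(2005, 2005, 2005),
                  Week = c(6, 50, 44),
                  X = c(100, 900, 300))

res <- toy %>%
  group_by(S, D, Y) %>%
  mutate(tmp  = Week >= 5 & Week <= 43,
         # any(tmp): does this group have ANY row inside the window?
         md_X = ifelse(any(tmp), median(X[tmp]), 0),
         mn_X = ifelse(any(tmp), mean(X[tmp]), 0)) %>%
  ungroup() %>%
  select(-tmp)   # drop the helper column, per the comments below
```

Here the straddling group gets its median/mean computed over the in-window rows only, and the all-outside group gets 0.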

HTH

Arun
  • I want dplyr not data.table, thanks. That's why its tagged dplyr. – smci May 03 '14 at 08:13
  • @smci, please *read* the entire answer before commenting. It just helps me to think in `data.table`. I've provided a `dplyr` solution. – Arun May 03 '14 at 08:13
  • I already *read* the whole answer in full. And I was typing point-by-point responses to your comments. It's really aggravating how you keep talking down to me. Can you fix your attitude? Please stop assuming incorrect things. – smci May 03 '14 at 08:15
  • You mean you're offended that I used `data.table` to get to the `dplyr` solution? Many things seem to be aggravating you. – Arun May 03 '14 at 08:18
  • I've edited out the `data.table` part. Hopefully, that'll calm you down at least a bit. Ciao. – Arun May 03 '14 at 08:23
  • Issue a) asked you to comment on the `filter` vs `subset` tradeoff ( or logical indexing as you did). I'm willing to take some performance hit on repeatedly doing `subset()` inside each `summarize(...` computation, in return for avoiding having to assign to a temporary `df.tmp` then left_join back to the parent df. The `mutate` you do with logical indexing is nice, I think we can avoid the need for `df.tmp` by creating tmp_X directly on df, then processing it, and finally dropping it with select(-tmp_X). – smci May 03 '14 at 08:26
  • Issue b) is still not solved, because this doesn't eliminate the temporary `df.tmp`, `summarize()` still blows up on empty groups: `summarize(empty_group = (length(X)==0 || length(X[Week>=5 & Week<=43,])==0), ysd_med543_X = ifelse(!empty_group, median(X[Week>=5 & Week<=43,]), 0)...` gives **`Error in summarise_impl(.data, named_dots(...), environment()) : attempt to use zero-length variable name`** – smci May 03 '14 at 08:53
  • And I believe that's the same issue as the linked dplyr issue: [dplyr summarise: Equivalent of “.drop=FALSE” to keep groups with zero length in output](http://stackoverflow.com/questions/22523131/dplyr-summarise-equivalent-of-drop-false-to-keep-groups-with-zero-length-in) ; [github.com/hadley/dplyr/issues/341](github.com/hadley/dplyr/issues/341) . Causing that `Error:... attempt to use zero-length variable name`. – smci May 03 '14 at 08:56
  • But your `mutate()` solves things by avoiding `summarize()` entirely. Thanks! – smci May 03 '14 at 09:08
  • and don't forget to `select(-tmp)` – smci May 03 '14 at 09:13
  • `md_X = median(tmp * X)` and similarly for `mean` could be used for a small reduction in code size. `md_X = tmp[1L] * median(X)` would be another approach. – G. Grothendieck May 03 '14 at 12:40
  • Guys, it's nearly perfect, but I see intermittent weirdness if I directly assign the result `df <- df %.% group_by(S,D,Y) %.% mutate(...) %.% ungroup()` Specifically, sometimes the new columns are all correct, but at other times they're all zero! How does that happen?? – smci May 04 '14 at 01:19