I'm running a ddply function and keep getting an error.

Structure of data.frame:

str(visits.by.user)
'data.frame':   80317 obs. of  5 variables:
 $ ClientID    : Factor w/ 147792 levels "50912733","50098716",..: 1 3 4 5 6 7 8 10 11 12 ...
 $ TotalVisits      : int  64 231 18 21 416 290 3 13 1 7 ...
 $ TotalDayVisits: int  8 141 0 4 240 155 0 0 0 0 ...
 $ TotalNightVisits: int  56 90 18 17 176 135 3 13 1 7 ...
 $ quintile         : Factor w/ 5 levels "0-20","20-40",..: 5 5 4 4 5 5 2 4 1 3 ...

Side note: I know how to create sample data for random numeric variables, but how do you generate a factor with 5 levels to build a representative sample?

ddply Code:

summary.users <- ddply(data = subset(visits.by.user, TotalVisits > 0), 
                          .(quintile, TotalDayVisits, TotalNightVisits), 
                          summarize,
                          NumClients = length(ClientID))

Error Message:

Error in if (empty(.data)) return(.data) : 
 missing value where TRUE/FALSE needed

I thought that maybe ddply would require the variables I'm grouping on to be factors, so I tried as.factor on the integer variables, but that didn't work.

Can anyone see where I'm going wrong?

Edit: Adding top part of dput

structure(list(ClientID = structure(c(1L, 2L, 3L, 4L, 5L, 6L), .Label = c("50912733", "60098716", "50087112", "94752212", "78217771", "12884545"), class = "factor"),TotalVisits = c(80L, 92L, 103L, 18L, 182L, 136L), TotalDayVisits = c(56L, 90L, 18L, 17L, 176L, 135L), TotalNightVisits = c(24L, 2L, 85L, 1L, 6L, 1L), quintile = structure(c(5L, 5L, 4L, 4L, 5L, 5L), .Label = c("0-20", "20-40", "40-60", "60-80", "80-100"), class = "factor")), .Names = c("ClientID", "TotalVisits", "TotalDayVisits", "TotalNightVisits", "quintile"), row.names = c(NA,6L), class = "data.frame")
mikebmassey
  • Can you update your question with the results of `dput(head(visits.by.user))`? – Maiasaura Aug 01 '12 at 21:25
  • You are trying to return the number of rows in each subset. To do this, your code should be `NumClients = nrow`. This might solve your problem. – Andrie Aug 01 '12 at 21:31
  • @Andrie No luck with that, but that's exactly what I'm trying to get. – mikebmassey Aug 01 '12 at 22:05
  • @Maiasaura Added. Hope that's enough. Thanks – mikebmassey Aug 01 '12 at 22:05
  • Your first argument is named `data=`. `ddply` takes a first argument named `.data`. If I change this, your code runs fine. However, I suspect you may also run into problems with `quintile` as a factor. If you do, you can wrap your `subset()` in `droplevels()`. – Justin Aug 01 '12 at 22:36
  • Other than the gotcha with `.data` argument, your question about the cardinality of a ddply split on multiple variables is a duplicate of [Must ddply use all possible combinations of the splitting variable(s), or only observed?](http://stackoverflow.com/questions/16363834/must-ddply-use-all-possible-combinations-of-the-splitting-variables-or-only-o) – smci Apr 01 '14 at 06:54

2 Answers

Your first argument is named data= while ddply takes a first argument named .data. If I change this, your code runs fine.
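For reference, here is a minimal sketch of the corrected call, rebuilt on the six rows from the `dput` output in the question and assuming `plyr` is installed. The arguments are spelled out as `.data`, `.variables` and `.fun` only to make the matching explicit. My reading of the original error (not checked against plyr's source) is that `data=` matches no formal argument and falls into `...`, so the `.()` list gets taken as `.data` instead.

```r
library(plyr)

# Six-row sample, same values as the dput() output in the question
visits.by.user <- data.frame(
  ClientID = factor(c("50912733", "60098716", "50087112",
                      "94752212", "78217771", "12884545")),
  TotalVisits = c(80L, 92L, 103L, 18L, 182L, 136L),
  TotalDayVisits = c(56L, 90L, 18L, 17L, 176L, 135L),
  TotalNightVisits = c(24L, 2L, 85L, 1L, 6L, 1L),
  quintile = factor(c("80-100", "80-100", "60-80", "60-80", "80-100", "80-100"),
                    levels = c("0-20", "20-40", "40-60", "60-80", "80-100"))
)

# Same call as in the question, with the first argument named .data
summary.users <- ddply(.data = subset(visits.by.user, TotalVisits > 0),
                       .variables = .(quintile, TotalDayVisits, TotalNightVisits),
                       .fun = summarize,
                       NumClients = length(ClientID))
summary.users
```

Dropping the argument name altogether and passing the data frame positionally should work just as well.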

Regarding my comment: this was a problem I thought I had run into in the past, but it seems there is an implicit call to something like droplevels within the ddply mechanics. I'd love to hear a more in-depth explanation of how it's working!

dat <- data.frame(x=1:20, z=factor(rep(letters[1:4], each=5)))

ddply(dat, .(z), summarise, length(x))
  z ..1
1 a   5
2 b   5
3 c   5
4 d   5
ddply(subset(dat, z!='a'), .(z), summarise, length(x))
  z ..1
1 b   5
2 c   5
3 d   5

That behaves nicely. However, looking at the factor levels within each group surprised me:

ddply(subset(dat, z!='a'), .(z), summarise, paste(levels(z), collapse=' '))
  z     ..1
1 b a b c d
2 c a b c d
3 d a b c d
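A short sketch of the two things that seem to be at play in the output above, assuming `plyr` is loaded. `subset()` keeps unused factor levels, which is why `levels(z)` still lists `a`, while `ddply` has a `.drop` argument (default `TRUE`) that controls whether combinations absent from the data are dropped. The names `sub` and `n` are just illustrative.

```r
library(plyr)

dat <- data.frame(x = 1:20, z = factor(rep(letters[1:4], each = 5)))
sub <- subset(dat, z != "a")

levels(sub$z)              # the unused level "a" is still attached
levels(droplevels(sub)$z)  # droplevels() removes it

# Default .drop = TRUE: only groups that occur in the data are returned
ddply(sub, .(z), summarise, n = length(x))

# .drop = FALSE: ask ddply to keep unobserved combinations as well
ddply(sub, .(z), summarise, n = length(x), .drop = FALSE)
```

So the grouping result and the levels stored on the factor are separate matters: the first depends on `.drop`, the second only changes if you call `droplevels()` yourself.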
mnel
Justin
  • There is an argument `.drop` (which defaults to `TRUE`) for `ddply`. This drops combinations that do not occur in the data. If you run `ddply(subset(dat, z!='a'), .(z), summarise, length(x), .drop = F)`, the first row will be `a, 0`. – mnel Aug 02 '12 at 00:21
  • I thought I was being thorough by adding `data = `, like you should with `ggplot`. Thanks for the help. – mikebmassey Aug 02 '12 at 14:07
  • @mikebmassey You were! Except the argument isn't `data`, it's `.data`. – Justin Aug 02 '12 at 14:39

This worked fine:

summary.users <- ddply(subset(visits.by.user, TotalVisits > 0), 
                          .(quintile, TotalDayVisits, TotalNightVisits), 
                          summarize, NumClients = length(ClientID))

> summary.users
  quintile TotalDayVisits TotalNightVisits NumClients
1    60-80             17                1          1
2    60-80             18               85          1
3   80-100             56               24          1
4   80-100             90                2          1
5   80-100            135                1          1
6   80-100            176                6          1
Maiasaura