5

I have a data set containing some groups and I want to calculate the number of records in each group, where a certain condition is met. I then want to expand the result to the rest of the records within each group (i.e. where the condition is not met) because I am collapsing the table later.

I'm using data.table to do this, and the .N function to calculate the number of records within each group that meet my condition. I then get the max of all the values within each group to apply the result to all records within each group. My data set is quite large (nearly 5 million records).

I keep getting the following error:

  Error in `[.data.table`(dpart, , `:=`(clustersize4wk, max(clustersize4wk,  : 
  Type of RHS ('double') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

At first, I assumed that using .N was producing an integer, whereas getting the max of the values by group was producing a double, however this does not seem to be the case (in the toy example below, the class of the results column remains as integer throughout) and I'm unable to reproduce the problem.

For illustration, here is an example:

# Example data:

mydt <- data.table(id = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
                   grp = c("G1", "G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G2", "G2", "G2"),
                   name = c("Jack", "John", "Jill", "Joe", "Jim", "Julia", "Simran", "Delia", "Aurora", "Daniele", "Joan", "Mary"),
                   sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f"), 
                   age = c(2,12,29,15,30,75,5,4,7,55,43,39), 
                   reportweek = c("201740", "201750", "201801", "201801", "201801", "201748", "201748", "201749", "201750", "201752", "201752", "201801"))

I am calculating the number within each group that are male like this:

mydt[sex == "m", csize := .N, by = id]

> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE

Some groups do not contain any males, so to avoid getting Inf in the next step I recode NA as 0:

mydt[ is.na(csize), csize := 0]

I then expand the result to all members within each group like this:

mydt[, csize := max(csize, na.rm = T), by = id] 

> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE

This is the point at which the error appears in my real data set. If I omit the step to recode NAs to 0 I can reproduce the error with the example data; otherwise not. Also with my real data set (in spite of having recoded NAs to 0) I still get the following warning:

19: In max(clustersize4wk, na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf 

How can I resolve this?

My expected output is below:

> mydt
    id grp    name sex age reportweek csize
 1:  a  G1    Jack   m   2     201740     2
 2:  a  G1    John   m  12     201750     2
 3:  b  G1    Jill   f  29     201801     2
 4:  b  G1     Joe   m  15     201801     2
 5:  b  G1     Jim   m  30     201801     2
 6:  c  G2   Julia   f  75     201748     1
 7:  c  G2  Simran   m   5     201748     1
 8:  c  G2   Delia   f   4     201749     1
 9:  c  G2  Aurora   f   7     201750     1
10:  d  G2 Daniele   f  55     201752     0
11:  d  G2    Joan   f  43     201752     0
12:  d  G2    Mary   f  39     201801     0
Amy M
  • 967
  • 1
  • 9
  • 19
  • Seems the problem is taking the max when there are no values. For example `is.double(max(NA, na.rm=TRUE))` return TRUE since it returns -Inf which is represented as a double. What do you want to return when there are no values? NA? maybe `csize := ifelse(sum(!is.na(csize))>0, max(csize, na.rm = T), NA)`? – MrFlick Mar 09 '18 at 20:08
  • When there are no values, the result should be 0 (e.g. in id group 'd' above, there are no males and csize represents the number of males in the group, so the answer should be 0 for all the records in group d). – Amy M Mar 09 '18 at 20:11
  • On, then maybe just `csize := max(c(0L,csize), na.rm = T)` – MrFlick Mar 09 '18 at 20:13
  • Just tried that on my real data and still getting the same error... – Amy M Mar 09 '18 at 20:26
  • Try these two lines: `mydt[sex == "m", csize := as.double(.N), by = id]; mydt[, csize := max(c(csize, 0L), na.rm = TRUE), by = id]` – MKR Mar 09 '18 at 20:35
  • 1
    You could do `mydt[, n := mydt[sex == "m"][.SD, on=.(id), .N, by=.EACHI]$N]` or more simply .. `mydt[, n := sum(sex == "m"), by=id]` – Frank Mar 10 '18 at 02:18
  • 1
    I like the simplicity of @Frank ' s solution although in my real data summing is not always possible (for example in some cases I want to get the maximum date within groups). – Amy M Mar 13 '18 at 15:17

1 Answers1

3

The actual problem is that datatype of the csize. Its of type integer. The max returns double type.

The fix could be:

mydt[sex == "m", csize := as.double(.N), by = id]

mydt[, csize := max(csize, 0, na.rm = TRUE), by = id]
MKR
  • 19,739
  • 4
  • 23
  • 33
  • 1
    Thanks for this - my r session has just crashed, but I don't think that is due to your solution as it seems sensible, so going to accept this answer and assume it will work once I can debug the rest of my script. – Amy M Mar 09 '18 at 20:59
  • Update: it seems that evaluating the conditional statement, e.g. `sex == "m"` gives an integer as the answer (which is then incompatible with .N on the RHS because it was converted to double with `as.double(.N)`). So for counts, @Frank solution is better because it circumvents the issue of having blanks for rows that don't meet the condition, while not causing any class discrepancy between the RHS and LHS: `mydt[, n := sum(sex == "m"), by=id]`. – Amy M Mar 13 '18 at 19:01
  • @AmyM The `sex == "m"` will not play any role in assignment and that is not root cause of error (like `RHS ('double') must..`). You had already checked that `is.integer(mydt$csize)` is `TRUE`. And you can try `is.double(max(mydt$csize))` to find it as `TRUE`. My answer was towards helping you to understand the problem and overcome it. Otherwise there are other ways to solve the same. – MKR Mar 13 '18 at 19:20
  • the reason I posted the update above is because I was getting the same RHS: LHS class discrepancy error as before when applying your solution to my real data - but you are right, I had already checked this and it should not be an issue. I now suspect a bug which throws either this error or crashes R completely when dealing with some combination of missing values, grouping vectors and conditions - have posted an issue [here](https://github.com/Rdatatable/data.table/issues/2672) – Amy M Mar 13 '18 at 19:42
  • @AmyM I'll have a look. Also I was about to add additional info regarding dealing with something like `max date` as you had mentioned in comments. – MKR Mar 13 '18 at 19:44
  • @AmyM I didnt see any benefit to make it 2 step process to change type of `csize`. It can be changed to double in the 1st step itself. Do you see any benefit by having it 2 step process? – MKR Mar 15 '18 at 15:40
  • doing `mydt[ sex == "m", csize := as.double(.N), by = id]` results in the same error as before. I had to update to the devel version of data.table to stop my function from crashing R (this turned out to be due to a known bug with `rbindlist` which is fixed in the devel version) and when I tried running your one-liner it threw an error, which is not produced if you convert to double on a separate line. I think there is still something quite buggy about this behavior although I understand it is supposed to save memory because integers take up less space than double. – Amy M Mar 15 '18 at 16:04
  • @AmyM I understood the reason now. But that's unexpected behavior. At my place error doesnt happen. You can try another way as `mydt[ sex == "m", csize := .N, csize := as.double(csize), by = id]` – MKR Mar 15 '18 at 16:07
  • 1
    the real question is, whether generating such vectors as integers in the first place is ever useful, given that other mathematical operations can't be performed on them. It would be nice if .N worked like a regular function with an optional argument to return integer or double. – Amy M Mar 15 '18 at 16:07
  • @AmyM I think since `.N` is meant to be number of rows hence no harm it being `integer` but it should allow to be transformed. Not sure why that buggy behavior. You can post it to `Github` – MKR Mar 15 '18 at 16:10
  • @AmyM Sorry. I was busy at some other things. Just wanted to say that if you want to edit the answer to 2 steps solution you can do that. I'll approve this time. – MKR Mar 15 '18 at 18:33