
I have a large data.table object in R with 4,847,143 rows. Speed is key, so I have implemented most operations with library(data.table).

The dt has the following structure:

library(data.table)

dt

   nr group count
1:  1     A     2
2:  1     B     2
3:  2     C     2
4:  2     D     2
5:  2     A     2
6:  3     B     2
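
For anyone who wants to reproduce the shape, a comparable table can be simulated along these lines (a sketch only: the real data is not shown, and the number of distinct nr values and groups are guesses based on the dims further down):

library(data.table)

# Hypothetical simulation: ~4.8M rows, integer ids, ~700 distinct groups
set.seed(1)
n  <- 4847143L
dt <- data.table(
  nr    = sample.int(1200000L, n, replace = TRUE),
  group = sample(sprintf("G%03d", 1:716), n, replace = TRUE),
  count = 2L
)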

When I try to convert this long dt to wide format using dcast, I get the following error:

ndt <- dcast(dt, nr ~ group, fun.aggregate = sum, value.var = 'count')

Error in dim.data.table(x) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138
In addition: Warning message:
In setattr(l, "row.names", .set_row_names(length(l[[1L]]))) :
  NAs introduced by coercion to integer range

When I apply the same call to a subset of the first 2,000,000 rows, it works fine:

ndt <- dcast(dt[1:2000000], nr ~ group, fun.aggregate = sum, value.var = 'count')

dim(dt)
[1] 4847143       3

dim(ndt)
[1] 1166035     716

Any help in resolving this, or an alternative fast solution, would be greatly appreciated.
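
One alternative I have been considering (a sketch only, not yet tested at this scale) is to skip the dense wide table entirely and build a sparse matrix with the Matrix package; sparseMatrix() sums the x values of repeated (i, j) pairs, which matches fun.aggregate = sum:

library(data.table)
library(Matrix)

# Build the nr x group result as a sparse matrix, so the full grid
# is never materialized as a dense table.
ids  <- sort(unique(dt$nr))
grps <- sort(unique(dt$group))
wide <- sparseMatrix(
  i        = match(dt$nr, ids),
  j        = match(dt$group, grps),
  x        = dt$count,
  dims     = c(length(ids), length(grps)),
  dimnames = list(as.character(ids), grps)
)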

My data.table version:

> packageVersion('data.table')
[1] ‘1.10.4.3’

Thanks

ZeroStack
  • Simulating a `data.table` of that format with 5 million rows does not lead to that problem when using your `dcast()` on it. My version of `data.table` is 1.10.4-3. What is yours? – Martin Schmelzer Dec 15 '17 at 01:38
  • My version of data.table is ‘1.10.4.3’; bear in mind the cast results in over 700 columns on a 2,000,000-row subset. On the whole dt object it will be larger. – ZeroStack Dec 15 '17 at 01:39
  • Could you please paste the result of `str(dt)` into your question. – Martin Schmelzer Dec 15 '17 at 01:41
  • on my machine, `.Machine$integer.max = 2147483647` ~ `2.14e9`, which is on the same order of magnitude as `700*2e6`... – C8H10N4O2 Dec 15 '17 at 02:27
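
A quick check of the scale that last comment points at (a sketch; assumes dt from the question is loaded, uniqueN() is from data.table):

.Machine$integer.max                             # 2147483647
as.numeric(uniqueN(dt$nr)) * uniqueN(dt$group)   # cells in the full wide cast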

0 Answers