i have a dcast() application whose cross-product exceeds .Machine$integer.max. is there a recommended alternative for dealing with this situation? i could break w up into smaller pieces, but was hoping for a cleaner solution.

this might be a duplicate of "R error when applying dcast to a large data.table object", but that question doesn't have an answer either.

thanks!

library(data.table)

# three million x one thousand
w <- data.table( x = 1:3000000 , y = 1:1000 )

z <- data.table::dcast( w , x ~ y , value.var = 'x' )
# Error in CJ(1:3000000, 1:1000) : 
  # Cross product of elements provided to CJ() would result in 3e+09 rows which exceeds .Machine$integer.max == 2147483647
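
To make the size of the problem concrete, here is a rough back-of-the-envelope check of the numbers in the error message. The variable names are only illustrative, and the piece count assumes the worst case where every piece still contains all three million distinct x values.

```r
n_x <- 3000000   # distinct values of x (rows of the wide table)
n_y <- 1000      # distinct values of y (columns of the wide table)

# use doubles here: 3000000L * 1000L would overflow to NA with a warning
cells <- n_x * n_y
cells                          # 3e+09
cells > .Machine$integer.max   # TRUE

# smallest number of y-groups that keeps each cross product under the limit
min_pieces <- ceiling( cells / .Machine$integer.max )
min_pieces                     # 2
```

So in principle two pieces would already be enough; using more pieces just keeps each intermediate cast smaller.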
Anthony Damico

1 Answer


i guess this solution works if one of your variables is numeric and you also have a sense of its distribution (so you can cut it into roughly equal pieces): cast each piece separately, then merge the pieces back together.

library(data.table)

# three million x one thousand
w <- data.table( x = 1:3000000 , y = 1:1000 )

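# the direct cast fails with the CJ() error from the question, so skip it
# z <- data.table::dcast( w , x ~ y , value.var = 'x' )

# bin y into ten groups of roughly 100 values each (0 through 9)
w[ , cast_cat := findInterval( y , seq( 100 , 900 , 100 ) ) ]
# one data.table per group
w_list <- split( w , by = 'cast_cat' )
# drop the helper column, which is no longer needed
w_list <- lapply( w_list , function( x ) x[ , cast_cat := NULL ] )
# cast each group on its own so every cross product stays under the limit
w_list <- lapply( w_list , function( z ) data.table::dcast( z , x ~ y , value.var = 'x' ) )
# outer-join the pieces back together on x
result <- Reduce( function( ... ) merge( ... , by = 'x' , all = TRUE ) , w_list )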
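
As a sanity check, the same chunk, cast and merge steps can be compared against a direct dcast() on a table small enough for the direct call to succeed. This is only a sketch: w_small, the break points c( 4 , 7 ) and the other names are made up for illustration, and keep.by = FALSE is used in place of the separate cast_cat := NULL step.

```r
library(data.table)

# small stand-in for w: 30 rows, y cycles through 1:10 three times
w_small <- data.table( x = 1:30 , y = rep( 1:10 , times = 3 ) )

# the direct cast works at this size and serves as the reference
direct <- data.table::dcast( w_small , x ~ y , value.var = 'x' )

# same chunk-cast-merge steps, with break points scaled down to 1:10
w_small[ , cast_cat := findInterval( y , c( 4 , 7 ) ) ]
pieces <- split( w_small , by = 'cast_cat' , keep.by = FALSE )
pieces <- lapply( pieces , function( p ) data.table::dcast( p , x ~ y , value.var = 'x' ) )
chunked <- Reduce( function( a , b ) merge( a , b , by = 'x' , all = TRUE ) , pieces )

# line the columns up with the reference before comparing
setcolorder( chunked , names( direct ) )
all.equal( direct , chunked , check.attributes = FALSE )
```

If the comparison holds on the small table, the only thing that changes at full size is how many pieces you need.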
Anthony Damico
  • Even if the variables aren't numeric or follow an unknown distribution, you could simply add a random number column keyed on the lhs of the `formula` used in `dcast`. – Adam Hoelscher Jun 13 '22 at 22:37
  • Can someone explain the `function( ... ) merge( ... , by = 'x', all=TRUE), w_list)` ? – KArrow'sBest Feb 02 '23 at 21:32
  • with the `Reduce` it merges together an unlimited number of data.frame objects rather than only merging two as you would in a normal merge – Anthony Damico Feb 03 '23 at 07:00
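
To illustrate the Reduce() point from the comments: Reduce() calls the function with two arguments at a time (the running result and the next list element), so the `...` in the answer always receives exactly two tables. A minimal sketch with three made-up tables (a, b, d and merge_two are hypothetical names):

```r
library(data.table)

# three tiny tables sharing the key column x (hypothetical example data)
a <- data.table( x = 1:3 , p = c( 10 , 20 , 30 ) )
b <- data.table( x = 2:4 , q = c( 'u' , 'v' , 'w' ) )
d <- data.table( x = c( 1L , 4L ) , r = c( TRUE , FALSE ) )

# a two-table outer join on x, equivalent to the anonymous function in the answer
merge_two <- function( left , right ) merge( left , right , by = 'x' , all = TRUE )

# Reduce folds the list left to right: merge_two( merge_two( a , b ) , d )
via_reduce <- Reduce( merge_two , list( a , b , d ) )
by_hand <- merge_two( merge_two( a , b ) , d )

all.equal( via_reduce , by_hand )
via_reduce
```

The result has one row per distinct x across all three tables, with NA wherever a table had no matching row.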