What would the general formula be for describing this savings/difference?
Items to consider for the calculation:
- Number of array dimensions (identifying columns)
- The size of each dimension (unique elements in each identifying column)
- The product of the dimension sizes (number of array elements, unique combinations of the above)
- The class() of the elements of each dimension when stored in 2D rather than array format (character, integer, integer64, factor, double, and very likely others as well)
I'm writing the documentation for a function in an R package (semi in-house package), and I want to adequately describe this point. Depending on what I (or you!) can come up with here, I might even write a function to calculate this difference so the user can see the savings before having to try it both ways (the data set is quite large!).
Edit:
library(data.table)
# starting object
d2 <- data.table(v=rnorm(10))
d2[,c("a","b","d","e"):=replicate(4, sample(1:20, 10), simplify=FALSE)]
setkey(d2, a, b, d, e)
# two casts to compare
d2.cast <- d2[CJ(a,b,d,e)] # 2D structure
dN.cast <- reshape2::acast(d2, a~b~d~e, value.var="v") # N-D structure
# compare sizes
print(object.size(d2.cast), units="Kb")
# 236.4 Kb
print(object.size(dN.cast), units="Kb")
# 81 Kb
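For what it's worth, here is my own back-of-the-envelope attempt at the arithmetic, using the items listed earlier. It is only a sketch: `estimate_sizes` and `bytes_per_element` are names I made up for illustration (they are not from any package), and it assumes fixed per-element sizes on a 64-bit build while ignoring object headers, dimnames, factor levels, and the strings behind character columns.

```r
# Hypothetical lookup (an assumption, not from any package): bytes per stored element.
# character is counted as one 8-byte pointer; the strings themselves are ignored.
bytes_per_element <- c(logical = 4, integer = 4, factor = 4,
                       double = 8, integer64 = 8, character = 8)

# Hypothetical helper: rough size in bytes of the 2D and N-D layouts.
#   dim_sizes  - number of unique elements in each identifying column
#   dim_types  - class of each identifying column in the 2D layout
#   value_type - class of the value column / array cells
estimate_sizes <- function(dim_sizes, dim_types, value_type = "double") {
  n_cells <- prod(dim_sizes)  # number of unique combinations
  value_bytes <- n_cells * bytes_per_element[[value_type]]
  # 2D: every identifying column is repeated once per row
  id_bytes <- n_cells * sum(bytes_per_element[dim_types])
  # N-D: identifiers live only in dimnames, which this sketch ignores
  c(two_d = id_bytes + value_bytes, n_d = value_bytes)
}

# The example above: four integer dimensions of 10 unique values each
est <- estimate_sizes(rep(10, 4), rep("integer", 4))
est / 1024  # in the same 1024-byte "Kb" that object.size() reports
```

For the example this gives roughly 234.4 Kb versus 78.1 Kb, which is close to the measured 236.4 Kb and 81 Kb; presumably the gap is the overhead I'm ignoring, but I don't know how well this holds up for character or factor columns.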
And please, if I'm using poor terminology, correct it. I'd love to describe this situation better :)