
What would the general formula be for describing this savings/difference?

Items to consider for the calculation:

  • Number of array dimensions (identifying columns)
  • The size of each dimension (unique elements in each identifying column)
  • The product of the dimension sizes (number of array elements, unique combinations of the above)
  • The class() of the elements of each dimension when stored in a 2D format as opposed to an array format (character, integer, integer64, factor, double; very likely others as well)

I'm writing the documentation for a function in an R package (a semi in-house package), and I want to describe this point adequately. Depending on what I (or you!) can come up with here, I might even write a function to calculate this difference so the user can see the savings before having to try it both ways (the data set is quite large!); a rough sketch of such a helper follows the example below.

Edit:

# starting object
library(data.table)
d2 <- data.table(v=rnorm(10))
d2[,c("a","b","d","e"):=replicate(4, sample(1:20, 10), simplify=FALSE)]
setkey(d2, a, b, d, e)

# two casts to compare
d2.cast <- d2[CJ(a,b,d,e)] # 2D structure
dN.cast <- reshape2::acast(d2, a~b~d~e, value.var="v") # N-D structure

# compare sizes
print(object.size(d2.cast), units="Kb")
# 236.4 Kb
print(object.size(dN.cast), units="Kb")
# 81 Kb
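
Here is a rough sketch of the kind of estimator I have in mind (the helper name estimate_cast_sizes and the per-class byte counts are my own assumptions, not part of any package). Applied to d2 above it should land close to the object.size() figures, which additionally include attribute and dimnames overhead:

# hypothetical helper: estimate both object sizes before casting.
# assumes `dt` is a data.table, `id.cols` are the identifying columns,
# and there is a single double-precision value column
estimate_cast_sizes <- function(dt, id.cols) {
  dim.sizes <- sapply(id.cols, function(j) uniqueN(dt[[j]]))
  n.cells <- prod(dim.sizes)  # unique combinations = number of array elements
  # approximate bytes per element of each identifying column in the 2D cast
  id.bytes <- sapply(id.cols, function(j) {
    switch(class(dt[[j]])[1L],
      integer = 4L, factor = 4L,      # factors are stored as integer codes
      numeric = 8L, integer64 = 8L,
      character = 8L,                 # one pointer per element on 64-bit
      8L)
  })
  list(array.bytes = n.cells * 8,                      # one double per cell
       cast2d.bytes = n.cells * (sum(id.bytes) + 8))   # id columns + value column
}

estimate_cast_sizes(d2, c("a","b","d","e"))
# $array.bytes  80000  (compare to the ~81 Kb above)
# $cast2d.bytes 240000 (compare to the ~236.4 Kb above)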

And please, if I'm using poor terminology, correct it. I'd love to better describe this situation :)

  • This might be useful: http://adv-r.had.co.nz/memory.html – Molx Nov 13 '15 at 21:23
  • Also `array` seems to be a bit smaller, but not that much. Just a simple test: `object.size(lapply(1:100, function(i) lapply(1:100, function(j) rnorm(100))))` and `object.size(array(rnorm(100^3), dim = c(100, 100, 100)))`. (8484840 and 8000208 bytes, respectively). – Molx Nov 13 '15 at 21:29
  • @Molx see clarifying edit, with examples. The 2D object is much larger because it has to repeat the information that's only stated once by defining the array's dimensions. – rbatt Nov 13 '15 at 21:45
  • In contrast with Jan's answer, I'd say: use the right tool for the job. If you are performing matrix algebra, you'd best be using arrays. If you care about memory and yet aren't using matrix algebra, maybe you should be. Use a data.table whenever you have something akin to observations, especially if you have categorical/grouping variables. And use a list or simple vector where appropriate. All of these classes are handy for analysis. I'm not posting this as an answer since it's essentially just an opinion. – Frank Nov 13 '15 at 23:54
  • I do everything in data.table until I need to pass things off to Stan or JAGS for more analysis. The 0's are needed, but in those languages, NA's must be skipped, so I have to define those nodes as such ahead of time (well, I need to model them, basically). Point: I use data.table, but when I go to do the stats, AFAIK it makes sense to use the array. Could do a 2D thing, but exploding out all the 0's and NA's in 2D is inefficient. – rbatt Nov 15 '15 at 00:24

1 Answer


You should be able to calculate the size of the array as prod(dim(ar)) * the number of bytes per element.
If your data has to keep values for every combination of the dimensions (the cross join, CJ, in data.table) and stores just a single measure, then it is better to use an array.
On the other hand, with data in 2D you can drop the NA rows for all the redundant dimension crosses. That can dramatically reduce the amount of RAM required, in many cases allowing multidimensional data to scale far enough to be analysed.
Unlike an array, 2D-modelled data can also store multiple measures.
And 2D is friendlier in terms of partitioning and distributed computing.
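
As a quick sanity check of that rule of thumb against the objects from the question (assuming dN.cast and d2 as defined there):

# dense N-D array: one 8-byte double per cell
prod(dim(dN.cast)) * 8    # 10 * 10 * 10 * 10 * 8 = 80000 bytes, ~78 Kb
# the remaining couple of Kb of the reported 81 Kb come from the dimnames attribute

# the sparse 2D table only keeps the 10 combinations that actually occur,
# so it stays far below the 236.4 Kb of the fully crossed d2.cast
print(object.size(d2), units="Kb")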

So it totally depends on the data, but IMO in most cases an array doesn't scale at all, while a 2D table (or a star-schema set of 2D tables) scales pretty well.
If you want to dive a little deeper into this, you can check my in-development package data.cube, designed to scale multidimensional data using star-schema-modelled data.tables.


Additionally, here is the star schema described as a simple way to store and process multidimensional data.

The central object of the star schema is the fact table:

#      prod_name  time_date geog_abb amount     value
# 1: AMC Javelin 2010-01-02       AK  23.64 5193.2088
# 2: AMC Javelin 2010-01-02       MD  88.02 1559.0968

It corresponds to a 3D array whose dimensions are product, time, and geography, plus two measures.
An array doesn't carry a hierarchy of attributes for its dimensions, just their character natural keys or integer indices.
A tabular structure lets you create a lookup table for each dimension key in the fact table.
This results in 3 dimension tables:

$dims$product
#              prod_name prod_cyl prod_vs prod_am prod_gear
# 1:         AMC Javelin        8       0       0         3
# 2:  Cadillac Fleetwood        8       0       0         3

$dims$time
#     time_date time_month time_month_name time_quarter time_quarter_name time_year
# 1: 2010-01-01          1         January            1                Q1      2010
# 2: 2010-01-02          1         January            1                Q1      2010

$dims$geography
#    geog_abb      geog_name geog_division_name geog_region_name
# 1:       AK         Alaska            Pacific             West
# 2:       AL        Alabama East South Central            South
# 3:       AR       Arkansas West South Central            South

Later, when accessing the data, you can refer to the higher-level attributes to analyse it; the join is handled automatically by the tool.
This is the basic star schema, and also a simple way to remove NA values from the cross product of all dimensions.
Having defined a hierarchy in each dimension, you can do much more.
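
A minimal, hand-rolled sketch of that lookup join, using toy data.tables loosely based on the printouts above rather than the data.cube API:

library(data.table)

# toy fact table and geography dimension
fact <- data.table(prod_name = c("AMC Javelin", "AMC Javelin"),
                   time_date = as.Date(c("2010-01-02", "2010-01-02")),
                   geog_abb  = c("AK", "MD"),
                   amount    = c(23.64, 88.02),
                   value     = c(5193.2088, 1559.0968))
geography <- data.table(geog_abb = c("AK", "AL", "MD"),
                        geog_region_name = c("West", "South", "South"))

# join the region attribute onto the fact table, then roll up by region
geography[fact, on = "geog_abb"][, .(value = sum(value)), by = geog_region_name]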
In the data.cube package you can use populate_star(1e5) to produce a sales fact table and 5 dimensions.


A few memory consumption tests are available in the last section of the package vignette Subset multidimensional data.

  • Maybe you could provide a ref for "star schema"..? I have not heard of it. Nice website, btw :) – Frank Nov 13 '15 at 23:56
  • @Frank I've edited the answer, let me know if something is not clear :) – jangorecki Nov 14 '15 at 00:58
  • Cool thanks. I just remember "normalization" from database design. The star schema you describe here is also roughly (I think) how I organize data. – Frank Nov 14 '15 at 01:05
  • 1
    Using data.table's fast join we could normalize data into snowflake schema, so put each dimension level in its own table. This requires a lot more joins, it also reduces size of dimension tables. – jangorecki Nov 14 '15 at 01:21
  • I actually need the NA values. Some of them become 0's (species occurrence data), others are true NA's. The 0's are data that need to be modeled, and the NA's are sometimes missing values that need to be modeled too. I'm passing this object to JAGS or Stan. – rbatt Nov 15 '15 at 00:27
  • @rbatt So you can keep the NA value rather than removing the row; in an array you don't have a choice. – jangorecki Dec 22 '18 at 05:47