R data cube define hierarchy

Question

I have some problems with the OLAP cube package data.cube:

install.packages("data.cube", repos = paste0("https://", c(
    "jangorecki.gitlab.io/data.cube",
    "cloud.r-project.org"
)))

Some test data:

 library(data.table)
 set.seed(42)

 dt <- CJ(color = c("green","yellow","red"),
            year = 2011:2015,
            month = 1:12,
            status = c("active","inactive","archived","removed")
 )[sample(600)]

 dt[, "value" := sample(4:7/2, nrow(dt), TRUE)]

Now I would like to create a cube and apply a hierarchy on the time dimensions. Something like this:

library(data.cube)
dc <- as.data.cube(dt, id.vars = c("color", "year", "month", "status"), 
                   measure.vars = "value", 
                   hierarchies = list(time <- list("year, month")))

If I run this code, I get the error:

Error in as.data.cube.data.table(dt, id.vars = c("color", "year", "month",  : 
  identical(names(hierarchies), id.vars) | identical(names(hierarchies),  .... is not TRUE

If I try something like

hierarchies = list(time <- list("year, month"), color <- list("color"), 
                  status <- list("status"))

I get the same error.

jangorecki · Accepted Answer · 2018-10-15T15:58:19.317

Very well written question.
I see you based your example on the ?as.data.cube examples, so I will answer using those examples too.

# Original example goes as follows
library(data.cube)
library(data.table)
set.seed(1L)
dt = CJ(color = c("green","yellow","red"),
        year = 2011:2015,
        status = c("active","inactive","archived","removed"))[sample(30)]
dt[, "value" := sample(4:7/2, nrow(dt), TRUE)]

dc = as.data.cube(
  x = dt, id.vars = c("color","year","status"),
  measure.vars = "value",
  hierarchies = sapply(c("color","year","status"),
                       function(x) list(setNames(list(character()), x)),
                       simplify=FALSE)
)
str(dc)

Your error seems to be raised when the validity of hierarchies is checked.
Unfortunately the error message is not very meaningful. I created issue #18, so this will get improved one day.
So let's compare the hierarchies from the manual with those created in your example.

sapply(c("color","year","status"),
       function(x) list(setNames(list(character()), x)),
       simplify=FALSE) -> h
str(h)
#List of 3
# $ color :List of 1
#  ..$ :List of 1
#  .. ..$ color: chr(0) 
# $ year  :List of 1
#  ..$ :List of 1
#  .. ..$ year: chr(0) 
# $ status:List of 1
#  ..$ :List of 1
#  .. ..$ status: chr(0)     

hierarchies = list(time <- list("year, month"), color <- list("color"), 
                   status <- list("status"))
str(hierarchies)
#List of 3
# $ :List of 1
#  ..$ : chr "year, month"
# $ :List of 1
#  ..$ : chr "color"
# $ :List of 1
#  ..$ : chr "status"

We can see that hierarchies in the manual is a list of named elements, while yours is a list of unnamed elements.
I believe you used <- where = should be used. The <- operator is not always equivalent to =: inside a function call, = names an argument, whereas <- performs an assignment and passes the result as an unnamed argument. You can read more about exactly this case in 3.1.3.1 Assignment <- vs =.
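To make the difference concrete, here is a minimal base R sketch (it does not touch data.cube). It builds the same one-element list both ways; the values are illustrative only.

```r
# Base R only; nothing here depends on data.cube.
unnamed <- list(time <- list("year"))  # `<-` assigns a variable called time
named   <- list(time = list("year"))   # `=` names the list element

names(unnamed)                  # NULL: the element has no name
names(named)                    # "time"
identical(time, list("year"))   # TRUE: side-effect variable left behind
```

So with <- you get an unnamed element plus a stray variable in your workspace, which is why str() showed no names in your hierarchies.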

So let's see if fixing that is sufficient.

hierarchies = list(time = list(c("year, month")), color = list("color"), 
                   status = list("status"))

dc <- as.data.cube(dt, id.vars = c("color", "year", "month", "status"), 
                   measure.vars = "value", 
                   hierarchies = hierarchies)

We still get the same error, so the names, while required, were not the root cause of the issue. After taking a closer look, I see that you want to build a time dimension without a primary key for it.
An important note: you cannot pass multiple column names as a single string, thus

"year, month"

should be written as

c("year","month")
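A quick base R check of why the single string cannot work. The cols vector below simply mirrors the column names of your test data; it is only there for illustration.

```r
one_string  <- "year, month"
two_strings <- c("year", "month")
cols <- c("color", "year", "month", "status", "value")  # column names of dt

length(one_string)           # 1: a single (non-existent) column name
length(two_strings)          # 2
one_string %in% cols         # FALSE
all(two_strings %in% cols)   # TRUE

# If you only have the comma-separated form, split and trim it
split_names <- trimws(strsplit(one_string, ",", fixed = TRUE)[[1]])
identical(split_names, two_strings)  # TRUE
```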

We still need the primary key of the time dimension to be a single field, of which year and month will be just attributes.
So let's create a primary key for the time dimension. As our time dimension has year-month granularity, we will create the key at that granularity.

library(data.table)
set.seed(42)

dt <- CJ(color = c("green","yellow","red"),
         year = 2011:2015,
         month = 1:12,
         status = c("active","inactive","archived","removed")
)[sample(600)
  ][, yearmonth:=sprintf("%04d%02d", year, month) # zero-pads to four digits for year and two digits for month
    ]

dt[, "value" := sample(4:7/2, nrow(dt), TRUE)]
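As a side note on the key format, here is a small base R sketch of what the zero-padding buys you. The three year/month pairs are made up for illustration, and paste0 is shown only as the unpadded alternative.

```r
year  <- c(2011L, 2011L, 2011L)
month <- c(2L, 10L, 11L)

padded   <- sprintf("%04d%02d", year, month)  # fixed width, always 6 characters
unpadded <- paste0(year, month)               # width varies with the month

padded     # "201102" "201110" "201111"
unpadded   # "20112"  "201110" "201111"

# Fixed-width keys keep chronological order when sorted as text
identical(order(padded, method = "radix"), order(year, month))    # TRUE
identical(order(unpadded, method = "radix"), order(year, month))  # FALSE

# and year and month can be recovered from the key by position
as.integer(substr(padded, 1, 4))  # 2011 2011 2011
as.integer(substr(padded, 5, 6))  # 2 10 11
```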

Now let's define the hierarchies; note that yearmonth now takes the place of year. In the hierarchies below, the vector c("year","month") means that those attributes depend on yearmonth. See ?as.data.cube for examples of more complex hierarchies.

hierarchies = list(
  color = list(color = list(color = character())),
  yearmonth = list(yearmonth = list(yearmonth = c("year","month"))),
  status = list(status = list(status = character()))
)

dc = as.data.cube(
  x = dt, id.vars = c("color","yearmonth","status"),
  measure.vars = "value",
  hierarchies = hierarchies
)
str(dc)
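If you want to sanity-check such a hierarchies list before passing it on, something like the following base R sketch may help. make_dimension is a hypothetical helper of mine, not part of data.cube; it only reproduces the nested shape used above. The names check mirrors the visible part of the condition in your error message, and the last check assumes that each key value should map to exactly one (year, month) pair.

```r
# Hypothetical helper (NOT a data.cube function): builds one dimension entry
# of the form list(key = list(key = attrs)).
make_dimension <- function(key, attrs = character()) {
  setNames(list(setNames(list(attrs), key)), key)
}

id.vars <- c("color", "yearmonth", "status")
hierarchies <- list(
  color     = make_dimension("color"),
  yearmonth = make_dimension("yearmonth", c("year", "month")),
  status    = make_dimension("status")
)

# Visible part of the check from the error message
names_ok <- identical(names(hierarchies), id.vars)

# Assumed requirement: attributes are determined by the key
d <- expand.grid(year = 2011:2015, month = 1:12)
d$yearmonth <- sprintf("%04d%02d", d$year, d$month)
key_ok <- nrow(unique(d[c("yearmonth", "year", "month")])) ==
  length(unique(d$yearmonth))

names_ok  # TRUE
key_ok    # TRUE
```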

Our data.cube has been created successfully. Let's try to query it using the yearmonth key.

dc[, .(yearmonth=201105L)] -> d
as.data.table(d)
dc[, .(yearmonth=201105L), drop=FALSE] -> d
as.data.table(d)

Now try to query it using the attributes of the dimension: year, month, and both together.

dc[, .(year=2011L)] -> d
as.data.table(d) # note that the dimension is not dropped because it still has more than one value
dc[, .(month=5L)] -> d
as.data.table(d)
dc[, .(year=2011L, month=5L)] -> d
as.data.table(d) # here the dimension has been dropped because only a single element of it remained; you can of course use `drop=FALSE` if needed
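If the drop behaviour is unfamiliar, it may help to compare it with plain base R arrays, which behave in a similar way. This is only an analogy on a made-up array; data.cube's own subsetting may differ in detail.

```r
# Illustrative 3-dimensional array: color x status x year
a <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(color  = c("green", "red"),
                           status = c("a", "b", "c"),
                           year   = as.character(2011:2014)))

dim(a[, , "2011"])                 # 2 3   : single-value dimension dropped
dim(a[, , "2011", drop = FALSE])   # 2 3 1 : dimension kept
dim(a[, , c("2011", "2012")])      # 2 3 2 : more than one value, nothing to drop
```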

Hope that helps, good luck!
