
I have a data.table object of dimension 421,570 x 16 which stores data for multiple stores. Each store can have several departments. The snippet below is only a made-up example; the real dataset is bigger.

> head(raw.df)
   Store Type StoreSize Dept       Date Weekly_Sales
1:     1    A    151315    1 2010-02-05        24925
2:     1    A    151315    2 2010-02-05        46039
3:     2    A    152825    1 2010-02-05        41596
4:     2    A    152825    2 2010-02-05        19404
5:     3    B    110025    1 2010-02-05        21828
6:     3    B    110025    2 2010-02-05        21043

Ideally I want to create an array object that stores multiple matrices, one matrix per store. Basically, each layer of the array would hold the weekly sales per department for a single store, so I can run some time series analyses on the array object without calling the same function multiple times.

This command produces the kind of matrix I want for one layer of the array:

e1 <- dcast(raw.df[which(raw.df$Store == 1), ], Date ~ Dept, value.var = "Weekly_Sales")
head(e1)

I was thinking of generating an empty array and filling it using a for loop. I wrote the loop below, which unfortunately doesn't work. The dimensions of the array are 143 (dates), 99 (departments per store) and 45 (stores).

ts.a <- array(data = NA, dim = c(143, 99, 45))
for (i in 1:45) {
  # generate 45 matrices, one for each store
  # (this assignment is invalid: paste() returns a string, not a variable name)
  paste("mat", i, sep = "") <- matrix(data = NA, nrow = 143, ncol = 99)
  paste("mat", i, sep = "") <- dcast(raw.df[which(raw.df$Store == i), ], Date ~ Dept, 
                                     value.var = "Weekly_Sales")

  # merge the matrix into the array object    
}
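
For what it's worth, a version of this loop that does fill the array might look like the sketch below. It writes each store's matrix straight into a slice of the array instead of building named mat1 ... mat45 objects, and it assumes every store has exactly the same 143 dates and 99 departments, in the same order; otherwise the slice dimensions won't line up:

library(reshape2)

ts.a <- array(data = NA, dim = c(143, 99, 45))
for (i in 1:45) {
  # wide table of dates (rows) by departments (columns) for store i
  m <- dcast(raw.df[which(raw.df$Store == i), ], Date ~ Dept,
             value.var = "Weekly_Sales")
  # drop the Date column and copy the values into slice i of the array
  ts.a[, , i] <- as.matrix(m[, -1])
}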

I realise my approach might be completely wrong. My knowledge of R and programming is entirely self-taught.

Gianluca
  • It is a little unclear to me why you seem to think that a three dimensional array will be the most efficient way to store this data. Can you elaborate on that a bit? – joran Apr 28 '14 at 21:38
  • Rather than ask for a specific solution to your problem, why don't you tell us what your actual problem is? `data.table` is quite versatile for this type of analysis (applying similar computations to multiple groups of data). – BrodieG Apr 28 '14 at 22:55
  • I agree with the previous comments and, also, see `?xtabs` as a future utility; e.g. `xtabs(Weekly_Sales ~ Date + Dept + Store, DF)` – alexis_laz Apr 28 '14 at 23:07

2 Answers


Since you did not provide easily usable data, I'll use the french_fries dataset shipped with reshape2:

library(reshape2)

str(french_fries)
acast(data = french_fries, time ~ treatment ~ subject, value.var = 'potato', fun.aggregate = mean)

This gives an array of time vs. treatment, with a slice for each subject.

So the equivalent with your data would be something like:

acast(data = raw.df, Date ~ Dept ~ Store, value.var = 'Weekly_Sales')

If you have multiple values for at least one Date~Dept~Store combination, you will have to provide a function to aggregate the data (the default is length, with a warning).
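
For instance, if a store reported the same department twice on one date, summing the duplicates is one plausible choice, and fill = NA_real_ marks combinations with no data as NA instead of 0. The column names come from the question; the choice of sum is only an illustration:

ts.a <- acast(data = raw.df, Date ~ Dept ~ Store,
              value.var = 'Weekly_Sales', fun.aggregate = sum, fill = NA_real_)
dim(ts.a)      # should be 143 x 99 x 45 for the full dataset
ts.a[, , "1"]  # the dates-by-departments matrix for store 1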

Oscar de León

What kind of analysis are you looking to do? You might be better off keeping the original data.frame and using ddply from the plyr package to perform operations by group. I frequently use this approach for time series analysis across various subsets. Consider the following example:

library(plyr)

set.seed(1)  # reproducible sample data
dat <- data.frame(store = sample(1:3, 1000, replace = TRUE),
                  type  = sample(1:3, 1000, replace = TRUE),
                  date  = sample(1:100, 1000, replace = TRUE),
                  val   = rnorm(1000))

# run a separate regression of value on date for each store/type subgroup
ddply(dat, .(store, type), function(x) {
  model <- lm(val ~ date, data = x)
  c(cf = coef(model), num = nrow(x), adjr2 = summary(model)$adj.r.squared)
})
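
Since the question's object is already a data.table, the same per-group pattern also works without plyr. A minimal sketch, assuming raw.df's Date column is in a format as.Date() can parse; the model itself is only illustrative:

library(data.table)

# per-store, per-department regression of weekly sales on time
raw.df[, {
  model <- lm(Weekly_Sales ~ as.Date(Date))
  as.list(c(cf = coef(model), num = .N, adjr2 = summary(model)$adj.r.squared))
}, by = .(Store, Dept)]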
JPC