1
   Server  Epoch      A B C D E
    1 C301 1420100400 1 0 1 0 0
    2 C301 1420100700 0 0 0 0 0
    3 C301 1420152000 0 1 0 0 0
    4 C301 1420238100 1 1 1 0 0
    5 C301 1420324500 1 1 1 1 1

I need help getting the matrix above into basket or transaction form (to use with cSpade algorithm in package arulesSequences) such that every "1" in the matrix is a transaction item. ie, the output would look something like this:

Server    Epoch       #items    Items
C301      1420100400  2         A C
C301      1420152000  1         B
C301      1420238100  3         A B C
C301      1420324500  5         A B C D E

I've written a long function but its not very efficient and very time consuming. It needs to be scalable across huge data sets. thanks for help in advance

qman
  • 11
  • 4

1 Answers1

1

You can try a combination of melt from reshape2 and aggregate. After melting the dataset, isolate the values equaling 1 to aggregate by Server and Epoch. To sum the variables in the column we use length, and toString for the list of Items:

library(reshape2)
m <- melt(df1, c("Server", "Epoch"))
aggregate(variable~Server+Epoch, m[m$value==1,], FUN=function(x) cbind(length(x), toString(x)))
# Server      Epoch variable.1    variable.2
# 1   C301 1420100400          2          A, C
# 2   C301 1420152000          1             B
# 3   C301 1420238100          3       A, B, C
# 4   C301 1420324500          5 A, B, C, D, E
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • thank you for the quick reply! This seems to work pretty well, my only problem now is that it prints fine but it doesnt "view" the same way as it prints. I.e variable.2 ends up under server, epoch, variable.1 instead of in its own separate column like your output – qman Jan 04 '16 at 17:23
  • Save the last object to a variable `a = aggregate(variable.....)` Then try `do.call(data.frame, a)` – Pierre L Jan 04 '16 at 17:35
  • actually when I write it out to a txt file it shows up right, thank you! – qman Jan 04 '16 at 17:37
  • A "data.table" equivalent: `melt(as.data.table(mydf), 1:2)[value == 1, list(sum(value), paste(variable, collapse = " ")), by = .(Server, Epoch)]`. – A5C1D2H2I1M1N2O1R2T1 Jan 04 '16 at 18:10