How to convert dataframe into usable format for sequence mining in R?

Question

I'd like to do sequence analysis in R, and I'm trying to convert my data into a usable form for the arulesSequences package.

library(tidyverse)
library(arules)
library(arulesSequences)

df <- data_frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3))
df.trans <- as(df, "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
seq <- cspade(df.trans, parameter = list(support = 0.4), control = list(verbose = TRUE))

If I leave my columns in their original classes as above, I get an error:

Error in asMethod(object) : 
  column(s) 1, 2, 3, 4 not logical or a factor. Discretize the columns first.

However, if I convert the columns to factors, I get another error:

df <- data_frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3))

df <- as.data.frame(lapply(df, as.factor))
df.trans <- as(df, "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
seq <- cspade(df.trans, parameter = list(support = 0.4), control = list(verbose = TRUE))

Error in asMethod(object) :
In makebin(data, file) : 'eventID' is a factor

Any advice on getting around this or advice on sequence mining in R in general is much appreciated. Thanks!

score 3 · Accepted Answer · answered Apr 03 '18 at 14:25


Only the actual items (in your case, "site") go into the transactions; the sequence and event ids belong in transactionInfo. Always inspect your intermediate results to make sure they look right. The type of transactions needed for sequence mining is described in ?cspade.

library("arulesSequences")
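# stringsAsFactors = TRUE keeps site a factor; R >= 4.0 no longer converts strings by default
df <- data.frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3),
             stringsAsFactors = TRUE)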

# convert site into itemsets and add sequence and event ids
df.trans <- as(df[,"site", drop = FALSE], "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
inspect(df.trans)

# sort by sequenceID
df.trans <- df.trans[order(transactionInfo(df.trans)$sequenceID),]
inspect(df.trans)

# mine sequences
seq <- cspade(df.trans, parameter = list(support = 0.2), 
              control = list(verbose = TRUE))
inspect(seq)
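
The second error in the question ('eventID' is a factor) suggests that the ids must be plain numbers, not factors. Below is a minimal base-R sketch of that preparation, assuming the ids should be integers and the rows ordered by sequence id and then event id. The helper name `prepare_ids` is hypothetical and not part of arulesSequences.

```r
# Hypothetical helper: coerce id columns to integers and order the rows.
prepare_ids <- function(df, seq_col, event_col) {
  # as.character() first, so factor ids keep their labels rather than level codes
  df[[seq_col]]   <- as.integer(as.character(df[[seq_col]]))
  df[[event_col]] <- as.integer(as.character(df[[event_col]]))
  # order by sequence id, then by event id within each sequence
  df[order(df[[seq_col]], df[[event_col]]), , drop = FALSE]
}

# Same toy data as above, with the ids stored as factors (the failing case)
df_fac <- data.frame(
  eventID  = factor(c(100, 101, 102, 103, 104)),
  site     = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
  sequence = factor(c(1, 2, 1, 2, 3))
)

prepared <- prepare_ids(df_fac, "sequence", "eventID")
str(prepared)
```

The prepared columns can then be assigned to transactionInfo() as in the code above.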

Hope this helps!

Michael Hahsler


One more question -- with this toy dataset, cspade runs quickly, and that holds for datasets of up to ~200 rows. But a dataset with 500+ rows takes much longer, and on my actual dataset (~3000 rows), cspade consumes all available memory and never finishes. Any advice? – mowglis_diaper Apr 04 '18 at 12:33
Maybe you need to increase support? – Michael Hahsler Apr 04 '18 at 21:28

@MichaelHahsler: Hi! This was a great explanation. However, the items are becoming {site="..."}. How to get rid of the "site=" portion? – lu5er Nov 21 '18 at 04:22

@lu5er: You can use `itemLabels(df.trans) <-` to assign new item labels. – Michael Hahsler Nov 25 '18 at 16:18
@prayay You can do this: `itemLabels(df.trans) <- str_replace_all(itemLabels(df.trans), pattern = 'site=', '')` – igorkf Aug 21 '20 at 19:24
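
The two follow-ups in the comments (dropping the "site=" prefix and keeping cspade from exhausting memory) might be combined as in the sketch below. It uses base R `sub()` in place of `str_replace_all()`, and it assumes that the `maxsize` and `maxlen` entries of cspade's `parameter` list are suitable for bounding the search. The values are illustrative only and untested on large data.

```r
library(arules)
library(arulesSequences)

df <- data.frame(
  eventID  = c(100, 101, 102, 103, 104),
  site     = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
  sequence = c(1, 2, 1, 2, 3),
  stringsAsFactors = TRUE  # keep site a factor so the coercion below works
)
# order rows by sequence id, then event id, before building the transactions
df <- df[order(df$sequence, df$eventID), ]

df.trans <- as(df[, "site", drop = FALSE], "transactions")
transactionInfo(df.trans)$sequenceID <- as.integer(df$sequence)
transactionInfo(df.trans)$eventID <- as.integer(df$eventID)

# strip the "site=" prefix that the coercion adds to every item label
itemLabels(df.trans) <- sub("^site=", "", itemLabels(df.trans))
inspect(df.trans)

# assumed settings for a smaller search space: higher support, capped sizes
seqs <- cspade(df.trans,
               parameter = list(support = 0.4, maxsize = 1, maxlen = 2),
               control = list(verbose = FALSE))
inspect(seqs)
```

Raising `support` prunes rare patterns early, while `maxsize` and `maxlen` cap how large each element and each sequence may grow. Which combination is enough for a given dataset has to be found by experiment.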
