I have a large dataset and I'm trying to mine association rules between the variables.

My problem is that I have to look for association rules among 160 variables, and I have more than 1800 item-sets.

Furthermore, my variables are continuous. To mine association rules I usually use the apriori algorithm but, as is well known, this algorithm requires categorical variables.

Does anyone have any suggestions on what kind of algorithm I can use in this case?

A restricted example of my dataset is the following:

ID_Order   Model     ordered quantity
A.1        typeX     20
A.1        typeZ     10
A.1        typeY     5
B.2        typeX     16
B.2        typeW     12
C.3        typeZ     1
D.4        typeX     8
D.4        typeG     4
...

My goal is to mine association rules and correlations between different products, maybe with a neural network algorithm in R. Does anyone have any suggestions on how to solve this problem?

Thanks in advance

Lorenzo Benassi

2 Answers

You can create transactions from your dataset like this:

library(dplyr)
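
The code in this answer refers to a data frame named df that is never defined. A minimal sketch that rebuilds it from the sample rows in the question, assuming the third column is called quantity (a hypothetical name; the question labels it "ordered quantity"):

```r
# Sample rows copied from the question. The column name `quantity` is an
# assumption standing in for the question's "ordered quantity" column.
df <- data.frame(
  ID_Order = c("A.1", "A.1", "A.1", "B.2", "B.2", "C.3", "D.4", "D.4"),
  Model    = c("typeX", "typeZ", "typeY", "typeX", "typeW", "typeZ",
               "typeX", "typeG"),
  quantity = c(20, 10, 5, 16, 12, 1, 8, 4),
  stringsAsFactors = FALSE
)

# Quick structural check before building transactions.
str(df)
```

Only ID_Order and Model are used to build the transactions; the quantity column is carried along but ignored by the steps in this answer.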

This function is used to get the transactions per ID_Order

concat <- function(x) {
  return(list(as.character(x)))
}

Group df by ID_Order and concatenate. pull() returns the concatenated Models in a list.

a_list <- df %>% 
  group_by(ID_Order) %>% 
  summarise(concat = concat(Model)) %>%
  pull(concat)

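Set names to ID_Order. This relies on the IDs first appearing in df in sorted order, because summarise() returns the groups sorted by ID_Order while unique() keeps the order of appearance:

names(a_list) <- unique(df$ID_Order)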

Then you can use the package arules:

library(arules)

Get an object of class transactions:

transactions <- as(a_list, "transactions")

Extract the rules. You can set the minimum support and minimum confidence with supp and conf, respectively.

rules <- apriori(transactions, 
                 parameter = list(supp = 0.1, conf = 0.5, target = "rules"))

To inspect the rules use:

inspect(rules)

And this is what you get:

     lhs              rhs     support confidence lift      count
[1]  {}            => {typeZ} 0.50    0.50       1.0000000 2    
[2]  {}            => {typeX} 0.75    0.75       1.0000000 3    
[3]  {typeW}       => {typeX} 0.25    1.00       1.3333333 1    
[4]  {typeG}       => {typeX} 0.25    1.00       1.3333333 1    
[5]  {typeY}       => {typeZ} 0.25    1.00       2.0000000 1    
[6]  {typeZ}       => {typeY} 0.25    0.50       2.0000000 1    
[7]  {typeY}       => {typeX} 0.25    1.00       1.3333333 1    
[8]  {typeZ}       => {typeX} 0.25    0.50       0.6666667 1    
[9]  {typeY,typeZ} => {typeX} 0.25    1.00       1.3333333 1    
[10] {typeX,typeY} => {typeZ} 0.25    1.00       2.0000000 1    
[11] {typeX,typeZ} => {typeY} 0.25    1.00       4.0000000 1
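
To make the support, confidence and lift columns less opaque, here is a small base R sketch that recomputes them by hand for rule [6], {typeZ} => {typeY}. The helper support() and the list tx are names introduced for this illustration only; they are not part of arules.

```r
# Transactions from the question's sample, as a named list of item vectors.
tx <- list(
  A.1 = c("typeX", "typeZ", "typeY"),
  B.2 = c("typeX", "typeW"),
  C.3 = "typeZ",
  D.4 = c("typeX", "typeG")
)

# Support: share of transactions that contain every item in `items`.
# (Illustrative helper, not an arules function.)
support <- function(items, tx) {
  mean(vapply(tx, function(t) all(items %in% t), logical(1)))
}

lhs <- "typeZ"
rhs <- "typeY"

supp_rule <- support(c(lhs, rhs), tx)      # both sides together: 1 of 4
conf_rule <- supp_rule / support(lhs, tx)  # supp(lhs and rhs) / supp(lhs)
lift_rule <- conf_rule / support(rhs, tx)  # confidence / supp(rhs)

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

This gives support 0.25, confidence 0.5 and lift 2, matching row [6]. With only four transactions, every rule with count 1 rests on a single order, so the high lift values here say little about a real dataset.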
clemens
  • Hi @clemens, thanks for your answer, it is very detailed, but when I tried to run the script I got an error. When I run this part of the code `a_list <- df %>% group_by(ID_Order) %>% summarise(concat = concat(Model)) %>% pull(concat)` I get this error: `Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘pull’ for signature ‘"tbl_df"’` I saw that it is due to a conflict between packages, maybe git2r and dplyr, but I could not solve the problem. How did you get it to work? Thanks – Lorenzo Benassi Nov 08 '17 at 14:00
  • I cannot reproduce your error, but you can try to explicitly use `dplyr::pull(concat)` instead of `pull(concat)`. I am using R 3.4.1 and dplyr 0.7.4 – clemens Nov 08 '17 at 14:55
  • Thanks! Very helpful! – Lorenzo Benassi Nov 08 '17 at 17:14

From the examples section of ?transactions (arules package):

## example 4: creating transactions from a data.frame with 
## transaction IDs and items (by converting it into a list of transactions first) 
a_df3 <- data.frame(
  TID = c(1,1,2,2,2,3), 
  item=c("a","b","a","b","c","b")
  )
a_df3
trans4 <- as(split(a_df3[,"item"], a_df3[,"TID"]), "transactions")
trans4
inspect(trans4)
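
The question also asks about continuous variables. One possible approach, sketched here under assumptions, is to bin the ordered quantity with base R cut() and fold the bin into the item label before splitting by order. The cut points (5 and 12), the labels, and the column names quantity, level and item are all illustrative choices, not something taken from the question or from arules.

```r
# Sample rows from the question; `quantity` is an assumed column name.
df <- data.frame(
  ID_Order = c("A.1", "A.1", "A.1", "B.2", "B.2", "C.3", "D.4", "D.4"),
  Model    = c("typeX", "typeZ", "typeY", "typeX", "typeW", "typeZ",
               "typeX", "typeG"),
  quantity = c(20, 10, 5, 16, 12, 1, 8, 4),
  stringsAsFactors = FALSE
)

# Bin the continuous quantity into three categories.
# Assumed intervals: (0, 5] low, (5, 12] medium, (12, Inf] high.
df$level <- cut(df$quantity,
                breaks = c(0, 5, 12, Inf),
                labels = c("low", "medium", "high"))

# Combine model and quantity level into a single categorical item.
df$item <- paste(df$Model, df$level, sep = "=")

# One character vector of items per order, as in the split() example.
item_list <- split(df$item, df$ID_Order)
item_list
```

The resulting list has the same shape as the split() result in the example from ?transactions, so it can be coerced with as(item_list, "transactions") in the same way. Whether fixed cut points or quantile-based bins suit the data better depends on how the quantities are distributed.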
Michael Hahsler