R item lists to wide format

Question

I have a data frame of item lists, where each row contains an association rule (LHS and RHS) with its corresponding support, confidence and lift. Here's the data:

structure(list(rules = structure(c(13L, 4L, 28L, 1L, 24L, 15L
), .Label = c("{butter,jam} => {whole milk}", "{butter,rice} => {whole milk}", 
"{canned fish,hygiene articles} => {whole milk}", "{curd,cereals} => {whole milk}", 
"{domestic eggs,rice} => {whole milk}", "{grapes,onions} => {other vegetables}", 
"{hamburger meat,bottled beer} => {whole milk}", "{hamburger meat,curd} => {whole milk}", 
"{hard cheese,oil} => {other vegetables}", "{herbs,fruit/vegetable juice} => {other vegetables}", 
"{herbs,rolls/buns} => {whole milk}", "{herbs,shopping bags} => {other vegetables}", 
"{liquor,red/blush wine} => {bottled beer}", "{meat,margarine} => {other vegetables}", 
"{napkins,house keeping products} => {whole milk}", "{oil,mustard} => {whole milk}", 
"{onions,butter milk} => {other vegetables}", "{onions,waffles} => {other vegetables}", 
"{pastry,sweet spreads} => {whole milk}", "{pickled vegetables,chocolate} => {whole milk}", 
"{pork,butter milk} => {other vegetables}", "{rice,bottled water} => {whole milk}", 
"{rice,sugar} => {whole milk}", "{soups,bottled beer} => {whole milk}", 
"{tropical fruit,herbs} => {whole milk}", "{turkey,curd} => {other vegetables}", 
"{whipped/sour cream,house keeping products} => {whole milk}", 
"{yogurt,cereals} => {whole milk}", "{yogurt,rice} => {other vegetables}"
), class = "factor"), support = c(0.00193187595322827, 0.00101677681748856, 
0.00172852058973055, 0.00101677681748856, 0.00111845449923742, 
0.00132180986273513), confidence = c(0.904761904761905, 0.909090909090909, 
0.80952380952381, 0.833333333333333, 0.916666666666667, 0.8125
), lift = c(11.2352693602694, 3.55786275006331, 3.16819206791352, 
3.26137418755803, 3.58751160631383, 3.17983983286908)), .Names = c("rules", 
"support", "confidence", "lift"), row.names = c(NA, 6L), class = "data.frame")

What I need is to restructure these rules into a wide format, where each item in the LHS of a rule gets a designated column with a value of 1 (to indicate that the rule has that item in its LHS), and the same goes for the RHS of the rules. For example, taking the first 2 rules:

{liquor,red/blush wine} => {bottled beer} 0.0019 0.90 11.2
{curd,cereals} => {whole milk} 0.0010 0.91 3.6

The result should be a data frame that looks like:

'rules_id' 'lhs_liquor' 'lhs_red/blush wine' 'lhs_curd' 'lhs_cereals' 'rhs_bottled beer' 'rhs_whole milk' 'support' 'confidence' 'lift'
1 1 1 0 0 1 0 0.0019 0.90 11.2
2 0 0 1 1 0 1 0.0010 0.91 3.6

As I am new to R and Stack Overflow, please let me know if the question is not well defined. Any help appreciated.

Your last paragraph, since it is not part of the question itself, would usually go down here, in the comments. — Frank, Jun 17 '16 at 15:58
A long format would probably be more useful: `df %>% separate(col = rules, into = c('lhs', 'rhs'), sep = ' => ') %>% separate_rows(col = lhs, into = lhs, sep = ',') %>% gather(key = side, value = product, lhs, rhs) %>% mutate(product = gsub('[{}]', '', product))` — alistaire, Jun 17 '16 at 17:26
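
The long format suggested in this comment can be sketched as below. This is a minimal variant rather than the commenter's exact code: it assumes tidyr's `separate_rows(data, column, sep = ...)` call form, uses a hand-typed two-rule data frame `df` as a stand-in for the question's data, and adds a hypothetical `rule_id` column so that items can be traced back to their rule.

```r
library(dplyr)
library(tidyr)

# Two of the question's rules, typed in by hand (values rounded as in the question)
df <- data.frame(
  rules = c("{liquor,red/blush wine} => {bottled beer}",
            "{curd,cereals} => {whole milk}"),
  support = c(0.0019, 0.0010),
  confidence = c(0.90, 0.91),
  lift = c(11.2, 3.6),
  stringsAsFactors = FALSE
)

long <- df %>%
  mutate(rule_id = row_number()) %>%                       # hypothetical id column
  separate(rules, into = c("lhs", "rhs"), sep = " => ") %>% # split rule into its two sides
  gather(key = side, value = product, lhs, rhs) %>%         # stack lhs and rhs
  mutate(product = gsub("[{}]", "", product)) %>%           # drop the braces
  separate_rows(product, sep = ",")                         # one row per item

print(long)
```

Each row of `long` then holds one item together with its side (`lhs` or `rhs`) and the measures of the rule it came from.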

score 0 · Answer 1 · answered Jun 17 '16 at 16:32

You could do something like

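library(dplyr)
library(tidyr)
library(reshape2)
# `rules` is the data frame from the question
rules %>% 
  mutate(id = seq_len(n())) %>%                       # one id per rule
  separate(rules, c("lhs", "rhs"), "\\} => \\{") %>%  # split each rule into its two sides
  # by default separate_rows() splits on any run of characters other than letters,
  # digits and ".", so multi-word items such as "red/blush wine" become single words
  separate_rows(lhs) %>% filter(lhs != "") %>% 
  gather(value, var, lhs, rhs) %>%                    # stack lhs and rhs into one column
  mutate(var = paste(value, sub("}", "", var, fixed = TRUE), sep = "_")) %>%
  dcast(id + support + confidence + lift ~ var, fun.aggregate = function(x) (length(x) > 0) + 0L)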
#   id     support confidence      lift lhs_beer lhs_blush lhs_bottled lhs_butter lhs_cereals
# 1  1 0.001931876  0.9047619 11.235269        0         1           0          0           0
# 2  2 0.001016777  0.9090909  3.557863        0         0           0          0           1
# 3  3 0.001728521  0.8095238  3.168192        0         0           0          0           1
# 4  4 0.001016777  0.8333333  3.261374        0         0           0          1           0
# 5  5 0.001118454  0.9166667  3.587512        1         0           1          0           0
# 6  6 0.001321810  0.8125000  3.179840        0         0           0          0           0
#   lhs_curd lhs_house lhs_jam lhs_keeping lhs_liquor lhs_napkins lhs_products lhs_red
# 1        0         0       0           0          1           0            0       1
# 2        1         0       0           0          0           0            0       0
# 3        0         0       0           0          0           0            0       0
# 4        0         0       1           0          0           0            0       0
# 5        0         0       0           0          0           0            0       0
# 6        0         1       0           1          0           1            1       0
#   lhs_soups lhs_wine lhs_yogurt rhs_bottled beer rhs_whole milk
# 1         0        1          0                1              0
# 2         0        0          0                0              1
# 3         0        0          1                0              1
# 4         0        0          0                0              1
# 5         1        0          0                0              1
# 6         0        0          0                0              1

Feel free to use tidyr's spread instead of reshape2's dcast - I still find dcast more intuitive...
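
If multi-word items such as "red/blush wine" should stay in one column, as in the question's expected result, one possible variant splits on commas only and uses `spread`. This is a sketch under assumptions, not the answer's own code: the two-rule `df` is hand-typed, and the column names `side`, `item`, `var` and `flag` are made up for the example.

```r
library(dplyr)
library(tidyr)

# Two of the question's rules, typed in by hand (values rounded as in the question)
df <- data.frame(
  rules = c("{liquor,red/blush wine} => {bottled beer}",
            "{curd,cereals} => {whole milk}"),
  support = c(0.0019, 0.0010),
  confidence = c(0.90, 0.91),
  lift = c(11.2, 3.6),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(id = row_number()) %>%
  separate(rules, c("lhs", "rhs"), sep = " => ") %>%  # split rule into its two sides
  gather(side, item, lhs, rhs) %>%                    # stack lhs and rhs
  mutate(item = gsub("[{}]", "", item)) %>%           # drop the braces
  separate_rows(item, sep = ",") %>%                  # split on commas only
  mutate(var = paste(side, item, sep = "_"), flag = 1L) %>%
  select(-side, -item) %>%
  spread(var, flag, fill = 0L)                        # absent items become 0

print(wide)
```

Because the split happens on commas only, the column `lhs_red/blush wine` survives intact instead of becoming `lhs_red`, `lhs_blush` and `lhs_wine`.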

separate_rows() is an unknown function. Which package does this function belong to? — Nir Regev, Jun 17 '16 at 21:43
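
`separate_rows()` is exported by tidyr; to my knowledge it was added in version 0.5.0, so an older installation would not have it. A quick way to check a local installation, as a sketch (the variable name `has_separate_rows` is made up for the example):

```r
# Report the installed tidyr version and whether it exports separate_rows()
has_separate_rows <- NA
if (requireNamespace("tidyr", quietly = TRUE)) {
  print(packageVersion("tidyr"))
  has_separate_rows <- "separate_rows" %in% getNamespaceExports("tidyr")
  print(has_separate_rows)
} else {
  message("tidyr is not installed; install.packages('tidyr') fetches the current release")
}
```

If the check prints `FALSE`, updating tidyr should make the function available.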

score 0 · Accepted Answer · answered Jun 17 '16 at 17:42

You can do this.

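dummies <- function(x, prefix) {
    # all distinct items that occur anywhere in x
    x.names <- unique(unlist(strsplit(x, ',')))
    # one row per rule, one 0/1 column per item
    # (note: nrow(df) refers to the global `df`, which has one row per element of x)
    out <- array(0L, c(nrow(df), length(x.names)), list(NULL, x.names))
    # for rule i, set the columns named after its items to 1; `<<-` updates `out` above
    mapply(function(i, val) out[i, val] <<- 1L, 1:nrow(out), strsplit(x, ','))
    if (!missing(prefix))
        colnames(out) <- paste0(prefix, colnames(out))
    out
}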

pat <- '[{](.*)[}] => [{](.*)[}]'

cbind(as.data.frame(
    cbind(dummies(sub(pat, '\\1', df$rules), 'lhs.'),
          dummies(sub(pat, '\\2', df$rules), 'rhs.'))),
    df[c('support','confidence','lift')])

Output as follows:

  lhs.liquor lhs.red/blush wine lhs.curd lhs.cereals lhs.yogurt lhs.butter
1          1                  1        0           0          0          0
2          0                  0        1           1          0          0
3          0                  0        0           1          1          0
4          0                  0        0           0          0          1
5          0                  0        0           0          0          0
6          0                  0        0           0          0          0
  lhs.jam lhs.soups lhs.bottled beer lhs.napkins lhs.house keeping products
1       0         0                0           0                          0
2       0         0                0           0                          0
3       0         0                0           0                          0
4       1         0                0           0                          0
5       0         1                1           0                          0
6       0         0                0           1                          1
  rhs.bottled beer rhs.whole milk     support confidence      lift
1                1              0 0.001931876  0.9047619 11.235269
2                0              1 0.001016777  0.9090909  3.557863
3                0              1 0.001728521  0.8095238  3.168192
4                0              1 0.001016777  0.8333333  3.261374
5                0              1 0.001118454  0.9166667  3.587512
6                0              1 0.001321810  0.8125000  3.179840

Awesome! Very neat solution. The `out` and `mapply` statements are a little out of my league, so I would appreciate it if you could explain what's going on there. Thanks anyway — Nir Regev, Jun 17 '16 at 19:12
@NirRegev Actually, this solution is a little hacky. First problem: `dummies` has a hard-coded reference to `df`. The use of `mapply` here emulates `enumerate()` in Python, iterating over the elements of `1:nrow(out)` and `strsplit(x, ',')`. Also it relies on the `<<-` operator to make an assignment outside of the scope of the anonymous function. — Ernest A, Jun 17 '16 at 20:16
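
The two issues named in this comment can be avoided with a plain loop. The following is one possible rewrite, not the answer author's code: it sizes the matrix from `length(x)` instead of the global `df`, and it replaces `mapply` plus `<<-` with a `for` loop that assigns to `out` directly. The vector `rules` holds two hand-typed rules from the question.

```r
# Build a 0/1 indicator matrix from comma-separated item strings.
# `x` is a character vector such as c("liquor,red/blush wine", "curd,cereals").
dummies <- function(x, prefix = "") {
  items <- strsplit(as.character(x), ",", fixed = TRUE)  # list: items of each rule
  x.names <- unique(unlist(items))                        # all distinct items
  out <- matrix(0L, nrow = length(x), ncol = length(x.names),
                dimnames = list(NULL, x.names))
  for (i in seq_along(items)) {
    out[i, items[[i]]] <- 1L   # mark the items present in rule i
  }
  colnames(out) <- paste0(prefix, colnames(out))
  out
}

rules <- c("{liquor,red/blush wine} => {bottled beer}",
           "{curd,cereals} => {whole milk}")
pat <- "[{](.*)[}] => [{](.*)[}]"

# sub() with "\\1" keeps only the first captured group (the LHS items),
# "\\2" keeps only the second (the RHS items)
lhs <- dummies(sub(pat, "\\1", rules), "lhs.")
rhs <- dummies(sub(pat, "\\2", rules), "rhs.")
wide <- as.data.frame(cbind(lhs, rhs))

print(wide)
```

The loop does what the `mapply` call did: it walks the row index and the corresponding item vector in parallel and sets the matching columns to 1.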
