1

I am very new to R so I apologize if this question is relatively novice. I have a list of roughly 33,000 rows that shows order ID and products ordered within that order ID. Structured as follows:

Order.ID    Product  
193505  Pineapple Cheddar Burger  
193505  Onion Rings  
193564  Pineapple Cheddar Burger  
193623  Hamburger  
193623  French Fries  
193623  Fountain Soda  
193623  Hot Dog  
193631  Hamburger  
193631  French Fries  
193631  Milkshake  
193644  Daily Special  
193726  Hamburger  
193726  French Fries  
193726  Fountain Soda  
193728  Hamburger  
193728  French Fries  
193728  Fountain Soda  
193738  Hamburger  
193738  French Fries  
193762  Hamburger  
193762  French Fries  
193762  Fountain Soda  
193762  Hamburger  
193762  French Fries  
193762  Fountain Soda

I would like to generate a correlation matrix that reflects the correlation of each item being ordered in the same order ID as each other item. I started by using a table(ID, Product) which gave me products ordered within each order. Now I am stuck on how to go about counting the number of orders of each other product (product 2-k) given product 1 has been ordered. Ideally I would like to run that analysis for every product and turn that into a matrix but evaluating one properly seemed like the logical first step.

I would appreciate any guidance I can get on this problem. Thanks in advance.

Justin
  • 42,475
  • 9
  • 93
  • 111
LFoos24
  • 283
  • 2
  • 6
  • 12
  • 2
    Could you please edit the results of `dput(your_example_data)` into your question? It's not particularly easy to read your example table into R the way it is now. – Marius Feb 18 '13 at 23:07

4 Answers4

5

I'd recommend looking into the arules package (http://cran.r-project.org/web/packages/arules/index.html). Specifically you're looking for frequent item sets:

sets = apriori(asc, parameter=list(target="freq"))

It will be able to figure out all of the combinations. You're going to have to change the format of your data but table() should be a good start.

It will be able to return not just the 2-product distributions but all combinations.

You're looking for the support of a set, so use:

inspect(sets)
kmmats
  • 51
  • 2
  • In experimenting with this function I have only been able to generate single product rules. (Ex. apriori(NB2, parameter=list(supp=0.003,conf=0.05)) -> ap2 lhs rhs support confidence lift 1 {} => {Product=California Burger} 0.07200321 0.07200321 1 2 {} => {Product=French Fries} 0.07520883 0.07520883 1 3 {} => {Product=Make it a Meal} 0.08377770 0.08377770 1 4 {} => {Product=Hamburger} 0.16287026 0.16287026 1 ) Are there no associations or is R not registering order Ids as a factor properly? – LFoos24 Feb 20 '13 at 16:16
  • You're going to have to do some restructuring of your data from the original output you gave above - if it's reading in each line as a separate transaction that would explain why you're only seeing single products. I've primarily used this via unix, but given the very low support and confidence thresholds you've set - I would say this is an issue with the input – kmmats Mar 01 '13 at 07:33
1

If you use dcast from the reshape2 package, you'll get a matrix that you can then do a pairwise correlation function on:

data <- read.table(h=T, text="Order.ID    Product  
+ 193505  'Pineapple Cheddar Burger'  
+ 193505  'Onion Rings'  
+ 193564  'Pineapple Cheddar Burger'  
+ 193623  'Hamburger'  
+ 193623  'French Fries'  
+ 193623  'Fountain Soda'  
+ 193623  'Hot Dog'  
+ 193631  'Hamburger'  
+ 193631  'French Fries'  
+ 193631  'Milkshake'  
+ 193644  'Daily Special'  
+ 193726  'Hamburger'  
+ 193726  'French Fries'  
+ 193726  'Fountain Soda'  
+ 193728  'Hamburger'  
+ 193728  'French Fries'  
+ 193728  'Fountain Soda'  
+ 193738  'Hamburger'  
+ 193738  'French Fries'  
+ 193762  'Hamburger'  
+ 193762  'French Fries'  
+ 193762  'Fountain Soda'  
+ 193762  'Hamburger'  
+ 193762  'French Fries'  
+ 193762  'Fountain Soda'")



library(reshape2)

df <- dcast(data, Order.ID ~ Product)
cor(df[,-1], use = "pairwise")

                         Daily Special Fountain Soda French Fries Hamburger Hot Dog
Daily Special                   1.0000       -0.2868      -0.4375   -0.4375 -0.1250
Fountain Soda                  -0.2868        1.0000       0.8030    0.8030  0.2294
French Fries                   -0.4375        0.8030       1.0000    1.0000  0.1250
Hamburger                      -0.4375        0.8030       1.0000    1.0000  0.1250
Hot Dog                        -0.1250        0.2294       0.1250    0.1250  1.0000
Milkshake                      -0.1250       -0.2868       0.1250    0.1250 -0.1250
Onion Rings                    -0.1250       -0.2868      -0.4375   -0.4375 -0.1250
Pineapple Cheddar Burger       -0.1890       -0.4336      -0.6614   -0.6614 -0.1890
                         Milkshake Onion Rings Pineapple Cheddar Burger
Daily Special              -0.1250     -0.1250                  -0.1890
Fountain Soda              -0.2868     -0.2868                  -0.4336
French Fries                0.1250     -0.4375                  -0.6614
Hamburger                   0.1250     -0.4375                  -0.6614
Hot Dog                    -0.1250     -0.1250                  -0.1890
Milkshake                   1.0000     -0.1250                  -0.1890
Onion Rings                -0.1250      1.0000                   0.6614
Pineapple Cheddar Burger   -0.1890      0.6614                   1.0000
N8TRO
  • 3,348
  • 3
  • 22
  • 40
0

I think this is what you want. First, read in the data:

# Read in data
df<-read.table(
textConnection('
193505,Pineapple Cheddar Burger
193505,Onion Rings
193564,Pineapple Cheddar Burger
193623,Hamburger
193623,French Fries 
193623,Fountain Soda
193623,Hot Dog
193631,Hamburger
193631,French Fries
193631,Milkshake
193644,Daily Special
193726,Hamburger
193726,French Fries 
193726,Fountain Soda
193728,Hamburger
193728,French Fries
193728,Fountain Soda
193738,Hamburger
193738,French Fries
193762,Hamburger
193762,French Fries
193762,Fountain Soda
193762,Hamburger
193762,French Fries
193762,Fountain Soda
')
,sep=',')
names(df)<-c('id','food')

Now, build the matrix.

# Create a matrix of 1's and 0's with id on left and food and column.
id.by.food<-as.matrix(table(df$id,df$food))
# If someone ordered an item twice, you'll get a '2'. 
# Convert everything to 1's.
id.by.food<-pmin(id.by.food,1)
# Get a correlation matrix
round(cor(id.by.food),2)
                         Daily Special Fountain Soda French Fries Hamburger Hot Dog Milkshake Onion Rings
Daily Special                     1.00         -0.32        -0.50     -0.50   -0.12     -0.12       -0.12
Fountain Soda                    -0.32          1.00         0.63      0.63    0.40     -0.32       -0.32
French Fries                     -0.50          0.63         1.00      1.00    0.25      0.25       -0.50
Hamburger                        -0.50          0.63         1.00      1.00    0.25      0.25       -0.50
Hot Dog                          -0.12          0.40         0.25      0.25    1.00     -0.12       -0.12
Milkshake                        -0.12         -0.32         0.25      0.25   -0.12      1.00       -0.12
Onion Rings                      -0.12         -0.32        -0.50     -0.50   -0.12     -0.12        1.00
Pineapple Cheddar Burger         -0.19         -0.48        -0.76     -0.76   -0.19     -0.19        0.66
                         Pineapple Cheddar Burger
Daily Special                               -0.19
Fountain Soda                               -0.48
French Fries                                -0.76
Hamburger                                   -0.76
Hot Dog                                     -0.19
Milkshake                                   -0.19
Onion Rings                                  0.66
Pineapple Cheddar Burger                     1.00
nograpes
  • 18,623
  • 1
  • 44
  • 67
  • This is exactly what I was looking for, thank you for your help. Any suggestions on a good way to display this day? I have 66 products on both axis and its not incredibly useful on the RStudio interface. Thanks again! – LFoos24 Feb 19 '13 at 16:18
  • 1
    I dunno... maybe a heatmap would be a nice way to plot the correlations? There are a lot of examples on the web. Actually, there is a nice question about it [here](http://stackoverflow.com/questions/5453336/plot-correlation-matrix-into-a-graph) – nograpes Feb 19 '13 at 16:34
0

You are looking at Market Basket Analysis (or Affinity Analysis) and in particular defining and finding support, confidence, and lift. One of the best papers explaining it I found here.

arules as mentioned by kmmats is the package to use, nice example of using it is from CrossValidated site here.

Community
  • 1
  • 1
topchef
  • 19,091
  • 9
  • 63
  • 102