3

I am trying to do analysis on the repeat purchase behavior of the customers.

My goal is two-fold:

  1. Eliminate all the customer item zip combination that occurs only once (eliminate the non-repeating ones)
  2. For the records that repeat, I want to summarize the same where I need to get the average number of days between multiple orders and also the total sales

If I have the following data:

Customer#   Item        Zip     Date            Qty     Net Sales
---------   --------    -----   ----------      ---     ---------
ABC123      GHTH123     76137   2014-01-01      10      1500
XYZ999      ZZZZZZZ     68106   2015-02-01      1       50
DEF456      167AAAA     60018   2015-03-01      12      650
XYZ999      YYYYYYY     68106   2015-01-01      3       150 
XYZ999      ZZZZZZZ     68106   2015-04-01      10      500
XYZ999      YYYYYYY     68106   2015-03-01      12      600
XYZ999      YYYYYYY     68106   2015-05-01      10      500 
ABC123      GHTH123     76137   2014-01-15      8       1200
ABC234      N867689     23218   2014-01-01      10      1500
ABC123      DDFF121     76137   2014-01-27      15      2250

I am trying to get the following output:

Customer#   Item        Zip     Avg Days/Ord    Tot Ord     Total Amt
---------   --------    -----   ----------      -------     ---------
ABC123      GHTH123     76137   15              2           2700
XYZ999      ZZZZZZZ     68106   60              2           550
XYZ999      YYYYYYY     68106   60              3           1250

I was trying to use the sqldiff package and use DATEDIFF function but I am not getting anywhere with that as DATEDIFF doesnt work for R.

Can someone help me with a better approach here?

ayushku
  • 43
  • 1
  • 5
  • 1
    How do you calculate the Tot Ord? – akrun May 26 '15 at 07:32
  • Thank you for your help akrun. For the Tot Ord, I made a mistake in the table earlier. I have corrected the same now. Its actually number of times that the combination of Customer, Item, Zip repeated. I have updated the same in the table above. I am also looking how I can use the code that you shared. Thank you again. I will let you know how it works out. – ayushku May 26 '15 at 07:46
  • @ayshuku I posted the comment as an answer. Looks like it matches the expected output except the AvgDays for the 1st which is 14 for me, whereas you have 15 – akrun May 26 '15 at 07:54
  • This is perfect!! This works very well for me!!! Thank you for your help :) – ayushku May 26 '15 at 07:57

1 Answers1

2

You can try

library(dplyr)
df1 %>% 
    group_by(Customer, Item, Zip) %>%
    filter(n()>1) %>% 
    summarise(AvgDays=mean(diff(Date)),TotOrd= n(), TotAmt=sum(NetSales))
#   Customer    Item   Zip AvgDays TotOrd TotAmt
#1   ABC123 GHTH123 76137      14      2   2700
#2   XYZ999 ZZZZZZZ 68106      59      2    550
#3   XYZ999 YYYYYYY 68106      60      3   1250

Or

library(data.table)
setDT(df1)[, if(.N>1) list(AvgDays= mean(c(diff(Date))), TotOrd=.N, 
                  TotAmt=sum(NetSales)), .(Customer, Item, Zip)] 
#   Customer    Item   Zip AvgDays TotOrd TotAmt
#1:   ABC123 GHTH123 76137      14      2   2700
#2:   XYZ999 ZZZZZZZ 68106      59      2    550
#3:   XYZ999 YYYYYYY 68106      60      3   1250
akrun
  • 874,273
  • 37
  • 540
  • 662