3

I am very new to R and sqldf and can't seem to solve a basic problem. I have a file with transactions where each row represents a product purchased.

The file looks like this:

customer_id,order_number,order_date, amount, product_name
1, 202, 21/04/2015, 58, "xlfd"
1, 275, 16//08/2015, 74, "ghb"
1, 275, 16//08/2015, 36, "fjk"
2, 987, 12/03/2015, 27, "xlgm"
3, 376, 16/05/2015, 98, "fgt"
3, 368, 30/07/2015, 46, "ade"

I need to find the maximum amount spent in a single transaction (same order_number) by each customer_id. For example in case of customer_id "1" it would be (74+36)=110.

Jaap
  • 81,064
  • 34
  • 182
  • 193

3 Answers3

4

Assuming the dataframe is named orders, following will do the job:

sqldf("select customer_id, order_number, sum(amount) 
       from orders 
       group by customer_id, order_number")

Update: using nested query the following will give the desired output:

sqldf("select customer_id, max(total) 
       from (select customer_id, order_number, sum(amount) as total 
             from orders 
             group by customer_id, order_number) 
       group by customer_id")

Output:

    customer_id max(total)
1           1        110
2           2         27
3           3         98
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
Imran Ali
  • 2,223
  • 2
  • 28
  • 41
  • 2
    This returns total amount spent per user per purchase, while required output seems just the maximum amount in a single purchase out of all purchases for a user. Maybe take this output and extract `customer_id, max(sum(amount))` with `group by customer_id` ? – Aramis7d Feb 01 '17 at 08:49
  • @Elena Berrone, please accept the answer, see [What should I do when someone answers my question?](http://stackoverflow.com/help/someone-answers) – Imran Ali Feb 02 '17 at 09:48
4

In case sqldf is not a strict requirement.

Considering your input as dft , you can try:

require(dplyr)
require(magrittr)
dft %>% 
    group_by(customer_id, order_number) %>% 
    summarise(amt = sum(amount)) %>% 
    group_by(customer_id) %>% 
    summarise(max_amt = max(amt))

which gives:

Source: local data frame [3 x 2]
Groups: customer_id [3]

  customer_id max_amt
        <int>   <int>
1           1     110
2           2      27
3           3      98
Aramis7d
  • 2,444
  • 19
  • 25
1

We can also use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'customer_id', 'order_number', we get the sum of 'amount', do a second group by with 'customer_id' and get the max of 'Sumamount'

library(data.table)
setDT(df1)[, .(Sumamount = sum(amount)) , .(customer_id, order_number)
        ][,.(MaxAmount = max(Sumamount)) , customer_id]
#   customer_id MaxAmount
#1:           1       110
#2:           2        27
#3:           3        98

Or making it more compact, after grouping by 'customer_id', we split the 'amount' by 'order_number', loop through the list, get the sum, find the max to get the 'MaxAmount'

setDT(df1)[, .(MaxAmount = max(unlist(lapply(split(amount,
                      order_number), sum)))), customer_id]
#   customer_id MaxAmount
#1:           1       110
#2:           2        27
#3:           3        98

Or using base R

aggregate(amount~customer_id, aggregate(amount~customer_id+order_number, 
                         df1, sum), FUN = max)
akrun
  • 874,273
  • 37
  • 540
  • 662