
I have a dataframe with customer-order data that looks like the following:

cust    order_number     order_date      total   product_id
 1      1235846868       2020-01-27       20.0   Product A
 1      1235846869       2020-01-27       14.0   Product B
 2      1245485221       2020-05-16       11.1   Product B, Product C, Product D
 3      1285784226       2020-07-10       24.0   Product D
 4      5412151256       2020-03-27       12.0   Product A
 4      5412151290       2020-04-13       23.0   Product C, Product B
 5      5481581554       2020-02-18       12.0   Product D

As you can see above, there are customers (such as customer "1") who have ordered multiple times on the same day (probably because they forgot to put something into their basket on their first order). I would like to aggregate these multiple same-day orders per customer, but preserve all other columns I have in the dataset (such as product_id, order_number, etc.). The output table should look something like this:

cust    order_number                order_date      total   product_id
 1      1235846868, 1235846869      2020-01-27       34.0   Product A, Product B
 2      1245485221                  2020-05-16       11.1   Product B, Product C, Product D
 3      1285784226                  2020-07-10       24.0   Product D
 4      5412151256                  2020-03-27       12.0   Product A
 4      5412151290                  2020-04-13       23.0   Product C, Product B
 5      5481581554                  2020-02-18       12.0   Product D

Thank you!

emil_rore

1 Answer


Using dplyr one way would be to sum the total column and create a comma-separated string for order_number and product_id columns.

library(dplyr)

df %>%
  group_by(cust, order_date) %>%
  summarise(total = sum(total, na.rm = TRUE), 
            across(c(order_number, product_id), toString))

#   cust order_date total order_number           product_id                     
#  <int> <chr>      <dbl> <chr>                  <chr>                          
#1     1 2020-01-27  34   1235846868, 1235846869 Product A, Product B           
#2     2 2020-05-16  11.1 1245485221             Product B, Product C, Product D
#3     3 2020-07-10  24   1285784226             Product D                      
#4     4 2020-03-27  12   5412151256             Product A                      
#5     4 2020-04-13  23   5412151290             Product C, Product B           
#6     5 2020-02-18  12   5481581554             Product D
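
The same `across()` pattern scales to wider datasets: instead of listing every column, you can select them by predicate, position, or name pattern. A minimal self-contained sketch (with hypothetical sample data, not the asker's full dataset) that sums every numeric column and collapses every character column within each group:

```r
library(dplyr)

# Hypothetical sample mirroring the question's structure
df <- tibble::tibble(
  cust         = c(1, 1, 2),
  order_number = c("1235846868", "1235846869", "1245485221"),
  order_date   = c("2020-01-27", "2020-01-27", "2020-05-16"),
  total        = c(20, 14, 11.1),
  product_id   = c("Product A", "Product B", "Product B, Product C, Product D")
)

df %>%
  group_by(cust, order_date) %>%
  summarise(
    across(where(is.numeric), \(x) sum(x, na.rm = TRUE)),  # sum all numeric columns
    across(where(is.character), toString),                 # collapse all character columns
    .groups = "drop"                                       # return an ungrouped tibble
  )
```

Other selection styles work the same way, e.g. `across(2:5, toString)` by position or `across(starts_with("prod"), toString)` by name pattern. Note this assumes id-like columns such as `order_number` are stored as character; if they are numeric they would be summed instead of concatenated.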
Ronak Shah
  • Hi Ronak! Thanks again for the very fast reply! I will try it out immediately! One more question: my dataset consists of 30 columns - is there a shorter way to aggregate these as well, or would you say I need to summarise each single one? Thank you! – emil_rore Jul 31 '20 at 09:05
  • 1
    You can use `across` as shown in the answer. You can select multiple columns based on position (2 to 5), or by range of columns (A to D) or based on pattern in their name (starts with A etc). – Ronak Shah Jul 31 '20 at 09:17
  • Thank you! I am running the code now, but I think again due to dataset-size (>2 million observations) it takes ages, unfortunately. Is there an alternative you could think of to the code above? – emil_rore Jul 31 '20 at 09:21
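
One commonly used alternative for data of that size is data.table, whose grouped aggregation is typically faster than dplyr's on millions of rows. This is a sketch of the same aggregation, not the answerer's code, using hypothetical sample data with the same column names:

```r
library(data.table)

# Hypothetical sample mirroring the question's structure
dt <- data.table(
  cust         = c(1, 1, 2),
  order_number = c("1235846868", "1235846869", "1245485221"),
  order_date   = c("2020-01-27", "2020-01-27", "2020-05-16"),
  total        = c(20, 14, 11.1),
  product_id   = c("Product A", "Product B", "Product B, Product C, Product D")
)

# Grouped aggregation: sum totals, concatenate ids, one row per cust/date
dt[, .(order_number = toString(order_number),
       total        = sum(total, na.rm = TRUE),
       product_id   = toString(product_id)),
   by = .(cust, order_date)]
```

Setting a key first with `setkey(dt, cust, order_date)` can speed up repeated grouping on the same columns.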