0

Using R, is there a way to reformat a dataset in a way similar to doing a pivot table in excel? My data has 5 variables in total. Three variables are Date, Channel and Category, and there are two metric variables Views and Spend. Is there a way that I can generate timeseries data, whereby I have the Date in rows, and to auto-generate new variables based on permutation of Channel & Category, for each of metric variables, Views and Spend? This question is different from other questions, because I want the variable names to be part of the variable name.

The start file looks like this

Date=c("01/01/2020","01/01/2020","01/01/2020","01/01/2020","01/01/2020","08/01/2020","08/01/2020","15/01/2020","15/01/2020","15/01/2020","15/01/2020","22/01/2020","22/01/2020","22/01/2020","22/01/2020","22/01/2020","22/01/2020","22/01/2020","29/01/2020","29/01/2020","05/02/2020","05/02/2020","05/02/2020")
Channel=c("TV","TV","TV","Internet","TV","Internet","TV","Internet","TV","TV","Internet","TV","Internet","TV","TV","Internet","TV","TV","Internet","TV","Internet","TV","Internet")
Category=c("CatA","CatA","CatA","CatA","CatB","CatB","CatB","CatB","CatA","CatB","CatB","CatA","CatB","CatB","CatB","CatB","CatB","CatB","CatB","CatA","CatA","CatA","CatA")
Views=c(190,320,260,300,240,190,200,190,230,30,370,260,350,240,330,190,290,220,230,180,230,310,270)
Spend=c(34,63,46,53,21,23,17,24,20,5,50,42,46,39,44,31,72,54,58,22,29,41,36)
df <- data.frame(Date,Channel,Category,Views,Spend)
df

> df
         Date  Channel Category Views Spend
1  01/01/2020       TV     CatA   190    34
2  01/01/2020       TV     CatA   320    63
3  01/01/2020       TV     CatA   260    46
4  01/01/2020 Internet     CatA   300    53
5  01/01/2020       TV     CatB   240    21
6  08/01/2020 Internet     CatB   190    23
7  08/01/2020       TV     CatB   200    17
8  15/01/2020 Internet     CatB   190    24
9  15/01/2020       TV     CatA   230    20
10 15/01/2020       TV     CatB    30     5
11 15/01/2020 Internet     CatB   370    50
12 22/01/2020       TV     CatA   260    42
13 22/01/2020 Internet     CatB   350    46
14 22/01/2020       TV     CatB   240    39
15 22/01/2020       TV     CatB   330    44
16 22/01/2020 Internet     CatB   190    31
17 22/01/2020       TV     CatB   290    72
18 22/01/2020       TV     CatB   220    54
19 29/01/2020 Internet     CatB   230    58
20 29/01/2020       TV     CatA   180    22
21 05/02/2020 Internet     CatA   230    29
22 05/02/2020       TV     CatA   310    41
23 05/02/2020 Internet     CatA   270    36

I would like the reformatted dataframe to look like this

Date=c("01/01/2020","08/01/2020","15/01/2020","22/01/2020","29/01/2020","05/02/2020")
TV.CatA.Views=c(770,0,230,260,180,310)
TV.CatB.Views=c(240,200,30,1080,0,0)
Internet.CatA.Views=c(300,0,0,0,0,500)
Internet.CatB.Views=c(0,190,560,540,230,0)
TV.CatA.Spend=c(143,0,20,42,22,41)
TV.CatB.Spend=c(21,17,5,209,0,0)
Internet.CatA.Spend=c(53,0,0,0,0,65)
Internet.CatB.Spend=c(0,23,74,77,58,0)
df_result <- data.frame(Date,TV.CatA.Views,TV.CatB.Views,Internet.CatA.Views,Internet.CatB.Views,TV.CatA.Spend,TV.CatB.Spend,Internet.CatA.Spend,Internet.CatB.Spend)
df_result 

> df_result 
        Date TV.CatA.Views TV.CatB.Views Internet.CatA.Views Internet.CatB.Views TV.CatA.Spend
1 01/01/2020           770           240                 300                   0           143
2 08/01/2020             0           200                   0                 190             0
3 15/01/2020           230            30                   0                 560            20
4 22/01/2020           260          1080                   0                 540            42
5 29/01/2020           180             0                   0                 230            22
6 05/02/2020           310             0                 500                   0            41
  TV.CatB.Spend Internet.CatA.Spend Internet.CatB.Spend
1            21                  53                   0
2            17                   0                  23
3             5                   0                  74
4           209                   0                  77
5             0                   0                  58
6             0                  65                   0

The variable names don't need to be exactly how I've specified above, just as long as it's possible to recognise what those levels are in the variable. Currently, I've been doing this in excel but after doing over 50 of them in succession, I need to find a more efficient way. Thanks for taking time to look at my question, any help is greatly appreciated.

H.Cheung
  • 855
  • 5
  • 12
  • 1
    you can have a look at `pivot_wider` with the library `tidyr`, or `dcast` from the library `data.table` – denis Aug 07 '20 at 13:58
  • @denis thanks, similar but not quite the same. I want to also auto generate the variable names to include the Channel/Category/Metric – H.Cheung Aug 07 '20 at 14:36

1 Answers1

2

This code produces something similar to what you want, using df you added:

library(tidyverse)
#Code
mdf <- df %>% group_by(Date,Channel,Category) %>% summarise_all(.funs = sum) %>%
  ungroup() %>% pivot_wider(names_from = c(Channel,Category),values_from = c(Views,Spend))

Output:

        Date Views_Internet_CatA Views_TV_CatA Views_TV_CatB Views_Internet_CatB Spend_Internet_CatA
1 01/01/2020                 300           770           240                  NA                  53
2 05/02/2020                 500           310            NA                  NA                  65
3 08/01/2020                  NA            NA           200                 190                  NA
4 15/01/2020                  NA           230            30                 560                  NA
5 22/01/2020                  NA           260          1080                 540                  NA
6 29/01/2020                  NA           180            NA                 230                  NA
  Spend_TV_CatA Spend_TV_CatB Spend_Internet_CatB
1           143            21                  NA
2            41            NA                  NA
3            NA            17                  23
4            20             5                  74
5            42           209                  77
6            22            NA                  58
Duck
  • 39,058
  • 13
  • 42
  • 84
  • 1
    OP could also use `summarise(across(.fns = sum))` since `summarise_all()` has been superseded by across, right? – Eric Aug 07 '20 at 14:46
  • @Duck, the numbers are incorrect using this method. Compare to my df_result the numbers do not match – H.Cheung Aug 07 '20 at 14:48
  • 1
    @H.Cheung I have checked, except for `NA` that can be replaced by zero, all values are equal. The position of variables changed and differ from yours but results are equal. Also in `df` you have date `05/02/2020` but in your final result is not present. I think is a typo. – Duck Aug 07 '20 at 14:57
  • @Duck, thanks apologies my mistake. I have now edited the question to correct the error, i didn't add the final row when creating code for df_result. Your formula worked, but because the table wasn't ordered by date, this is why it looked different to me. I can re-order and remove NAs. Many thanks for your help – H.Cheung Aug 07 '20 at 15:17
  • @H.Cheung Great it worked!! Nice coding for you :) – Duck Aug 07 '20 at 15:18