3

I would like to add seasonality dummies in my R data.table based on quarters. I have looked at multiple examples but I haven't been able to solve this issue yet. My knowledge about R is limited so I was wondering if you could get me on the right track.

My data.table looks like this:

    Year_week  artist_id  number_of_events number_of_streams
   1:     16/30    8296         1            957892
   2:     16/33    8296         6            882282
   3:     16/34    8296         5            926037
   4:     16/35    8296         2            952704
   5:     15/37    17879        1             89515
   6:     16/22    22690        2            119653

What I would like to have is a format like this:

 Year_week  artist_id  number_of_events number_of_streams Q2 Q3 Q4
   1:     16/50    8296         1            957892        0  0  1       
Sotos
  • 51,121
  • 6
  • 32
  • 66
MRJJ17
  • 117
  • 1
  • 1
  • 8

3 Answers3

4

Two approaches:

1) Using dcast, cut and sub:

dcast(DT[, Q := cut(as.integer(sub('.*/','',Year_week)),
                    breaks = c(0,13,26,39,53),
                    labels = paste0('Q',1:4))],
      Year_week + artist_id + number_of_events + number_of_streams ~ Q,
      value.var = 'Q',
      drop = c(TRUE,FALSE),
      fun = length)

gives:

   Year_week artist_id number_of_events number_of_streams Q1 Q2 Q3 Q4
1:     15/37     17879                1             89515  0  0  1  0
2:     16/22     22690                2            119653  0  1  0  0
3:     16/30      8296                1            957892  0  0  1  0
4:     16/33      8296                6            882282  0  0  1  0
5:     16/34      8296                5            926037  0  0  1  0
6:     16/35      8296                2            952704  0  0  1  0

What this does:

  • as.integer(sub('.*/','',Year_week)) extacts the weeknumber from the Year_week column
  • Using cut you divide that into quarters with the appropriate labels (see also ?cut)
  • With dcast you transpose the quarter column to wide format with a aggregation function (length). By using drop = c(TRUE,FALSE) in the dcast function, you make sure that all quarters are included.

Notes:

  • The Q-column is an ordered factor, so you can use that to arrange and filter your data as well.
  • Depending on the use of the dummy columns: you don't always need those. When you want to use them as grouping or filtering variables, you can just work with the Q variable.
  • However, some statistical tests require dummy variables (which justifies the dcast step).

2) Using cut, sub and lapply:

DT[, Q := cut(as.integer(sub('.*/','',Year_week)),
              breaks = c(0,13,26,39,53),
              labels = paste0('Q',1:4))
   ][, paste0('Q',1:4) := lapply(paste0('Q',1:4), function(q) as.integer(q == Q))][]

which gives a similar result. Instead of transposing with dcast, you just check whether one of the quarter labels is in the Q-column.


Used data:

DT <- fread(' Year_week  artist_id  number_of_events number_of_streams
     16/30    8296         1            957892
     16/33    8296         6            882282
     16/34    8296         5            926037
     16/35    8296         2            952704
     15/37    17879        1             89515
     16/22    22690        2            119653')
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • 1
    Or if you want to add an external library, `qdapTools::mtabulate(cut(as.integer(sub('.*/','',df$Year_week)), c(0,13,26,39,52), paste0('Q',1:4)))` – Sotos Jul 10 '17 at 12:01
  • Thanks for your solution! When I run the code I get the following message: Using 'Q' as value column. Use 'value.var' to override. Does this cause any issues? Furthermore, the code adds an extra collumn called NA. Do you know why this happens? – MRJJ17 Jul 10 '17 at 12:18
  • @JLMR17 It gives the same result as explicitely specifying it with `value.var = 'Q'` (the latter is better imo to avoid confusion). Also I updated the answer to make sure all quarters are included. HTH – Jaap Jul 10 '17 at 12:25
  • @JLMR17 My first guess for the NA-values might that there are weeks with 53 as a number. I've updated the code to include those cases as well. – Jaap Jul 10 '17 at 12:28
1

I assumed that Year_week is where we can extract the date of the entry.

library(data.table)

whichQuart <- function(x){
  data.frame(+(x <= 13),
    +(x >13 & x <= 26),
    +(x > 26 & x <= 39),
    +(x > 39 & x <= 52))
}

dt <-     setDT(read.table(text="Year_week  artist_id  number_of_events number_of_streams
1:     16/30    8296         1            957892
2:     16/33    8296         6            882282
3:     16/34    8296         5            926037
4:     16/35    8296         2            952704
5:     15/37    17879        1             89515
6:     16/22    22690        2            119653", header=TRUE, stringsAsFactors=FALSE))

dt[, week := strsplit(Year_week, "/")[2]]  
dt[, c("Q1", "Q2", "Q3", "Q4") := whichQuart(week)]

#   Year_week artist_id number_of_events number_of_streams week Q1 Q2 Q3 Q4
#1:     16/30      8296                1            957892   16  0  1  0  0
#2:     16/33      8296                6            882282   33  0  0  1  0
#3:     16/34      8296                5            926037   16  0  1  0  0
#4:     16/35      8296                2            952704   33  0  0  1  0
#5:     15/37     17879                1             89515   16  0  1  0  0
#6:     16/22     22690                2            119653   33  0  0  1  0
emilliman5
  • 5,816
  • 3
  • 27
  • 37
-3

add a quarter column to your df

df$quarter <- as.factor(df$quarter)
df <- cbind(df, model.matrix(~quarter, df))

Hopefully this works!

Kalees Waran
  • 659
  • 6
  • 13
  • 2
    Where did quarter come from? – Sotos Jul 10 '17 at 11:53
  • those are some if statements, like if week number is less than 13, quarter would be 1. 13 - 26, would be quarter 2 and so on till quarter 4. I did not want to implement that here, I left it for OP to do it himself. It was not even the main objective of the program. Main objective was to create a dummy variable and bind it to the dataframe and I addressed it. @Sotos – Kalees Waran Jul 10 '17 at 11:58
  • and using model.matrix is more general way of creating dummy variables. – Kalees Waran Jul 10 '17 at 12:00
  • 2
    You need to add complete answers that fully address the question. Remember, more people will visit this page to get help on this question. Complete answers are a must. – Sotos Jul 10 '17 at 12:00
  • For people down voting, please check the title of the question and I addressed it. I'm not going to edit answer to include a column for quarter. – Kalees Waran Jul 10 '17 at 12:21
  • Ok. But at least don't question as to why people are downvoting an **incomplete** answer – Sotos Jul 10 '17 at 12:35
  • Enlighten me, where I questioned, "why people are down voting". Neither did I ask for up votes in first place. I just introduced OP to a nice function, `model.matrix`, which creates dummy variables for all situations, unlike the situational answers been provided here. What is bothering you @Sotos ?. You can go on with flagging this answer for all you wish. – Kalees Waran Jul 10 '17 at 12:41
  • 1
    First of all do NOT make this personal. It is not what it bothers me, or you or anybody else. It is how [WE should answer questions in SO](https://stackoverflow.com/help/how-to-answer). There are guidelines in order to make this site as valuable as possible. And at this point, the 4(!) downvotes are IMO due to the fact that you refuse to update your answer with the creation of `quarter` variable. Also, I don't need to flag your answer. – Sotos Jul 10 '17 at 13:05