I have a vector of dates like this:

ds <- lubridate::as_date(c("2015-11-23", "2015-11-24", "2015-11-25", 
     "2015-11-26", "2015-11-27", "2015-11-30", "2015-12-01", "2015-12-02",
                           "2015-12-03", "2015-12-04"))

This vector contains dates in increasing order, but some days in between are missing. In this example, Nov 28 and Nov 29 are missing.

I now want to turn these dates into dummies.

One set of dummies should just indicate the month; the other should indicate the position within each month. In the above example, the first observed value in Nov 2015 is Nov 23, 2015.

In this case the result would be:

df <- data.frame(November = c(1, 1, 1, 1, 1, 1, 0 ,0 ,0 ,0),
                 December = c(0, 0, 0, 0, 0, 0, 1 ,1 ,1 ,1),
                 d1 = c(1, 0,0,0,0,0,1,0,0,0),
                 d2 = c(0, 1,0,0,0,0,0,1,0,0),
                 d3 = c(0, 0,1,0,0,0,0,0,1,0),
                 d4 = c(0, 0,0,1,0,0,0,0,0,1),
                 d5 = c(0, 0,0,0,1,0,0,0,0,0),
                 d6 = c(0, 0,0,0,0,1,0,0,0,0)) 

> df
   November December d1 d2 d3 d4 d5 d6
1         1        0  1  0  0  0  0  0
2         1        0  0  1  0  0  0  0
3         1        0  0  0  1  0  0  0
4         1        0  0  0  0  1  0  0
5         1        0  0  0  0  0  1  0
6         1        0  0  0  0  0  0  1
7         0        1  1  0  0  0  0  0
8         0        1  0  1  0  0  0  0
9         0        1  0  0  1  0  0  0
10        0        1  0  0  0  1  0  0

where d1 marks the first observed date in a given month, d2 the second, and so on. Please note that it should generalize to many years.

What I tried is this:

nov <- ds[months(ds) == 'November']

d1 <- ifelse(ds %in% nov & ds == dplyr::first(nov), 1, 0 )
spore234
  • what have you tried so far? – mondano Jul 17 '18 at 07:24
  • @mondano I added my tries but they weren't successful and do not generalize – spore234 Jul 17 '18 at 07:35
  • You probably want to take a look at `model.matrix`; but the question I have is *why* would you want to do that? – Maurits Evers Jul 17 '18 at 07:36
  • That is a weird structure. What are you trying to achieve? – Sotos Jul 17 '18 at 07:37
  • @MauritsEvers I do not want a dummy for every single date, just dummies for the way I described. – spore234 Jul 17 '18 at 07:37
  • @spore234 Yes, but *why* would you want to encode days as dummies in the way you describe. This doesn't make a lot of sense to me. Especially not in the context of a (potential?) statistical model where you'd commonly encode categorical variables via dummy variables; which in this case would make no sense. Can you elaborate on what you're trying to do with these dummy variables? – Maurits Evers Jul 17 '18 at 07:52
  • @MauritsEvers it's relevant in a business application where sales do not happen every day but the number of days where they happen are predictive. – spore234 Jul 17 '18 at 07:54
  • @spore234 I would understand if you encode days as the *day of the week*; in fact ARIMA models allow you to characterise such effects. But in your example you're ranking days; for example, `d1` of one month could correspond to the 1st of a month, and the 23rd of another month. So in such a case, the dummy encoding makes statistically no sense. If you expect some periodic effect, ARIMA models would be the way forward IMO. – Maurits Evers Jul 17 '18 at 08:12

1 Answer

If I understand correctly, the OP wants to create dummy variables for every month and for the events in order of appearance within each month.

This can be solved using the dcast() and rowid() functions from the data.table package:

ds <- lubridate::as_date(c("2015-11-23", "2015-11-24", "2015-11-25", 
                           "2015-11-26", "2015-11-27", "2015-11-30", "2015-12-01", "2015-12-02",
                           "2015-12-03", "2015-12-04"))

library(data.table)
tmp <- data.table(ds)[, month := format(ds, "%Y-%m")]
dcast(tmp, ds ~ month, length, value.var = "ds")[
  dcast(tmp, ds ~ sprintf("d%02i", rowid(month)), length, value.var = "ds"),
  on = "ds"][, -"ds"]
    2015-11 2015-12 d01 d02 d03 d04 d05 d06
 1:       1       0   1   0   0   0   0   0
 2:       1       0   0   1   0   0   0   0
 3:       1       0   0   0   1   0   0   0
 4:       1       0   0   0   0   1   0   0
 5:       1       0   0   0   0   0   1   0
 6:       1       0   0   0   0   0   0   1
 7:       0       1   1   0   0   0   0   0
 8:       0       1   0   1   0   0   0   0
 9:       0       1   0   0   1   0   0   0
10:       0       1   0   0   0   1   0   0

Explanation

The date vector is turned into a data.table object where a column is added which represents the year and month in an unambiguous format (ISO 8601).

Then, dcast() is called twice: (1) to create the dummy variables for each month, (2) to create the dummy variables for the events. rowid(month) counts the events in order of appearance within each month. sprintf() is used to format the column headers with a leading 0 in case there are more than 9 events per month.
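To make the second dcast() call easier to follow, here is a minimal sketch of the intermediate values it works with. The helper columns pos and label are only added for illustration and are not part of the solution above; as.Date() from base R is used in place of lubridate::as_date() to keep the snippet self-contained.

```r
library(data.table)

ds <- as.Date(c("2015-11-23", "2015-11-24", "2015-11-25", "2015-11-26",
                "2015-11-27", "2015-11-30", "2015-12-01", "2015-12-02",
                "2015-12-03", "2015-12-04"))

tmp <- data.table(ds)[, month := format(ds, "%Y-%m")]

# illustrative helper columns (not used in the actual solution):
# pos   = running count of rows within each month, in order of appearance
# label = the column header that dcast() will create for that position
tmp[, pos := rowid(month)]
tmp[, label := sprintf("d%02i", pos)]

tmp[]
```

The counter in pos restarts at 1 whenever a new year-month begins, which is why the approach carries over to data spanning several years.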

Each call to dcast() creates one part of the final solution. Both parts are combined by joining on the dates. Finally, the ds column is removed.
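For comparison, the same two-part idea can probably be sketched without data.table as well. This is an alternative in base R, not the method described above, and it assumes, as stated in the question, that ds is already sorted in increasing order. The names pos, month_dummies, pos_dummies and res are made up for this sketch.

```r
ds <- as.Date(c("2015-11-23", "2015-11-24", "2015-11-25", "2015-11-26",
                "2015-11-27", "2015-11-30", "2015-12-01", "2015-12-02",
                "2015-12-03", "2015-12-04"))

# year-month key, so that the same month in different years stays separate
month <- format(ds, "%Y-%m")

# position of each date within its month (assumes ds is sorted)
pos <- ave(seq_along(ds), month, FUN = seq_along)

# part 1: one 0/1 column per year-month
month_dummies <- sapply(unique(month), function(m) as.integer(month == m))

# part 2: one 0/1 column per position within the month
pos_dummies <- sapply(seq_len(max(pos)), function(k) as.integer(pos == k))
colnames(pos_dummies) <- sprintf("d%02i", seq_len(max(pos)))

# combine both parts; check.names = FALSE keeps headers like "2015-11"
res <- data.frame(month_dummies, pos_dummies, check.names = FALSE)
res
```

Because both parts are built row by row from the same vector, no join is needed here; the columns can simply be bound side by side.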

Uwe