0

I am collaborating on a project that requires me to use R of which I don't have any experience with to date. I am trying to apply auto arima to partitions/windows within my dataset and I haven't the slightest clue on how to even begin.3

Essentially, I want to train a separate model on each partner_id using the rows c_id = "none" and then forecast/predict values out to the max(date) for each partner_id. The number of months/rows for each partner vary in length. For this example data frame pasted below, partner_id = "1A9" has 12 months/rows with c_id = "none" vs partner_id = "1B9" has 13 months/row with c_id = "none". The number of months/rows extended out to the max(Date) within each partner_is varies as well. This is tricky as I assume I need to dynamically input how many months/rows to train on and how many months/rows to predict on for each partner_id.

I've included a sample dataset below.

x <- data.frame("c_id" = c("none","none","none","none","none",
"none","none","none","none","none","none","none","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101", "none","none","none","none","none","none","none","none","none","none","none","none","none","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111"), "partner_id" = c("1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9"), "rev_month" = as.Date(c("2016-01-01","2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01","2016-06-01","2016-07-01","2016-08-01", "2016-09-01","2016-10-01","2016-11-01","2016-12-01","2017-01-01","2017-02-01","2017-03-01","2017-04-01","2017-05-01","2017-06-01","2017-07-01","2017-08-01","2017-09-01","2017-10-01","2017-11-01","2017-12-01","2018-01-01","2018-02-01","2018-03-01","2018-04-01","2018-05-01","2018-06-01","2018-07-01","2018-08-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01", "2017-01-01","2017-01-01","2017-02-01","2017-03-01","2017-04-01","2017-05-01","2017-06-01","2017-07-01","2017-08-01", "2017-09-01","2017-10-01","2017-11-01","2017-12-01","2018-01-01","2018-02-01","2018-03-01","2018-04-01","2018-05-01","2018-06-01","2018-07-01","2018-08-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01","2019-01-01","2019-02-01","2019-03-01","2019-04-01","2019-05-01","2019-06-01","2019-07-01","2019-08-01","2019-09-01","2019-10-01","2019-11-01","2019-12-01", "2020-01-01", "2020-02-01", "2020-03-01")), "rev" = c(101.25, 102.25, 103.50, 103.75, 104.15, 104.25, 104.3, 105.00, 105.20, 105.60, 106.00, 106.10, 106.50, 101.50, 100.30, 107.50, 108.30, 108.45, 109.10, 110.10, 112.15, 112.45, 114.65, 115.00, 116.00, 116.50, 117.25, 117.85, 119.25, 119.95, 120.20, 121.50, 122.30, 122.40, 123.25, 123.75, 124.00, 101.25, 102.25, 103.50, 103.75, 104.15, 104.25, 104.3, 105.00, 105.20, 105.60, 106.00, 106.10, 106.50, 101.50, 100.30, 107.50, 108.30, 108.45, 109.10, 110.10, 112.15, 112.45, 114.65, 115.00, 116.00, 116.50, 117.25, 117.85, 119.25, 119.95, 120.20, 121.50, 122.30, 122.40, 123.25, 123.75, 124.00, 124.10, 125.35, 125.45), stingsAsFactors=FALSE)

My apologies for not having any code starter code yet as I am still trying to think about this conceptually while not having much experience with R at all. Ultimately, I'd like to add the column of predictions and confidence intervals back to my original dataframe. I'd be open to any R and/or Python Solutions.

Phil
  • 7,287
  • 3
  • 36
  • 66
bbal20
  • 113
  • 4
  • 11

1 Answers1

0

My answer is wrong on many levels from the programming point of view concerning R and time-series. The main aspects are (there are other issues but I understand that your concern is making it work asap):

  1. frist and foremost a loop should be avoided - BUT my guess is, that a vectorized solution would make it harder for you to understand

  2. using arima for a time-series that have not at least two complete cycles (years in this case) is not very promissing if your are looking to pick up seasonal patterns.

If your are genuinely interested in the topic of time-series predictions in R then read this book: https://otexts.com/fpp2/

A relevant side problem is your testing data: both series for partner have a repeated date on the first and second position that does not fly with time-series prediction of fixed periods/intervals - I just lagged the first to make things work. Therefore the new training data is this (we do not need the stringsAsFactores=FALSE):

 x <- data.frame(c_id = c("none","none","none","none","none","none","none","none","none","none","none","none","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-100","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101","c-101", "none","none","none","none","none","none","none","none","none","none","none","none","none","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-110","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111","c-111"), "partner_id" = c("1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1A9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9","1B9"),
                rev_month = as.Date(c("2015-12-01","2016-01-01","2016-02-01","2016-03-01","2016-04-01","2016-05-01","2016-06-01","2016-07-01","2016-08-01", "2016-09-01","2016-10-01","2016-11-01","2016-12-01","2017-01-01","2017-02-01","2017-03-01","2017-04-01","2017-05-01","2017-06-01","2017-07-01","2017-08-01","2017-09-01","2017-10-01","2017-11-01","2017-12-01","2018-01-01","2018-02-01","2018-03-01","2018-04-01","2018-05-01","2018-06-01","2018-07-01","2018-08-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01", "2016-12-31","2017-01-01","2017-02-01","2017-03-01","2017-04-01","2017-05-01","2017-06-01","2017-07-01","2017-08-01", "2017-09-01","2017-10-01","2017-11-01","2017-12-01","2018-01-01","2018-02-01","2018-03-01","2018-04-01","2018-05-01","2018-06-01","2018-07-01","2018-08-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01","2019-01-01","2019-02-01","2019-03-01","2019-04-01","2019-05-01","2019-06-01","2019-07-01","2019-08-01","2019-09-01","2019-10-01","2019-11-01","2019-12-01", "2020-01-01", "2020-02-01", "2020-03-01")),
                rev = c(101.25, 102.25, 103.50, 103.75, 104.15, 104.25, 104.3, 105.00, 105.20, 105.60, 106.00, 106.10, 106.50, 101.50, 100.30, 107.50, 108.30, 108.45, 109.10, 110.10, 112.15, 112.45, 114.65, 115.00, 116.00, 116.50, 117.25, 117.85, 119.25, 119.95, 120.20, 121.50, 122.30, 122.40, 123.25, 123.75, 124.00, 101.25, 102.25, 103.50, 103.75, 104.15, 104.25, 104.3, 105.00, 105.20, 105.60, 106.00, 106.10, 106.50, 101.50, 100.30, 107.50, 108.30, 108.45, 109.10, 110.10, 112.15, 112.45, 114.65, 115.00, 116.00, 116.50, 117.25, 117.85, 119.25, 119.95, 120.20, 121.50, 122.30, 122.40, 123.25, 123.75, 124.00, 124.10, 125.35, 125.45))

Now we set up a data.frame to store the predictions - though this is not correct in theory ("never grow a vector") and there are better solutions BUT it would make it more complicated and not help the understanding of the implementation:

# empty data.frame to fill in predictions
predictions_df <- data.frame(c_id=character(),
                             partner=character(),
                             rev_month = character(),
                             rev=double())

Now we build a vector of unique partners to loop over:

# unique partners
partners <- unique(x$partner_id)

Lets call the libraries we need for this exercise:

library(xts)
library(dplyr)
library(forecast)

The main part is the loop itself:

# loop to build predictions and store them
for (i in 1:length(partners)){

  partner <- partners[i] # get specific partner
  x1 <- x[x$partner_id == partner, ] # get data for specific partner
  x1_t <- x1[x1$c_id == "none", c(3,4)] # training data
  x1_f <- x1[x1$c_id != "none", c(3,4)] # forecast data
  c_id <- x1[x1$c_id != "none", 1] # complementary data

  # convert training data to time-series object
  x1_t_ts <- xts(x1_t[,-1], order.by=as.Date(x1_t[,1], "%Y/%m/%d"))
  # run auto arima on the time series
  tm <- forecast::auto.arima(x1_t_ts)
  # forecast the number of future steps (rows for to predict data)
  fc <- forecast::forecast(tm, nrow(x1_f))

  predictions_df <- rbind(predictions_df, data.frame(c_id, partner, rev_month = as.character(x1_f$rev_month), rev = as.double(fc$mean)))

}

finally let us have a look at the results:

predictions_df

    c_id partner  rev_month      rev
1  c-100     1A9 2016-12-01 106.5409
2  c-100     1A9 2017-01-01 106.9818
3  c-100     1A9 2017-02-01 107.4227
4  c-100     1A9 2017-03-01 107.8636
5  c-100     1A9 2017-04-01 108.3045
6  c-100     1A9 2017-05-01 108.7455
7  c-100     1A9 2017-06-01 109.1864
8  c-100     1A9 2017-07-01 109.6273
9  c-100     1A9 2017-08-01 110.0682
10 c-100     1A9 2017-09-01 110.5091
11 c-100     1A9 2017-10-01 110.9500
12 c-100     1A9 2017-11-01 111.3909
13 c-101     1A9 2017-12-01 111.8318
14 c-101     1A9 2018-01-01 112.2727
15 c-101     1A9 2018-02-01 112.7136
16 c-101     1A9 2018-03-01 113.1545
17 c-101     1A9 2018-04-01 113.5955
18 c-101     1A9 2018-05-01 114.0364
19 c-101     1A9 2018-06-01 114.4773
20 c-101     1A9 2018-07-01 114.9182
21 c-101     1A9 2018-08-01 115.3591
22 c-101     1A9 2018-09-01 115.8000
23 c-101     1A9 2018-10-01 116.2409
24 c-101     1A9 2018-11-01 116.6818
25 c-101     1A9 2018-12-01 117.1227
26 c-110     1B9 2018-01-01 106.9375
27 c-110     1B9 2018-02-01 107.3750
28 c-110     1B9 2018-03-01 107.8125
29 c-110     1B9 2018-04-01 108.2500
30 c-110     1B9 2018-05-01 108.6875
31 c-110     1B9 2018-06-01 109.1250
32 c-110     1B9 2018-07-01 109.5625
33 c-110     1B9 2018-08-01 110.0000
34 c-110     1B9 2018-09-01 110.4375
35 c-110     1B9 2018-10-01 110.8750
36 c-110     1B9 2018-11-01 111.3125
37 c-110     1B9 2018-12-01 111.7500
38 c-111     1B9 2019-01-01 112.1875
39 c-111     1B9 2019-02-01 112.6250
40 c-111     1B9 2019-03-01 113.0625
41 c-111     1B9 2019-04-01 113.5000
42 c-111     1B9 2019-05-01 113.9375
43 c-111     1B9 2019-06-01 114.3750
44 c-111     1B9 2019-07-01 114.8125
45 c-111     1B9 2019-08-01 115.2500
46 c-111     1B9 2019-09-01 115.6875
47 c-111     1B9 2019-10-01 116.1250
48 c-111     1B9 2019-11-01 116.5625
49 c-111     1B9 2019-12-01 117.0000
50 c-111     1B9 2020-01-01 117.4375
51 c-111     1B9 2020-02-01 117.8750
52 c-111     1B9 2020-03-01 118.3125

If you like to get the confidence intervals, etc. please deconstruct the loop (run just the inner part with "i <- 1") and unterstand what is going on and what the returning values are. Then it should be no issue to use the shemata I have supplied to get what you need.

DPH
  • 4,244
  • 1
  • 8
  • 18
  • @DPM. This is very helpful. Thank you. For some reason I am having an issue with the – bbal20 Nov 17 '20 at 17:18
  • @bbal20 what is the issue (guess you comment got cut somehow) – DPH Nov 17 '20 at 17:22
  • @DPM. This is very helpful. Thank you. For some reason I am having an issue with the 'partners <- unique(x$partner_id) and x1 <- x[x$partner_id == partner, ] code snippets. I completely follow what it should be doing but for some reason it isn't returning all of the rows from my dataframe. It is only returning a subset and I get the error 'longer object length is not a multiple of shorter object length'. – bbal20 Nov 17 '20 at 17:25
  • @DPM x1 <- df2[df2$parent_id %in% partners, ] worked instead of x1 <- df2[df2$parent_id == partners, ]. I have no idea why – bbal20 Nov 17 '20 at 17:45
  • @DPM I am getting the error 'Error in as.Date.default(x, ...) : do not know how to convert 'x' to class “Date”' when running x1_t_ts <- xts(x1_t[,-1], order.by=as.Date(x1_t[,1], "%Y/%m/%d")) snippet. I'll see if I can't figure it out. Is it anything obvious to you that sticks out? – bbal20 Nov 17 '20 at 17:54
  • @bbal20 frist about the partners issue: you must use "partner" not "partners" as the frist one is a atomic vector (one item vector) and the second is a vector of all unique partners. Therefore the %in% works but the == does not – DPH Nov 17 '20 at 17:56
  • @bbal20 about the second issue: I do not know what format your actual date column is? maybe the sequence is not year with four digits (aka %Y) month with one ore two digits (aka %m) and day with one or two digits (aka %d) - meaning you would have to adapt that if the original data has a different sequence. Also the separator maybe not "/" so you would have to change that also. Here are some different ways to convert data.frames into timeseries objects: https://stackoverflow.com/questions/29046311/how-to-convert-dataframe-into-time-series – DPH Nov 17 '20 at 18:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/224689/discussion-between-bbal20-and-dph). – bbal20 Nov 17 '20 at 18:13