My data frame looks like this. My goal is to predict event_id 3 based on the data for event_id 1 and event_id 2.

ds tickets_sold y event_id
3/12/19 90  90  1
3/13/19 40  130 1
3/14/19 13  143 1
3/15/19 8   151 1
3/16/19 13  164 1
3/17/19 14  178 1
3/20/19 10  188 1
3/20/19 15  203 1
3/20/19 13  216 1
3/21/19 6   222 1
3/22/19 11  233 1
3/23/19 12  245 1
3/12/19 30  30  2
3/13/19 23  53  2
3/14/19 43  96  2
3/15/19 24  120 2
3/16/19 3   123 2
3/17/19 5   128 2
3/20/19 3   131 2
3/20/19 25  156 2
3/20/19 64  220 2
3/21/19 6   226 2
3/22/19 4   230 2
3/23/19 63  293 2

I want to predict sales for the next 10 days of that data:

ds  tickets_sold y event_id
3/24/19 20  20  3
3/25/19 30  50  3
3/26/19 20  70  3
3/27/19 12  82  3
3/28/19 12  94  3
3/29/19 12  106 3
3/30/19 12  118 3

This is my model so far. However, it does not tell Prophet that these are two separate events. It would be useful to fit on the data from all events, since they belong to the same organizer and together provide more information than a single event. Is that kind of fitting possible with Prophet?

import pandas as pd
from prophet import Prophet  # older releases: from fbprophet import Prophet

# Load data
df = pd.read_csv('event_data_prophet.csv')
df.drop(columns=['tickets_sold'], inplace=True)
df.head()

# cap must be specified for every row in the dataframe, but it does not have to be constant.
# If the market size is growing, then cap can be an increasing sequence.
# Note: cap is only used when growth='logistic'; it is ignored for growth='linear'.
df['cap'] = 500

# growth: String 'linear' or 'logistic' to specify a linear or logistic trend.
m = Prophet(growth='linear')
m.fit(df)

# periods is the number of days to forecast into the future
future = m.make_future_dataframe(periods=20)
future['cap'] = 500
future.tail()

forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

fig1 = m.plot(forecast)
Joey Coder

1 Answer

The start dates of the events seem to cause peaks. You can model this with holidays by setting the start date of each event as a holiday. This informs Prophet about the events (and their peaks). I noticed that events 1 and 2 overlap, and you have several options for dealing with this. Ask yourself how much predictive value each event has for event 3. The main issue is that you don't have much data. If the events have equal value, you could shift the dates of one event, for example to 11 days earlier. If they have unequal value, you could drop one event.

events = pd.DataFrame({
  'holiday': 'events',
  'ds': pd.to_datetime(['2019-03-24', '2019-03-12', '2019-03-01']),
  'lower_window': 0,
  'upper_window': 1,
})

m = Prophet(growth='linear', holidays=events)
m.fit(df)

Also, I noticed you forecast on the cumulative sum (cumsum). I think your events are stationary, so Prophet would probably benefit from forecasting the daily ticket sales rather than the cumsum.
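
As a rough sketch of the daily-sales idea: if only the cumulative `y` column is available, the daily values can be recovered by differencing within each event. The small frame below is illustrative data using the first rows of the question's table, and `daily_df` is just a name picked for this example. In the question's own data the `tickets_sold` column already holds these values, so renaming it to `y` would do the same job.

```python
import pandas as pd

# Illustrative frame with the same columns as the question's data.
df = pd.DataFrame({
    "ds": pd.to_datetime(["2019-03-12", "2019-03-13", "2019-03-14"] * 2),
    "y": [90, 130, 143, 30, 53, 96],  # cumulative sales per event
    "event_id": [1, 1, 1, 2, 2, 2],
})

# Difference within each event. The first row of an event has no
# predecessor, so its daily value is the cumulative value itself.
daily = df.groupby("event_id")["y"].diff().fillna(df["y"]).astype(int)

# Same frame, but with y holding daily sales instead of the cumsum.
daily_df = df.assign(y=daily)
print(daily_df)
```

Whether daily sales or the cumsum forecasts better is something to check by backtesting, not something this sketch decides.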

tvgriek
  • Hi tvgriek, an interesting idea with the holidays. I will try that. The overlap will be even "worse", as I have dozens of events that happen at the same time. The data I added here was just example data. I think in general it is good to have more data, as these events are from the same "tour" and therefore highly related to the sales of the predicted event. I think I should still include these? As for the last part about the cumsum: do you mean I shouldn't sum up the sales day by day, but rather give each day's sales individually to Prophet? – Joey Coder May 09 '19 at 15:39
  • Yes, you can try to chain them. The day of the week might be important, so try to keep it. You can also use the holidays to model other factors, like 'marketing campaign went live at location X'. In the end it is about evaluating your model. You can use Prophet's built-in `cross_validation` function to backtest your model and evaluate the decisions you make. You can then also compare the cumsum against daily sales. There is no silver bullet. Just try different things. – tvgriek May 09 '19 at 17:07
  • I have 10 different live events, e.g. concerts. They have different maximum capacities and therefore different marketing budgets. However, they show similar spikes on ticket releases, 2-3 weeks before the event, etc. Looking into your idea, I wonder whether the following is a valid approach. The 10 events all happened in the last two years, so the sales, and therefore `y`, overlap. My approach would be to keep the month and day but give each event a different year: Event 1 (Sweden): 2018, Event 2 (Poland): 2017, etc. Does that make sense, given that the cap differs per event? – Joey Coder Jun 24 '19 at 07:12
  • I was running short on characters, but I wanted to add that I tested different models over the past few weeks and so far Prophet still gives me the best results. Now I am trying to optimize it based on your suggestions. – Joey Coder Jun 24 '19 at 07:14
  • To deal with the different capacities, you can scale each event, for example with a min-max scaler, before you chain them. I suspect sales are highly correlated with marketing effort and location, though. You might want to add these as extra regressors, maybe binned by budget and by continent or part of a continent. – tvgriek Jun 24 '19 at 07:20
  • A min-max scaler is a very good idea. I will try to implement that. – Joey Coder Jun 24 '19 at 07:26
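
The scale-then-chain idea from the comments could be sketched roughly as below. This is a minimal sketch under assumptions, not a tested recipe: the frame is illustrative data, the min-max scaling is done by hand in pandas where a library scaler would also work, and the names `min_max`, `y_scaled`, `parts` and `chained` are made up for this example. Instead of giving each event a different year, it shifts each event by whole weeks past the previous one, which is one way to remove the overlap while keeping the day of the week.

```python
import pandas as pd

# Illustrative daily sales for two overlapping events.
df = pd.DataFrame({
    "ds": pd.to_datetime(["2019-03-12", "2019-03-13", "2019-03-14"] * 2),
    "y": [90, 40, 13, 30, 23, 43],
    "event_id": [1, 1, 1, 2, 2, 2],
})

def min_max(s):
    """Scale a series to [0, 1]; a constant series becomes all zeros."""
    span = s.max() - s.min()
    return (s - s.min()) / span if span else s * 0.0

# Scale each event separately so different capacities become comparable.
df["y_scaled"] = df.groupby("event_id")["y"].transform(min_max)

# Chain the events: move each one past the end of the previous one,
# shifting by whole weeks so every row keeps its day of the week.
parts = []
next_start = None
for _, g in df.sort_values(["event_id", "ds"]).groupby("event_id"):
    g = g.copy()
    if next_start is not None:
        gap_days = (next_start - g["ds"].min()).days
        weeks = max(-(-gap_days // 7), 0)  # ceiling division
        g["ds"] = g["ds"] + pd.Timedelta(weeks=weeks)
    next_start = g["ds"].max() + pd.Timedelta(days=1)
    parts.append(g)

chained = pd.concat(parts, ignore_index=True)
print(chained)
```

Forecasts made on `y_scaled` are in scaled units, so they would need to be mapped back using the minimum and maximum assumed for the event being predicted.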