
I need to solve a forecasting problem I have in my company. I am going to use the example of racing cars because it very closely matches the problem I am trying to solve:

Let's say I have data from hundreds of races, and the time taken per car to complete the race is gamma distributed. I guess this is my prior. (The assumption is based on the same cars racing on the same track over and over.)

In today's race all cars are required to use a slightly different fuel. I.e., all vehicles will be impacted equally, making them all slightly slower or faster, but the outcome is still expected to be a gamma distribution (shifted to the left or right).

All cars start the race together, i.e. this is not an independent dataset!

As each car crosses the finish line, I want to be able to estimate the new gamma distribution parameters. I.e., when the 5th, 6th and 7th cars cross the line slightly faster than the 5th, 6th and 7th cars did in previous races, we can predict that the race as a whole will be slightly faster.

What is the approach I should take?

We have a very similar real-world problem and I need to be able to forecast when 50%, 75% and 95% of the events will be completed. Some batches are slightly faster than others (we don't fully understand why) but the predicted outcome is fairly evident early in the batch. I need to write some code to automate the prediction.
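For concreteness, this is the kind of quantile computation I want to automate (a minimal sketch assuming Python/SciPy; the shape and scale values here are made up for illustration):

```python
from scipy import stats

# Hypothetical gamma parameters fitted from historical batches.
shape, scale = 5.0, 2.0  # illustrative values only
dist = stats.gamma(a=shape, scale=scale)

# Times by which 50%, 75% and 95% of events are expected to complete.
for q in (0.50, 0.75, 0.95):
    print(f"{q:.0%} of events done by t = {dist.ppf(q):.2f}")
```

The open question is how to update `shape` and `scale` as the first completions of the current batch come in.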

Many thanks for your help!

DrBorrow
  • This would greatly benefit from adding a simulated data set. Prose alone is far too ambiguous and inefficient at communicating technical problems. Also, without a language tag, it makes it very suspicious that this isn't merely a methodological question and should get pushed to CrossValidated. – merv Jan 26 '22 at 22:53

1 Answer


I actually think this is a math problem rather than a programming one, but problem solving is problem solving!

You just say it's gamma distributed, but I assume we're working with a probability density function whose horizontal axis is runtime, since that seems to make the most sense in this scenario.

I suspect, though I have no proof and I'm not a math guy, that if you don't know the actual gamma distribution then you won't be able to make a good prediction. The reason I think this is that even if you have many of the leftmost points, those points could still fit many different gamma distributions with very different reach on the runtime axis.

Therefore, I believe you must first predict the parameters of your gamma distribution based on your inputs.

Since you say this is a gamma distribution, use your historical data to estimate the shape and scale parameters of each race's probability density function. You may need to try a few variations of the gamma family to find which best fits your data.
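As a sketch of that fitting step (using simulated finish times as a stand-in for your real historical data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated historical finish times for one set of races;
# the "true" parameters here are made up for illustration.
true_shape, true_scale = 8.0, 1.5
times = rng.gamma(true_shape, true_scale, size=500)

# Maximum-likelihood fit; floc=0 pins the location so only
# shape and scale are estimated.
shape, loc, scale = stats.gamma.fit(times, floc=0)
print(f"fitted shape={shape:.2f}, scale={scale:.2f}")
```

Repeat this per historical dataset so each race (or batch) gets its own fitted shape and scale.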

With the inputs tied to those datasets, solve for the equations of how those inputs (number of race cars, fuel type, etc.) affect your probability density parameters (e.g. shape and scale) for each dataset.
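One simple way to do that is an ordinary least-squares fit of the fitted parameters against the inputs. The input columns and values below are hypothetical placeholders for whatever covariates you actually record per race:

```python
import numpy as np

# Hypothetical per-race inputs: (number_of_cars, fuel_grade).
X = np.array([[20, 1.0],
              [22, 1.1],
              [18, 0.9],
              [24, 1.2]], dtype=float)

# Shape parameter fitted from each of those races (made-up values).
shapes = np.array([7.8, 8.3, 7.4, 8.9])

# Least squares: shape ≈ b0 + b1*cars + b2*fuel_grade.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, shapes, rcond=None)

# Predict the shape parameter for a new race's inputs.
new_race = np.array([1.0, 21, 1.05])
predicted_shape = new_race @ coef
```

You would fit a second regression the same way for the scale parameter.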

If your equations don't agree well, it may be that your chosen function doesn't fit, and you might need to pick a different probability density function for your data.

If those equations do seem to agree with each other: average the constants of those equations, then use that averaged equation with your new inputs to predict the parameters of the current distribution. With the parameters of the current distribution you should be able to predict your total runtime, especially as you see more datapoints.
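Once you have predicted parameters, the runtime prediction itself is just a quantile lookup on the resulting distribution (parameter values here are illustrative):

```python
from scipy import stats

# Parameters predicted from the averaged input equation (hypothetical).
pred_shape, pred_scale = 8.1, 1.4

dist = stats.gamma(a=pred_shape, scale=pred_scale)

# "Total runtime" read as the time by which nearly all cars have finished.
mean_time = dist.mean()
t95 = dist.ppf(0.95)
print(f"mean finish time = {mean_time:.2f}, 95% done by {t95:.2f}")
```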

Again, not being a math guy, I'm not at all sure of this, but it seems reasonable from a problem-solving point of view.



EDIT:

Ah, I think I misread and made some faulty assumptions. So you already have the gamma distributions for each individual car's finish time? I was thinking of the distribution of all cars' finish times together.

So based on the difference in expected time of each car that has crossed the finish line, you want to get an idea of how much faster the rest of the cars will be since they're all using this different "fuel".

The greatest difficulty lies in the fact that, even seeing the first few cars' new times, you still may not know the growth/decay rate at which this "fuel" affects your finish time. Or do you know?

Perhaps, as more cars cross the finish line, you could plot the deltas from their expected finish times. You could then repeatedly calculate a best-fit curve for those points (maybe assuming quadratic is OK?), and use that curve to adjust the expected finish times of the remaining cars and determine a new longest expected finish time.
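Something roughly like this (all the expected and observed times below are made-up numbers; `numpy.polyfit` does the quadratic best fit):

```python
import numpy as np

# Expected finish times per car from historical data (hypothetical).
expected = np.array([10.2, 10.8, 11.1, 11.9, 12.4, 13.0, 13.8, 15.1])

# Observed times for the first four finishers with the new fuel.
observed = np.array([9.7, 10.2, 10.6, 11.2])
k = len(observed)

# Deltas between observed and expected times for the finishers so far.
deltas = observed - expected[:k]

# Quadratic best-fit curve through the deltas.
poly = np.poly1d(np.polyfit(np.arange(k), deltas, deg=2))

# Extrapolate the delta curve onto the cars still racing.
remaining_idx = np.arange(k, len(expected))
adjusted = expected[remaining_idx] + poly(remaining_idx)
new_longest = adjusted.max()
```

Re-run the fit each time another car finishes, so the adjustment tightens as more data arrives.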

xtratic
  • I don’t think that’s quite right. Using Bayes, we have a prior and we have the likelihood so we can calculate the posterior probability. However, because the inputs are not random nor independent I am really adjusting the posterior based on the first results. If I did this without some form of compensation then the posterior would always seem faster until all data is in. And yes, I could probably use some math but I figure there must be some genius out there who has designed the solution on pymc or similar – DrBorrow Jan 24 '22 at 21:47
  • But thanks for the feedback! (PS: we need to program the solution because we have a software package which forecasts the next steps and I want to adjust the rate of delivery based on the earliest results. I don’t have an opportunity to manually input the new prediction. The current software uses the historical gamma as a fixed input) – DrBorrow Jan 24 '22 at 21:50
  • @DrBorrow I've modified my answer to add some more thoughts. – xtratic Jan 25 '22 at 01:38