I have two models, m1 and m2 (a reference model), for forecasting electricity prices at 6 time points each day. I measure model performance with the MSE over a test interval of 5 days after forecast initialization, and I compare the models using a skill score, 1 - MSE(m1)/MSE(m2), to see which one is better. I do this for a large number of test intervals.
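For concreteness, here is a minimal sketch of how the score for a single test interval could be computed (NumPy; the array names `obs`, `pred_m1`, `pred_m2` are just placeholders):

```python
import numpy as np

def interval_skill(obs, pred_m1, pred_m2):
    """Skill score 1 - MSE(m1)/MSE(m2) for one test interval.

    obs, pred_m1, pred_m2: arrays of shape (5, 6),
    i.e. 5 days x 6 time points per day.
    """
    mse1 = np.mean((obs - pred_m1) ** 2)
    mse2 = np.mean((obs - pred_m2) ** 2)
    return 1.0 - mse1 / mse2
```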
The observed prices, the forecasts, and the squared errors all exhibit some correlation (they are neither independent nor exchangeable).
If the skill score is high (close to 1), it is easy to say that m1 is an improvement on m2, but when the skill score is close to 0, e.g. 0.05, how can I be sure that m1 is an improvement?
One option is a Monte Carlo permutation test with the skill score as the test statistic, but that requires the data to be exchangeable. What are the alternatives for correlated time series data? And what if I sum up the results for each day and permute over those daily sums: is that a valid approach if the sums are not correlated? A sketch of what I have in mind is shown below.
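Roughly, the daily-sum variant would look like this (a sketch of my idea, not a validated test; the function name `perm_test_daily` and the per-day label-swap scheme are my assumptions about how such a test could be set up):

```python
import numpy as np

rng = np.random.default_rng(0)

def skill(se1, se2):
    # Skill score 1 - MSE(m1)/MSE(m2) computed from squared errors.
    # On daily sums this equals the score on the raw errors, since
    # every day contributes the same number of time points.
    return 1.0 - se1.mean() / se2.mean()

def perm_test_daily(se1_daily, se2_daily, n_perm=10_000):
    """Monte Carlo permutation test on daily-summed squared errors.

    se1_daily, se2_daily: arrays of shape (n_days,), each entry the
    sum of squared errors of one model over one day. Under H0 (the
    models are equally good) the pair of losses is exchangeable
    within each day, so we randomly swap the model labels per day.
    """
    observed = skill(se1_daily, se2_daily)
    count = 0
    for _ in range(n_perm):
        swap = rng.random(se1_daily.shape) < 0.5
        p1 = np.where(swap, se2_daily, se1_daily)
        p2 = np.where(swap, se1_daily, se2_daily)
        if skill(p1, p2) >= observed:
            count += 1
    # One-sided p-value with add-one correction.
    return observed, (count + 1) / (n_perm + 1)
```

The swap here is within days only, so it relies on the daily sums (or rather the daily loss differences) being effectively uncorrelated across days, which is exactly the assumption I am unsure about.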