How to verify that model output & observed data distribution are similar?

Question

I am looking for advice on how to determine whether the distribution of my model's output is similar to the distribution of the observed dataset and, if so, how similar.

Basically, I have a GBM model with mean reversion that gives seemingly good results when I compare its output distribution to the observed data. You can see their PDFs side by side in the attached picture.

PDF of observed and model data

Both datasets are huge (~6 million data points), and I am starting to suspect that this is part of the problem...

I am looking for a way to verify that the two datasets' distributions are similar. I tried the two-sample Kolmogorov-Smirnov test and the two-sample t-test, but both of them rejected the null hypothesis every time, even at different significance levels (alpha). In some threads I've read that these tests are unreliable when applied to huge datasets, but there was no consensus on this.

I am currently using Matlab, but I am open to other tools if necessary.

Any help would be appreciated! I am primarily looking for a hypothesis test for verification, but if you have a different idea, don't hold it back!

Maybe try plotting the residuals (Residual = Observed value - Predicted value)... or, in your case, (Residual = Observed value - Simulated value) — a11, Mar 29 '19 at 01:59
With such large sample sizes, you will generally "always" reject the null. With that much data, the test has enough power to detect even a tiny discrepancy between the null and the alternative. This is part of the reason software that does distribution fitting will only use the first ~100,000 data points if given more. You said you want to see if the "datasets are similar", but the statistical test you're asking for checks whether they are equal (to abuse the vocab a bit). — SecretAgentMan, Mar 29 '19 at 12:54
Perhaps [CrossValidated](https://stats.stackexchange.com/) will know of statistical tests or methods appropriate for such a large sample size. My question is: do you really need that for your application? Would a visual comparison of the estimated densities plus a side-by-side comparison of various metrics (first four moments, etc.) suffice? — SecretAgentMan, Mar 29 '19 at 12:56
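
One way to act on the comments above is to report how far apart the two distributions are, instead of only asking whether a test rejects. The sketch below is a minimal illustration in plain Python (standard library only), not a definitive method: it computes the two-sample Kolmogorov-Smirnov statistic D (the largest gap between the two empirical CDFs), the usual large-sample 5% rejection threshold for D, and the first four moments of each sample. The function names, the synthetic Gaussian stand-in data, and the sample size of 20,000 are all assumptions made for illustration; they do not come from the question.

```python
import bisect
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic D: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    for x in a_sorted + b_sorted:
        # Fraction of each sample that is <= x (right-continuous ECDF).
        fa = bisect.bisect_right(a_sorted, x) / n
        fb = bisect.bisect_right(b_sorted, x) / m
        d = max(d, abs(fa - fb))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the D value above which the KS test rejects."""
    c_alpha = math.sqrt(-math.log(alpha / 2.0) / 2.0)  # about 1.358 for alpha = 0.05
    return c_alpha * math.sqrt((n + m) / (n * m))


def four_moments(x):
    """Mean, standard deviation, skewness and kurtosis (population formulas, divide by n)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    std = math.sqrt(sum(d * d for d in dev) / n)
    skew = sum(d ** 3 for d in dev) / n / std ** 3
    kurt = sum(d ** 4 for d in dev) / n / std ** 4  # equals 3 for a normal distribution
    return {"mean": mean, "std": std, "skewness": skew, "kurtosis": kurt}


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical stand-ins for the real data: two nearly identical Gaussians.
    observed = [random.gauss(0.00, 1.0) for _ in range(20_000)]
    simulated = [random.gauss(0.02, 1.0) for _ in range(20_000)]

    print("KS distance D:", ks_statistic(observed, simulated))
    print("5% threshold at n = m = 20,000:", ks_critical_value(20_000, 20_000))
    print("5% threshold at n = m = 6,000,000:", ks_critical_value(6_000_000, 6_000_000))
    print("observed :", four_moments(observed))
    print("simulated:", four_moments(simulated))
```

With 6 million points per sample, the approximate 5% threshold for D is below 0.001, so the test rejects even when the two empirical CDFs never differ by more than a fraction of a percent. Reading D itself as a distance, alongside the moment comparison, gives a "how similar" number; what value counts as similar enough depends on the application and is not something the test can decide.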
