Normalized histogram in MATLAB incorrect?

Question

I have the following set of data:

X=[4.692
   6.328
   4.677
   6.836
   5.032
   5.269
   5.732
   5.083
   4.772
   4.659
   4.564
   5.627
   4.959
   4.631
   6.407
   4.747
   4.920
   4.771
   5.308
   5.200
   5.242
   4.738
   4.758
   4.725
   4.808
   4.618
   4.638
   7.829
   7.702
   4.659]; % Sample set

I fitted a Pareto distribution to this data by maximum likelihood and obtained the following graph:

The following code draws the histogram:

[N,edges,bin] = histcounts(X,'BinMethod','auto');
bin_middles=mean([edges(1:end-1);edges(2:end)]);
f_X_sample=N/trapz(bin_middles,N);
bar(bin_middles,f_X_sample,1);

Am I doing this right? I checked 100 times and the Pareto distribution is indeed optimal, but it seems awfully different from the histogram. Is there an error that may be causing this? Thank you!

Try increasing manually the number of bins, instead of using the `auto` flag — tashuhka, Oct 29 '15 at 11:43
Ok, will try this when I get to a computer. Is your suggestion not "artificial" a little bit, though? — space_voyager, Oct 29 '15 at 11:45

score 1 · Accepted Answer · edited May 23 '17 at 12:01

I would agree with @tashuhka's comment that you need to think about how you're binning your data.

Imagine the extreme case where you lump everything together into one bin and then try to compare that single bar to a distribution. Your PDF would look nothing like your single square bar. Split into two bins, and the fit still sucks, but at least one bar is (probably) a little bigger than the other, and so on as you add bins. At the other extreme, every data point has its own bar, and the bar graph is nothing but a random forest of bars with a count of one each.
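To see both extremes on the sample from the question, here is a minimal sketch. The bin counts 1, 2, 8 and 200 are arbitrary illustrative choices, not recommendations, and the variable names are my own. With 200 bins, most occupied bins should hold a single point.

```matlab
% Sample from the question, written as a row vector.
X = [4.692 6.328 4.677 6.836 5.032 5.269 5.732 5.083 4.772 4.659 ...
     4.564 5.627 4.959 4.631 6.407 4.747 4.920 4.771 5.308 5.200 ...
     5.242 4.738 4.758 4.725 4.808 4.618 4.638 7.829 7.702 4.659];

nbins_list = [1 2 8 200];          % illustrative choices only
areas = zeros(size(nbins_list));   % total bar area for each choice

figure;
for k = 1:numel(nbins_list)
    % 'Normalization','pdf' scales the counts so the bar areas sum to 1.
    [p, edges] = histcounts(X, nbins_list(k), 'Normalization', 'pdf');
    centers = (edges(1:end-1) + edges(2:end)) / 2;
    areas(k) = sum(p .* diff(edges));

    subplot(2, 2, k);
    bar(centers, p, 1);
    title(sprintf('%d bins', nbins_list(k)));
end
```

Every panel integrates to 1, yet the shapes differ a lot, so the visual agreement with a fitted PDF depends heavily on the binning.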

There are a number of different strategies for choosing an "optimal" bin width, such as Sturges' formula, Scott's rule, and the Freedman-Diaconis rule. They all try to balance using few bins against faithfully representing the underlying PDF.
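`histcounts` has several such rules built in through its `'BinMethod'` option, so one quick way to compare them is to loop over the rule names and see how many bins each one picks for this sample. This is only a sketch: the variable names are mine, and which rule suits heavy-tailed data like this is a judgment call.

```matlab
% Sample from the question, written as a row vector.
X = [4.692 6.328 4.677 6.836 5.032 5.269 5.732 5.083 4.772 4.659 ...
     4.564 5.627 4.959 4.631 6.407 4.747 4.920 4.771 5.308 5.200 ...
     5.242 4.738 4.758 4.725 4.808 4.618 4.638 7.829 7.702 4.659];

rules = {'sturges', 'sqrt', 'scott', 'fd'};   % rules accepted by 'BinMethod'
nbins = zeros(1, numel(rules));

for k = 1:numel(rules)
    [N, edges] = histcounts(X, 'BinMethod', rules{k});
    nbins(k) = numel(N);
    % Report the number of bins and the (uniform) bin width for each rule.
    fprintf('%-8s -> %2d bins, width %.3f\n', ...
            rules{k}, nbins(k), edges(2) - edges(1));
end
```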

Finally, note that you only have 30 points here, so your other problem may be that you just haven't collected enough data to really nail down the underlying PDF.
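Separately from the binning, it may be worth checking the normalization itself. `trapz(bin_middles,N)` integrates straight lines drawn between the bin centers, which is not the same as the total area of the bars: with equal-width bins it leaves out half of the first and last bar. For a sample where most points sit in the first bin, that could inflate the bars noticeably. The sketch below compares the question's normalization with a bar-area normalization and with the built-in `'Normalization','pdf'` option. The variable names are my own.

```matlab
% Sample from the question, written as a row vector.
X = [4.692 6.328 4.677 6.836 5.032 5.269 5.732 5.083 4.772 4.659 ...
     4.564 5.627 4.959 4.631 6.407 4.747 4.920 4.771 5.308 5.200 ...
     5.242 4.738 4.758 4.725 4.808 4.618 4.638 7.829 7.702 4.659];

[N, edges] = histcounts(X, 'BinMethod', 'auto');
widths  = diff(edges);
centers = (edges(1:end-1) + edges(2:end)) / 2;

% Normalization from the question: trapezoidal rule over the bin centers.
f_trapz = N / trapz(centers, N);

% Bar-area normalization: bar i ends up with area N(i)/sum(N).
f_area = N ./ (sum(N) * widths);

% Built-in equivalent, reusing the same edges.
f_pdf = histcounts(X, edges, 'Normalization', 'pdf');

% Total area under the bars for each normalization.
area_trapz = sum(f_trapz .* widths);
area_area  = sum(f_area  .* widths);
fprintf('bar area, trapz normalization: %.4f\n', area_trapz);
fprintf('bar area, area normalization:  %.4f\n', area_area);
```

If the first number comes out above 1, the histogram is drawn taller than a true density, and it will sit above any properly normalized PDF regardless of how good the fit is.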
