I have the following set of data:

X=[4.692
   6.328
   4.677
   6.836
   5.032
   5.269
   5.732
   5.083
   4.772
   4.659
   4.564
   5.627
   4.959
   4.631
   6.407
   4.747
   4.920
   4.771
   5.308
   5.200
   5.242
   4.738
   4.758
   4.725
   4.808
   4.618
   4.638
   7.829
   7.702
   4.659]; % Sample set

I fitted a Pareto distribution to this using the maximum likelihood method and I obtain the following graph:

[Figure: histogram of the sample with the fitted Pareto PDF]

Where the following bit of code is what draws the histogram:

[N,edges,bin] = histcounts(X,'BinMethod','auto'); % counts per bin and bin edges
bin_middles=mean([edges(1:end-1);edges(2:end)]); % bin centers
f_X_sample=N/trapz(bin_middles,N); % scale counts to approximate a density
bar(bin_middles,f_X_sample,1);

Am I doing this right? I checked 100 times and the Pareto distribution is indeed optimal, but it seems awfully different from the histogram. Is there an error that may be causing this? Thank you!

space_voyager

1 Answer

I would agree with @tashuhka's comment that you need to think about how you're binning your data.

Imagine the extreme case where you lump everything together into one bin, and then try to fit that single point to a distribution. Your PDF would look nothing like your single square bar. Split into two bins, and the fit still sucks, but at least one bar is (probably) a little bigger than the other, and so on. At the other extreme, every data point has its own bar and the bar graph is nothing but a scattered forest of bars, each with a count of one.
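To make the thought experiment above concrete, here is a minimal sketch that plots the question's sample with 1, 2, 6, and 30 bins. The bin counts are illustrative choices, not recommendations. The sketch also swaps in 'Normalization','pdf' in place of the trapz scaling from the question, so that the total bar area is exactly 1; that substitution is an assumption of this example, not something the question used.

```matlab
% Sample from the question, repeated so this block runs on its own.
X = [4.692 6.328 4.677 6.836 5.032 5.269 5.732 5.083 4.772 4.659 ...
     4.564 5.627 4.959 4.631 6.407 4.747 4.920 4.771 5.308 5.200 ...
     5.242 4.738 4.758 4.725 4.808 4.618 4.638 7.829 7.702 4.659];

nbins_list = [1 2 6 30];   % illustrative bin counts (assumption)

figure;
for k = 1:numel(nbins_list)
    % 'Normalization','pdf' divides counts by (number of points * bin width),
    % so the bars integrate to 1 and can be compared with a fitted PDF.
    [f, edges] = histcounts(X, nbins_list(k), 'Normalization', 'pdf');
    centers = edges(1:end-1) + diff(edges)/2;   % bin centers

    subplot(2, 2, k);
    bar(centers, f, 1);
    title(sprintf('%d bin(s)', nbins_list(k)));
    xlabel('x');
    ylabel('density');
end
```

With one bin the plot is a single rectangle, and with 30 bins most bars hold zero or one point, so neither extreme says much about the shape of the underlying density.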

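There are a number of different strategies for choosing an "optimal" bin size that uses as few bins as possible while still representing the underlying PDF well. Common rules include Sturges' rule, Scott's rule, and the Freedman-Diaconis rule, which histcounts exposes through 'BinMethod' as 'sturges', 'scott', and 'fd'.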
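As a rough way to see how much the binning rule matters for this sample, the sketch below runs histcounts with several of its built-in 'BinMethod' options and prints the resulting bin count and width. The variable name binMethods is just a label chosen for this example, and the set of rules compared is an arbitrary selection.

```matlab
% Sample from the question, repeated so this block runs on its own.
X = [4.692 6.328 4.677 6.836 5.032 5.269 5.732 5.083 4.772 4.659 ...
     4.564 5.627 4.959 4.631 6.407 4.747 4.920 4.771 5.308 5.200 ...
     5.242 4.738 4.758 4.725 4.808 4.618 4.638 7.829 7.702 4.659];

% Built-in binning rules accepted by histcounts (selection is arbitrary).
binMethods = {'auto', 'sturges', 'scott', 'fd', 'sqrt'};

for k = 1:numel(binMethods)
    [f, edges] = histcounts(X, 'BinMethod', binMethods{k}, ...
                            'Normalization', 'pdf');
    % Bins are uniform for these rules, so the first width describes all of them.
    fprintf('%-8s -> %2d bins, bin width %.3f\n', ...
            binMethods{k}, numel(f), edges(2) - edges(1));
end
```

If the rules disagree noticeably on a 30-point sample, that is a hint that the histogram shape is being driven by the binning choice as much as by the data.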

Finally, note that you only have 30 points here, so your other problem may be that you just haven't collected enough data to really nail down the underlying PDF.

craigim