9

I have some sample data and I want to find the best-fit distribution for it. I have found a couple of links suggesting that I can import the distributions from scipy.stats, but I do not know the type of the data beforehand. I want something similar to allfitdist() in MATLAB, which tries to fit the data to around 20 distributions and returns the best fit.

Link for allfitdist(): http://www.mathworks.in/matlabcentral/fileexchange/34943-fit-all-valid-parametric-probability-distributions-to-data

Any help is greatly appreciated. Thanks.

tmthydvnprt
mvsrs
  • Can you show what your data looks like, and what you tried in order to fit one distribution to the data? Just to know how far along you are in implementing it, and where it fails. – usethedeathstar Feb 07 '14 at 09:24
  • The sample data is given by the user and will not look the same in all cases. I will upload the histogram image of the sample data. I tried fitting the data to a normal distribution and plotting the curve to see whether it follows the trend of the sample data, but I was not successful: the curve did not appear in the plot. The part of the code I used is shown below. My other main question is: even after plotting the normal distribution curve, how would I know that it is the best fit? Code used: `plt.plot(da, stats.norm.pdf(da, *stats.norm.fit(datas1, scale=02, loc=0)))` `plt.hist(datas1,1000,color='b',ec='b',fc='b')` – mvsrs Feb 07 '14 at 17:55
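One possible reason the curve seemed to be missing in the comment above: a histogram of raw counts and a pdf live on very different scales, so the pdf can end up as a flat line near zero. This is only a guess, since `da` and `datas1` are not shown. A minimal sketch with synthetic data (the names `samples`, `centers`, `density` and `max_gap` are illustrative) that compares a fitted normal pdf with a density-normalized histogram, without plotting:

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in for the user's data (assumption: the real data is not shown).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=2.0, size=20000)

# Fit a normal distribution; for st.norm, fit() returns (loc, scale).
loc, scale = st.norm.fit(samples)

# density=True rescales the bar heights so the histogram integrates to 1,
# which puts it on the same scale as a pdf.
density, edges = np.histogram(samples, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
fitted = st.norm.pdf(centers, loc=loc, scale=scale)

# Raw counts, by contrast, are far larger than any pdf value.
counts, _ = np.histogram(samples, bins=30)

max_gap = np.max(np.abs(density - fitted))
print("largest gap between histogram density and pdf:", max_gap)
print("largest raw count:", counts.max(), "largest pdf value:", fitted.max())
```

With matplotlib, the corresponding step would be to pass `density=True` to `plt.hist` (`normed=True` in old versions) and to evaluate the pdf on a sorted grid such as `np.linspace(samples.min(), samples.max(), 200)`.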

1 Answer

18

You can build a list of candidate distributions from scipy.stats, fit each one to the data, and compare the fits. An example with two distributions and random data:

import numpy as np
import scipy.stats as st

# Uniform random data on [0, 1), used only as a placeholder sample
data = np.random.random(10000)
distributions = [st.laplace, st.norm]
nlls = []

for distribution in distributions:
    # Maximum-likelihood estimates of the parameters
    pars = distribution.fit(data)
    # Negative log-likelihood at the fitted parameters (lower is better)
    nll = distribution.nnlf(pars, data)
    nlls.append(nll)

results = [(distribution.name, nll) for distribution, nll in zip(distributions, nlls)]
# Pick the candidate with the smallest negative log-likelihood
best_fit = min(results, key=lambda d: d[1])
print('Best fit reached using {}, negative log-likelihood: {}'.format(best_fit[0], best_fit[1]))
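The same loop extends to a longer shortlist. A minimal sketch, assuming a hand-picked list of candidates (the helper `rank_by_nll` and the shortlist are illustrative, not part of scipy), that also reports the AIC so that distributions with more parameters are not favored automatically:

```python
import numpy as np
import scipy.stats as st


def rank_by_nll(data, candidates):
    """Fit each candidate by maximum likelihood and sort by negative log-likelihood."""
    rows = []
    for dist in candidates:
        params = dist.fit(data)          # shape parameters (if any), then loc and scale
        nll = dist.nnlf(params, data)    # negative log-likelihood at the fitted parameters
        aic = 2 * len(params) + 2 * nll  # AIC = 2k - 2 ln L, with ln L = -nll
        rows.append((dist.name, nll, aic))
    return sorted(rows, key=lambda row: row[1])


# Synthetic normal data, so the expected winner is known in advance.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=5000)
candidates = [st.norm, st.laplace, st.uniform, st.expon]  # illustrative shortlist

ranking = rank_by_nll(data, candidates)
for name, nll, aic in ranking:
    print("{:<8} NLL = {:10.2f}   AIC = {:10.2f}".format(name, nll, aic))
```

Lower values are better for both columns. Adding more entries from `scipy.stats` to `candidates` is what brings this closer to the allfitdist() behavior, at the cost of slower and occasionally unstable fits.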
Martin
  • Thanks, Martin, for your help. I am new to Python and scipy, so please bear with my ignorance. I ran this piece of code after changing the random data to `data=np.random.normal(loc=0.0, scale=1.0, size=500)`, but I got this error: `ImportError: No module named stats`. I have installed scipy. Do I have to do anything else to make this code work? – mvsrs Feb 11 '14 at 15:03
  • `scipy.stats` has been part of scipy since version 0.7 (2009). Could you check which version of scipy you have and whether you import `scipy.stats`? Try: `import scipy; print(scipy.__version__)` – Martin Feb 11 '14 at 15:47
  • I re-installed scipy and it worked. The version I installed is 0.11. Thanks! – mvsrs Feb 11 '14 at 17:10
  • Does my data have to lie between 0 and 1 to compute the best fit? – mvsrs Mar 25 '14 at 12:53
  • If you are fitting a probability distribution, then the answer is no (which is the case in the example). If it is a cumulative distribution, then it had better lie in the interval [0, 1]. – Martin Mar 25 '14 at 13:00
  • `distribution.nnlf` can give somewhat inaccurate results. For instance, sampling `X_samples = stats.chi2(df=5).rvs(500)` and fitting `X_samples` to different distributions, including chi2, results in chi2 being the best fit only once in 5-7 code runs. – Julia Jun 26 '17 at 13:21
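One way to cross-check a likelihood-based ranking is a goodness-of-fit statistic. The sketch below swaps in the Kolmogorov-Smirnov statistic from `scipy.stats.kstest`; this is a different criterion from the one used in the answer, and the helper `rank_by_ks` and the shortlist are illustrative. Because the parameters are estimated from the same sample, the p-values are optimistic and are best read as a relative ranking only.

```python
import numpy as np
import scipy.stats as st


def rank_by_ks(data, candidates):
    """Fit each candidate, then sort by the Kolmogorov-Smirnov statistic (smaller is closer)."""
    rows = []
    for dist in candidates:
        try:
            params = dist.fit(data)
            # Compare the empirical CDF with the fitted CDF.
            result = st.kstest(data, dist.cdf, args=params)
        except Exception:
            # Numerical fits can fail for some distribution/data pairs; skip those.
            continue
        rows.append((dist.name, result.statistic, result.pvalue))
    return sorted(rows, key=lambda row: row[1])


# Same setup as in the comment above: 500 draws from chi2 with df=5.
X_samples = st.chi2(df=5).rvs(500, random_state=0)
candidates = [st.chi2, st.gamma, st.norm, st.expon]  # illustrative shortlist

ks_ranking = rank_by_ks(X_samples, candidates)
for name, stat, pvalue in ks_ranking:
    print("{:<6} KS = {:.4f}   p = {:.4f}".format(name, stat, pvalue))
```

Since chi2 with fixed scale is a special case of gamma, the two can fit almost equally well here, so a near tie between them is not necessarily a sign that either criterion is wrong.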