I want to fit some data to a Pareto distribution using the scipy.stats library. I am not sure whether the issue is numerical, so for completeness: I have values of the dependent variable (let's call them 'pushes') measured against the independent variable ('minutes'), starting at a few thousand minutes and sampled every ten minutes thereafter (except for a few points that were removed during data cleaning).

e.g.

2780.0 362.0

2800.0 376.0

2810.0 393.0 ...

The best info I can find says something like

from scipy.stats import pareto
result = pareto.fit(data)

but I have no idea how the data should be formatted in this case. I've tried the following, but both attempts result in errors:

result = pareto.fit(zip(minutes, pushes))
result = pareto.fit(pushes)

The error is usually

Warning: invalid value encountered in double_scalars

I would greatly appreciate some guidance. Thank you.

ali_m
  • The `pareto.fit()` method obtains an estimate of the parameters of the Pareto distribution that maximise the likelihood of observing some given set of samples. It therefore wants only a *single* input array, consisting of the samples to fit to (the other kwargs control the fitting process, e.g. by specifying initial values for the distribution parameters). From your question it seems as though you actually want to fit some relationship `f(minutes, pushes)`, which is not what `pareto.fit()` does. Could you clarify what you're trying to do here? – ali_m Apr 11 '15 at 19:57
  • @ali_m yes i am trying to fit some relationship `f(minutes, pushes)`. i believe that it is basically a regression problem, but just from observing the plot of the data, it looks like a power law and very similar to the pareto distribution. sorry if this doesn't make sense, i am quite new at this. – user3525685 Apr 11 '15 at 20:05
  • In which case you need to select some function that relates your dependent variable to your independent variable (i.e. `pushes = f(minutes)`), then find the function parameters that minimize the mean squared error between 'predicted' and 'actual' pushes for each given value of 'minutes'. You could use [`scipy.optimize.curve_fit`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) or [`scipy.optimize.minimize`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html). – ali_m Apr 11 '15 at 20:16

1 Answer

As I mentioned in the comments above, pareto.fit() is not what you're looking for.

The .fit() methods of the continuous distributions in scipy.stats estimate the distribution parameters by maximum likelihood, i.e. they find the parameter values that maximise the likelihood of observing a particular set of sample values. Therefore, pareto.fit() wants only a single array argument containing the samples you want to fit the distribution to. The other keyword arguments control various aspects of the fitting process, for example by specifying initial values for the distribution parameters.
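To make the expected input concrete, here is a minimal sketch of the kind of call pareto.fit() is designed for. The samples are synthetic (drawn with pareto.rvs), and the shape value, sample size and seed are made up for illustration. Passing floc=0 to hold the location fixed is my own assumption, not something your data requires.

```python
import numpy as np
from scipy.stats import pareto

# Synthetic 1-D samples from a Pareto distribution (illustrative values only)
true_b = 3.0
samples = pareto.rvs(true_b, loc=0.0, scale=1.0, size=5000, random_state=0)

# fit() takes ONE array of samples, not (x, y) pairs.
# floc=0 keeps the location fixed at zero (an assumption made for this sketch),
# so only the shape b and the scale are estimated.
b_hat, loc_hat, scale_hat = pareto.fit(samples, floc=0)

print("shape:", b_hat, "loc:", loc_hat, "scale:", scale_hat)
```

Note that there is no 'minutes' array anywhere in this call, which is why it does not match the regression problem you describe.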

What you're actually trying to do is to fit the relationship between some independent variable x and some dependent variable y, i.e.

y_fit = f(x, params)

What you need to do is:

  1. Choose some functional form for f. From your description, the plot of y vs x resembles the probability density function of a Pareto distribution, so either that or a decaying exponential might be appropriate.

  2. Find the set of params that minimize some measure of the difference between y and y_fit (usually the sum of squared differences). You could use scipy.optimize.curve_fit or scipy.optimize.minimize to do this.
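The two steps above can be sketched roughly as follows. This is only an illustration under assumptions: the power-law form `a * x**k`, the helper name `power_law`, and the synthetic 'minutes'/'pushes' arrays are all hypothetical stand-ins, so substitute your own data and whichever functional form you settle on. The initial guess comes from a straight-line fit in log-log space, which is a common trick for power laws rather than anything curve_fit requires.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, k):
    """Hypothetical model for step 1: y = a * x**k."""
    return a * np.power(x, k)

# Synthetic stand-in data (NOT the real measurements): one point every 10 minutes
rng = np.random.default_rng(0)
minutes = np.arange(2780.0, 4780.0, 10.0)
true_a, true_k = 0.05, 1.1
noise = 1.0 + 0.01 * rng.standard_normal(minutes.size)
pushes = power_law(minutes, true_a, true_k) * noise

# Initial guess: a power law is a straight line in log-log space
k0, log_a0 = np.polyfit(np.log(minutes), np.log(pushes), 1)

# Step 2: least-squares fit of the parameters (sum of squared differences)
params, cov = curve_fit(power_law, minutes, pushes, p0=(np.exp(log_a0), k0))
a_hat, k_hat = params

# Sum of squared residuals between measured and fitted values
residuals = pushes - power_law(minutes, a_hat, k_hat)
sse = float(np.sum(residuals ** 2))
print("a:", a_hat, "k:", k_hat, "SSE:", sse)
```

Here x and y are passed as two separate arrays of equal length, which is the main difference from pareto.fit(). If you prefer scipy.optimize.minimize, you would write the sum of squared residuals as the objective function yourself.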

ali_m