
I have the following data:

y = np.array([8.8,7.2,5.8,4.7,3.8,3.1,2.6,2.2,2.0,1.7,1.8,1.8,1.9,1.7,1.4,1.2,1.7,1.2,1.5])   
x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19])  

I wish to fit a distribution to this data.

I've tried using scipy and fitter, but the fitted distributions matched the data poorly.

I got results akin to this example.

  1. Why do the distributions in said example seem to be scaled below the true data?
  2. Using my data, how do I fit a reasonable distribution? Any worked examples would be greatly appreciated.
  • How did you get this data? What is it? It is not random samples from a univariate distribution, because you have `x` and `y`. (Or are the `x` values just sample numbers?) It is not samples from a PDF, because the integral over the whole x interval would be much larger than 1. It is not samples from a CDF (or even from a scaled CDF), because the y values decrease. – Warren Weckesser Sep 14 '22 at 20:55
  • Good points. (1) How did I get this data: It was gathered via a biological experiment. – Sam Huguet Sep 15 '22 at 12:49
  • (2) What is it? The x data represents particular events, specifically, the number of 'breaks' at a certain site in the genome. For a given site where x = 6, that site would have 6 recorded 'breaks'. The y data represents the number of times we see each x value. For y = 4 and x = 6, we'd see 4 different sites where the genome broke 6 times. – Sam Huguet Sep 15 '22 at 12:50
  • (3) Is it univariate? It is univariate. Where y varies, x values are essentially sample numbers (if I understand correctly). If it helps, I wish to fit a distribution to the values of y e.g. exponential or f etc. – Sam Huguet Sep 15 '22 at 12:50
  • (4) Is it a CDF? It is not a CDF. – Sam Huguet Sep 15 '22 at 12:50
  • (5) Is it a PDF? Ah, I see the issue. I want a PDF, however, your observation regarding the integral is sound - this was unclear/wrong on my part. I will try to clarify my goals: – Sam Huguet Sep 15 '22 at 12:51
  • I have y data (essentially count data - they're not integers because I scaled them down from larger values) for the first 19 'x' values. In reality, the distribution actually ends at a much greater x value (x_max) - let's say x_max=100. I want to use the y data I have (for x values 1 to 19) to create a distribution which spans all the way to 100 (x_max). - Using this distribution, I could then pick a new x value (let's say at x = 80) and get the corresponding y probability. – Sam Huguet Sep 15 '22 at 12:51
  • Given that I don't have data to represent the entire PDF, is it possible to create one? The issue I can currently see is that I won't know how to scale my existing data, such that the final integral of the PDF = 1. – Sam Huguet Sep 15 '22 at 12:52
  • Thanks, that helps. If I understand correctly, `x` is inherently discrete. That is, it is nonsensical to have a value of `x` be, say, 23.45. Is that correct? – Warren Weckesser Sep 15 '22 at 16:37
  • You are correct – Sam Huguet Sep 16 '22 at 11:02

1 Answer

I solved the problem via these steps:

(1) Warren's comments pointed out that I couldn't treat my data as a PDF directly - the 'area under the curve' was far greater than 1, whereas a PDF must integrate to exactly 1.
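A quick way to see this point is to estimate the area under the raw data. The sketch below is only an illustration: it uses the `x` and `y` arrays from the question and a hand-written trapezoid rule (the variable name `area` is mine, not from the original post).

```python
import numpy as np

# Data from the question.
y = np.array([8.8, 7.2, 5.8, 4.7, 3.8, 3.1, 2.6, 2.2, 2.0, 1.7,
              1.8, 1.8, 1.9, 1.7, 1.4, 1.2, 1.7, 1.2, 1.5])
x = np.arange(1, 20)

# Trapezoid rule written out by hand: average of neighbouring y values
# multiplied by the spacing between the corresponding x values.
area = float(np.sum((y[:-1] + y[1:]) / 2 * np.diff(x)))

# The result is about 51, far greater than the 1 required of a PDF.
print(area)
```

Because the area is so much larger than 1, the raw y values cannot be read as probability densities without rescaling.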

(2) Instead, I fit a curve to my data via the following code:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# x and y are the arrays defined in the question.

# Define the model used for the line of best fit. In my case it's a 5PL
# (five-parameter logistic) equation.
def func_5PL(x, d, a, c, b, g):
    return d + ((a - d) / ((1 + ((x / c) ** b)) ** g))

# Determine the coefficients for the equation.
popt_mock, _ = curve_fit(func_5PL, x, y)

# Plot the real data, along with the line of best fit.
plt.plot(x, func_5PL(x, *popt_mock), label='line of best fit')
plt.scatter(x, y, label='real data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

[Figure: my data with the fitted curve]

(3) Once I had the curve, I rescaled it so that its integral was equal to 1 over the range of x values that I was interested in. I treated this rescaled curve as my PDF.
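The rescaling in the last step could look roughly like the sketch below. It is not the exact code I ran. Several things in it are assumptions: `x_max = 100` is the illustrative upper limit mentioned in the comments, the starting guess `p0` and the non-negativity bounds are added only to keep the fit stable, and the names `support`, `pmf` and `probability_at` are hypothetical. Because x is discrete (as confirmed in the comments), the curve is normalised so that its values over the integers 1 to `x_max` sum to 1, which is the discrete counterpart of making the integral equal to 1. Extrapolating from 19 points out to `x_max` relies entirely on the fitted curve being a sensible model of the tail.

```python
import numpy as np
from scipy.optimize import curve_fit

# Data from the question.
y = np.array([8.8, 7.2, 5.8, 4.7, 3.8, 3.1, 2.6, 2.2, 2.0, 1.7,
              1.8, 1.8, 1.9, 1.7, 1.4, 1.2, 1.7, 1.2, 1.5])
x = np.arange(1, 20)

def func_5PL(x, d, a, c, b, g):
    """Five-parameter logistic curve, as in the answer."""
    return d + (a - d) / (1 + (x / c) ** b) ** g

# Assumption: a rough starting guess and lower bounds at (or just above) zero,
# so the fitted curve stays non-negative. Neither is from the original answer.
p0 = [1.5, 9.0, 3.0, 2.0, 1.0]
lower = [0.0, 0.0, 1e-6, 1e-6, 1e-6]
popt, _ = curve_fit(func_5PL, x, y, p0=p0, bounds=(lower, np.inf), maxfev=10000)

# Assumption: the true distribution ends at x_max = 100 (illustrative value).
x_max = 100
support = np.arange(1, x_max + 1)

# Evaluate the fitted curve on every integer x and rescale so the values sum to 1.
curve = func_5PL(support, *popt)
pmf = curve / curve.sum()

def probability_at(k):
    """Probability assigned to the integer value k (1 <= k <= x_max)."""
    return pmf[k - 1]

print(pmf.sum())           # 1 up to floating-point error
print(probability_at(80))  # probability assigned to x = 80
```

If x were continuous instead, the same idea would apply with the sum replaced by a numerical integral of the fitted curve over the range of interest.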