
Suppose I have a series of hourly measured values, such as the mean wind speed, limited in time by a start and an end date. From these data I can calculate the frequency of the values for individual categories. The first category contains all values from 0 up to (but not including) 0.5 km/h, the second all values from 0.5 up to 1.5 km/h, the third all values from 1.5 up to 2.5 km/h, and so on. Counting all values results in the following distribution (a small numpy sketch of this binning follows the table):

Category    Amount  Frequency (in %)
0-1 km/h    42      0.64
1-2 km/h    444     6.78
2-3 km/h    871     13.30
3-4 km/h    1130    17.25
4-5 km/h    1119    17.08
5-6 km/h    934     14.26
6-7 km/h    703     10.73
7-8 km/h    490     7.48
8-9 km/h    351     5.36
9-10 km/h   219     3.34
10-11 km/h  143     2.18
11-12 km/h  52      0.79
12-13 km/h  13      0.20
13-14 km/h  15      0.23
14-15 km/h  6       0.09
15-16 km/h  6       0.09
16-17 km/h  4       0.06
17-18 km/h  3       0.05
18-19 km/h  4       0.06
20-21 km/h  2       0.03
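
For reference, a minimal sketch of how this binning can be reproduced with numpy (the edges follow the description above; the function name is just for illustration):

import numpy as np

def bin_wind_speeds(speeds_kmh):
    # categories "0-1", "1-2", ..., "20-21" with edges 0, 0.5, 1.5, ..., 20.5
    edges = np.concatenate(([0.0], np.arange(0.5, 21.0, 1.0)))
    counts, _ = np.histogram(speeds_kmh, bins=edges)
    freq_percent = 100.0 * counts / counts.sum()
    return counts, freq_percent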

How can the Weibull scale factor and the Weibull shape factor be determined from these values (e.g. with Python and the reliability package)?

So far I have only passed all individual values from the measurement series to reliability's Fit_Weibull_2P and determined the two parameters that way. However, the determined parameters do not seem to be correct (the curve is drawn incorrectly later on), or I am not passing the values to Fit_Weibull_2P correctly (a rough sketch of my call is shown below).

Does anyone have an idea where my error is, or how this could be solved differently, perhaps not with the individual values but with the frequencies?
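
For reference, I call it roughly like this (a minimal sketch; loading of the measurement series is omitted and the function name is just for illustration):

from reliability.Fitters import Fit_Weibull_2P

def fit_from_raw_values(wind_speeds):
    # wind_speeds: all hourly mean values between the start and end date
    fit = Fit_Weibull_2P(failures=wind_speeds, show_probability_plot=False,
                         print_results=False)
    return fit.alpha, fit.beta  # scale, shape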

  • Well, if all you have are the binned values, the right way to proceed is interval censoring plus weighted maximum likelihood (a sketch of this approach follows these comments). A close second is to apply weighted maximum likelihood to the midpoints of the bins (i.e., ignore the bin width), and a third approach is to approximate the second by inventing replicated data at the bin midpoints, with each midpoint repeated a number of times proportional to the bin frequency, e.g. 64 replicas for 0.5 km/h, 678 for 1.5, 1330 for 2.5, etc. Then apply the ordinary Weibull fitting to that. – Robert Dodier Mar 06 '21 at 00:55
  • But first look to see if whatever library you're using already handles binned or censored data. – Robert Dodier Mar 06 '21 at 00:55
  • Thanks, I have tested `scipy` (exponweib.fit) and `reliability` (Fit_Weibull_2P), both with all sample data, and with both functions I get values for shape and scale that seem to be underestimated (shape: 2.01, scale: 3.68). So I tried to find a way to estimate the parameters from the bins of the histogram. Using `exponweib.fit_loc_scale(data, 1, 1)` on the binned values I get different results: shape: 0.92, scale: 6.32. I would expect values around 1.98 for the shape and 5.60 for the scale, as suggested by another web application that I use as a reference for the test data. The results from R seem to fit. – SnoopyBrown Mar 08 '21 at 08:45
  • Probably obvious: if you want to fit, e.g., estimated power from a wind farm, do importance weighting: min integral( powercurve * (data - Weibull) ) may be quite different from min integral( data - Weibull ). – denis Jul 23 '21 at 08:53
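
A minimal sketch of the interval-censoring approach from the comments, using only scipy (the bin edges follow the description in the question; starting values and bounds are rough guesses):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

# bin edges as described above: "0-1" = [0, 0.5), "1-2" = [0.5, 1.5), ...,
# "18-19" = [17.5, 18.5); the last row "20-21" is taken as [19.5, 20.5)
lo = np.array([0.0] + [k - 0.5 for k in range(1, 19)] + [19.5])
hi = np.array([0.5] + [k + 0.5 for k in range(1, 19)] + [20.5])
n = np.array([42, 444, 871, 1130, 1119, 934, 703, 490, 351, 219,
              143, 52, 13, 15, 6, 6, 4, 3, 4, 2])

def nll(params):
    # negative log-likelihood: each bin contributes count * log(mass in bin)
    shape, scale = params
    p = weibull_min.cdf(hi, shape, scale=scale) - weibull_min.cdf(lo, shape, scale=scale)
    return -np.sum(n * np.log(np.clip(p, 1e-300, None)))

res = minimize(nll, x0=[2.0, 5.0], bounds=[(0.05, 20.0), (0.05, 100.0)])
shape_hat, scale_hat = res.x
print(shape_hat, scale_hat)

This uses the bin boundaries directly, so no pseudo-samples at the midpoints have to be invented.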

3 Answers


I don't know what your sample data looks like, but this gives a pretty good approximation even using only the binned data. Compare (1) fitting without floc=0 with (2) specifying floc=0 to force the location (left boundary) to be 0.

import numpy as np
from scipy.stats import weibull_min

# reconstruct one pseudo-observation per count, placed at the bin midpoint
# (the first bin 0-0.5 km/h has midpoint 0.25, the others sit at the integers)
x = np.concatenate((np.repeat(.25, 42), np.repeat(1, 444), np.repeat(2, 871), np.repeat(3, 1130),
            np.repeat(4, 1119), np.repeat(5, 934), np.repeat(6, 703),
            np.repeat(7, 490), np.repeat(8, 351), np.repeat(9, 219),
            np.repeat(10, 143), np.repeat(11, 52), np.repeat(12, 13),
            np.repeat(13, 15), np.repeat(14, 6), np.repeat(15, 6),
            np.repeat(16, 4), np.repeat(17, 3), np.repeat(18, 4), [20, 20]))

# weibull_min.fit returns (shape, loc, scale)
print(weibull_min.fit(x)) #1
(1.8742154858771933, 0.13126151114447493, 4.99670007482597)

print(weibull_min.fit(x, floc=0)) #2
(1.9446899445880135, 0, 5.155845183708194)
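
To check the fit visually (the question mentions that the curve is drawn incorrectly later), one could overlay the fitted density on a normalized histogram; a minimal sketch, assuming matplotlib is available and reusing the imports and x from the snippet above:

import matplotlib.pyplot as plt

shape, loc, scale = weibull_min.fit(x, floc=0)
grid = np.linspace(0, 21, 300)
plt.hist(x, bins=np.arange(0, 22), density=True, alpha=0.5, label="binned data")
plt.plot(grid, weibull_min.pdf(grid, shape, loc, scale), label="fitted Weibull")
plt.xlabel("wind speed (km/h)")
plt.ylabel("density")
plt.legend()
plt.show()
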
Vons
  • Okay, this looks quite good! Why did you use `.25` at first? As the center between the min and max values of the bin? – SnoopyBrown Mar 09 '21 at 12:51
  • Yes, I just took the average of the left and right sides of the bins. Ideally `weibull_min.fit` takes your actual data points, but since the bins are small I thought why not. – Vons Mar 09 '21 at 17:14
  • Thanks for the explanation! This way I got the expected result. – SnoopyBrown Mar 09 '21 at 19:59

This may or may not help you, but here is how you could do it in R.

library(dplyr)
library(tidyr)
library(fitdistrplus)

text="
Category    Amount  'Frequency (in %)'
'0-1 km/h'    42      0.64
'1-2 km/h'    444     6.78
'2-3 km/h'    871     13.30
'3-4 km/h'    1130    17.25
'4-5 km/h'    1119    17.08
'5-6 km/h'    934     14.26
'6-7 km/h'    703     10.73
'7-8 km/h'    490     7.48
'8-9 km/h'    351     5.36
'9-10 km/h'   219     3.34
'10-11 km/h'  143     2.18
'11-12 km/h'  52      0.79
'12-13 km/h'  13      0.20
'13-14 km/h'  15      0.23
'14-15 km/h'  6       0.09
'15-16 km/h'  6       0.09
'16-17 km/h'  4       0.06
'17-18 km/h'  3       0.05
'18-19 km/h'  4       0.06
'20-21 km/h'  2       0.03
"
df=read.table(text=text, header=TRUE)
# interval bounds per row: '0-1' = [0, 0.5), '1-2' = [0.5, 1.5), and so on
# (note: the loop assigns the last row, '20-21 km/h', the interval 18.5-19.5;
# strictly it should be 19.5-20.5, but with only 2 observations this hardly matters)
left=c(0)
right=c(.5)
for (i in 2:20) {
  left[i]=i-2+.5
  right[i]=i-1+.5
}
df1=mutate(df, left=left, right=right)
# expand to one row per observation and keep only the interval bounds
df1=uncount(df1, Amount)
bins=select(df1, left, right)
fitdistcens(bins, "weibull")

Fitting of the distribution ' weibull ' on censored data by maximum likelihood 
Parameters:
      estimate
shape 1.953459
scale 5.152375
Vons
  • Thanks. The values seem to fit the results I would expect, compared to a reference application. But since I want to evaluate the data via a web interface using Java, I thought a small Python script would be quite suitable. – SnoopyBrown Mar 08 '21 at 10:34
  • @SnoopyBrown I'm getting pretty good results using weibull_min from scipy – Vons Mar 08 '21 at 15:07

This is a case of interval-censored data. That is, each data point is not known exactly, but is known to have occurred within some window.

The Python package surpyval (I am its author) is a good way to do this.

import surpyval as surv

# count vector
n = [42, 444, 871, 1130, 1119, 934, 703, 490, 351, 219, 143, 52, 13, 15, 6, 6, 4, 3, 4, 2]
# interval vector
x = [[l, u] for l, u in zip(range(0, 19), range(1, 20))] + [[20, 21]]

model = surv.Weibull.fit(x=x, n=n)
model
Parametric SurPyval Model
=========================
Distribution        : Weibull
Fitted by           : MLE
Parameters          :
     alpha: 5.726746093800134
      beta: 2.1824674168785507

It also appears that your data is right-truncated; that is, there are no observations above 21 km/h. This can also be added to the estimate.

model = surv.Weibull.fit(x=x, n=n, tr=21)
model
Parametric SurPyval Model
=========================
Distribution        : Weibull
Fitted by           : MLE
Parameters          :
     alpha: 5.726746697131137
      beta: 2.182465361355963

As the output shows, this doesn't change the answer here.