2

Non-Censored (Complete) Dataset

I am attempting to use the scipy.stats.weibull_min.fit() function to fit some life data. Example generated data is contained below within values.

values = np.array(
    [10197.8, 3349.0, 15318.6, 142.6, 20683.2, 
    6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)

I attempt to fit using the function:

fit = scipy.stats.weibull_min.fit(values, loc=0)

The result:

(1.3392877335100251, -277.75467055900197, 9443.6312323849124)

Which isn't far from the nominal beta and eta values of 1.4 and 10000.

Right-Censored Data

The weibull distribution is well known for its ability to deal with right-censored data. This makes it incredibly useful for reliability analysis. How do I deal with right-censored data within scipy.stats? That is, curve fit for data that has not experienced failures yet?

The input form might look like:

values = np.array(
    [10197.8, 3349.0, 15318.6, 142.6, np.inf, 
    6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)

or perhaps using np.nan or simply 0.

Both of the np solutions are throwing RunTimeWarnings and are definitely not coming close to the correct values. I using numeric values - such as 0 and -1 - removes the RunTimeWarning, but the returned parameters are obviously flawed.

Other Softwares

In some reliability or lifetime analysis softwares (minitab, lifelines), it is necessary to have two columns of data, one for the actual numbers and one to indicate if the item has failed or not yet. For instance:

values = np.array(
    [10197.8, 3349.0, 15318.6, 142.6, 0, 
    6976.5, 2590.7, 11351.7, 10177.0, 3738.4]
)

censored = np.array(
    [True, True, True, True, False,
    True, True, True, True, True]
)

I see no such paths within the documentation.

slightlynybbled
  • 2,408
  • 2
  • 20
  • 38
  • Somewhere on the way, `np.log` seems to be used and I guess that `inf`, `nan` and `0` cause issues there. Would it be possible for you to replace those entries somehow, `values[np.isinf(values)] = 10000.`, `values[np.isnan(values)] = 1.` and `values[np.isclose(values, 0.)] = 10 ** (-6)` or something like this? – Cleb Dec 14 '17 at 08:45
  • @Cleb it isn't an issue of numeric values. For instance `0` works just fine as the algorithm makes its way through, but `0` is simply doesn't produce the correct results. – slightlynybbled Dec 14 '17 at 11:51
  • OK, was just a wild guess, as I saw `RuntimeWarning: invalid value encountered in subtract return np.log(c) + sc.xlogy(c - 1, x) - pow(x, c)`; so I thought the `log(c)` part could be the issue. When I then ran your example and replaced `np.inf` by another higher value, it worked fine. But I know too little about this to be of help, I am afraid... – Cleb Dec 14 '17 at 11:58
  • As you mentioned "other softwares": Could you make this more specific and - if available - also provide code how you would call it in the other language? – Cleb Dec 14 '17 at 12:05
  • 1
    @Cleb Other softwares are things like [minitab](http://www.minitab.com) (I am not a user, but I have seen video tutorials). The [lifelines](https://github.com/CamDavidsonPilon/lifelines) project supports a similar interface. I have been trying different avenues for this type of analysis for a while now, so I also have a [github repository](https://github.com/slightlynybbled/weibull) that I'm playing with to get the functionality that I need. I may end up attempting to merge that with lifelines, but i want to get it more functional before that point. – slightlynybbled Dec 14 '17 at 12:17
  • FYI: You used `fit = scipy.stats.weibull_min.fit(values, loc=0)`, and if you notice, the returned value for `loc` is `-277.75...`. If you intended to fix the location at 0, you must use the argument `floc=0`. That is, use `fit = scipy.stats.weibull_min.fit(values, floc=0)`. – Warren Weckesser Dec 14 '17 at 20:09
  • "I see no such paths within the documentation.": I can't find anything like that either. Nor can I find a way of doing regression with Weibull errors in statsmodel. However, it's possible in R. See http://www.itl.nist.gov/div898/handbook/apr/section4/apr413.htm, in case you haven't encountered that. – Bill Bell Dec 14 '17 at 21:02
  • Kaplan-Meier is also 'natural' for right-censored data, and is available in lifelines which you know about. I'm curious about the deeper concerns that might be expressed by your question. – Bill Bell Dec 14 '17 at 22:05
  • @WarrenWeckesser the `floc=0` appears to be a good suggestion, but unfortunately not the answer :/ @BillBell Unfortunately, I"m stuck with Weibull. I'm in engineering and this is the most widely used distribution without having a great reason to switch - like demonstrating that the Weibull is not an adequate fit. Thank you! I am fortunately learning a lot from the process! – slightlynybbled Dec 15 '17 at 01:20
  • The suggestion to use `floc=0` was not meant to be an answer to your question. I just wanted to be sure that you were using the arguments to the `fit` method correctly in the non-censored case. – Warren Weckesser Dec 15 '17 at 01:56
  • Another FYI (not an answer): A sample that is right-censored would not be recorded as 0, nan or inf. If your data is, say, time until failure of a collection of machines under test, and the test is halted after 14400 minutes, then the sample times for those machines that had not failed by the end of the test would be recorded as 14400, and they would be flagged as being right-censored (meaning the failure time is greater than 14400 minutes). – Warren Weckesser Dec 19 '17 at 02:14
  • @slightlynybbled I'm the lifelines author. The `WeibullFitter.fit` accepts the durations and the event array: `WeibullFitter.fit(values, event_observed=censored)` – Cam.Davidson.Pilon Dec 21 '17 at 01:52
  • @Cam.Davidson.Pilon Yes, I like the lifelines package! After I understand more, I'm planning to take a closer look at the package. On first glance, I wasn't certain that the lifelines package implemented the standard 2-parameter weibull function that is required. I'm relatively new to reliability analysis and I'm just getting a handle on all of the vocabulary, but I will take another look soon. – slightlynybbled Dec 21 '17 at 15:40

1 Answers1

1

Old question but if anyone comes across this, there is a new survival analysis package for python, surpyval, that handles this, and other cases of censoring and truncation. For the example you provide above it would simply be:

import surpyval as surv
values = np.array([10197.8, 3349.0, 15318.6, 142.6, 6976.5, 2590.7, 11351.7, 10177.0, 3738.4])

# 0 = failed, 1 = right censored
censored = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0])

model = surv.Weibull.fit(values, c=censored)
print(model.params)

(10584.005910580288, 1.038163987652635)

You might also be interested in the Weibull plot:

model.plot(plot_bounds=False)

Weibull plot

Full disclosure, I am the creator of surpyval