Shape estimation in truncated Pareto in R

Question

library(VGAM)
library(fitdistrplus)

fitdist(u_NI$k_u, 'truncpareto',
         start = list(lower=1,
                      upper=42016,
                      shape=1)) -> fit.k_u

length(u_NI$k_u) = 637594

I got this error:

<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, ddistnam = ddistname, hessian = TRUE, method = meth,     lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Error in fitdist(u_NI$k_u, "truncpareto", start = list(lower = 1, upper = 42016,  : 
  the function mle failed to estimate the parameters, 
                with the error code 100
In addition: Warning messages:
1: In fitdist(u_NI$k_u, "truncpareto", start = list(lower = 1, upper = 42016,  :
  The dtruncpareto function should return a zero-length vector when input has length zero
2: In fitdist(u_NI$k_u, "truncpareto", start = list(lower = 1, upper = 42016,  :
  The ptruncpareto function should return a zero-length vector when input has length zero

Is the problem in the excessive size of the dataset or in the start parameters?

REPRODUCIBILE EXAMPLE:

library(VGAM)
library(fitdistrplus)

rtruncpareto(100,1,100,1.5) -> a
fitdist(a, "truncpareto",
        start = list(lower=1,
                     upper=100,
                     shape=1.5))

This is not going to work, and I don't understand why.

It seems it has a problem here:

argument 'lower' must be positive

Maybe following the [documentation](https://search.r-project.org/CRAN/refmans/VGAM/html/paretoff.html) and last example of package VGAM. — Rui Barradas, Feb 05 '22 at 17:09

jde · Answer 1 · 2022-02-05T18:58:52.023

I am not exactly sure, but this might be what your after.

library(fitdistrplus)
library(VGAM)
library(ggplot2)

u_NI <- data.frame("k_u" = rtruncpareto(10000,
                                        lower = 1,
                                        upper = 100,
                                        shape = 1.5))

fit <- vglm(k_u ~ 1,
            truncpareto(lower=1, upper=max(u_NI$k_u) + 1),
            data = u_NI,
            trace = TRUE)

x <- seq(1, max(u_NI$k_u), length.out = 10000)
y <- dtruncpareto(x, shape = fit@coefficients[["(Intercept)"]],
                  lower=fit@extra$lower,
                  upper=fit@extra$upper)
pareto <- cbind.data.frame(x, y)

ggplot()+
  geom_density(data = u_NI, aes(k_u))+
  geom_line(data = pareto, aes(x = x, y = y, color = "truncpareto"),linetype = 5, size = 1.3)+
  theme(legend.position = c(.95, .95),
        legend.justification = c(1, 1),
        legend.title = element_blank())

########################## Edit:

# fix lower and upper boundary, estimate may still fail, depending on     the start value of shape
fitdist(u_NI$k_u, 'truncpareto',
        method = "mle",
        start = list(shape=1),
        fix.arg=list(lower=1, upper=max(u_NI$k_u) + 0.01),
        control=list(trace=1, REPORT=1)) -> fit.k_u

# use different method to estimate parameter, mge seems to work
fitdist(u_NI$k_u, 'truncpareto',
        method = "mge",
        start = list(shape=1,
                     lower=1,
                     upper=max(u_NI$k_u) + 1),
        control=list(trace=1, REPORT=1)) -> fit.k_u

The errors that remain refer to this:

> pnorm(numeric(), 1,10)
numeric(0)
> ptruncpareto(numeric(), 1,10,5)
[1] NA

####### Edit2: I think i found the original error and its a bit confusing. However, there is also a lower parameter for the mle, that is separate from the lower parameter of the distribution. It should restrain parameter estimates to >= 0. Hence, this should work, however it takes a good while, even for just 10,000 values:

fitdist(u_NI$k_u, 'truncpareto',
        method = "mle",
        start = list(shape=1,
                     lower=1,
                     upper=max(u_NI$k_u) +1 ),
        control=list(trace=1, REPORT=1),
        lower = 0) -> fit.k_u

This is a visual representation of the fit, but I just want the shape estimate, given lower and upper bounds. In the fit object I can find values for the estimates of lower and upper bounds, and then a value called "coefficient". But it is reported as Intercept. Even in the plot, shape refers to is as the Intercept. It is also noteworthy different from the parameter... — GiulioGCantone, Feb 05 '22 at 17:22
When I apply your code to my data, this is the error: `Error in eval(slot(family, "initialize")) : the value of argument 'lower' is too high (requires '0 < lower < min(y)')` — GiulioGCantone, Feb 05 '22 at 17:28
1, just as in the toy data. Curiously, with `k_u+1` it converges to estimates; even if again, these are not unbiased. — GiulioGCantone, Feb 05 '22 at 18:12
I can only add that by other numerical means, I believe that the data in `k_u` are coherent with the hypothesis that the shape parameter for `k_u` is 1.7, given a upper bound = 42016. — GiulioGCantone, Feb 05 '22 at 18:22
from what I can tell, the error stems from incompatible parameter estimates of mle, e.g. negative lower boundaries or shape estimates. I have found two possibilities to circumvent them and edited my original answer. — jde, Feb 05 '22 at 18:35

Ben Bolker · Accepted Answer · 2022-02-06T00:03:21.960

2

Part of your problem is a more general issue, i.e. that for a truncated distribution the MLE of the boundary parameter(s) is in general equal to the observed min/max of the data set. So you should always be able to do at least as well by setting the values of the lower/upper bounds equal to the min/max (based on my experience trying this they have to be slightly below/above the observed bounds). (I also found I had to set lower = 0 to stop the algorithm from trying negative values of the shape parameter.)

library(VGAM); library(fitdistrplus)
set.seed(101)
rtruncpareto(100,1,100,1.5) -> a
eps <- 1e-8
fitdist(a, "truncpareto",
        start = list(shape=1.5),
        fix.arg = list(lower = min(a) - eps, upper = max(a) + eps),
        lower = 0)
Parameters:
      estimate Std. Error
shape 1.349885  0.1554436
Fixed parameters:
          value
lower  1.006844
upper 25.906577

An alternative to fitdist would be bbmle:

library(bbmle)
m1 <- mle2(a ~ dtruncpareto(lower = min(a) - eps,
                            upper = max(a) + eps,
                            shape = exp(lshape)),
           start = list(lshape = 0),
           data = data.frame(a),
         method = "BFGS")
exp(coef(m1))   
  lshape 
1.349884

bbmle is a little bit more flexible and allows you to fit the shape parameter on the log scale, which is generally more robust (and makes the standard deviation estimates more useful). Using method = "BFGS" here because the default Nelder-Mead algorithm works poorly for 1-dimensional optimization; could also use method = "Brent" (which would be more efficient and robust) but would then need to provide explicit lower and upper bounds for the lshape parameter.

edited Feb 06 '22 at 00:03

answered Feb 05 '22 at 20:39

Ben Bolker

211,554
25
370
453

When I follow your code with bbmle, it works. When I plug my data instead of toy data, this is the error: Error in dtruncpareto(x = c(2273L, 2148L, 1838L, 1702L, 1672L, 1559L, : argument 'shape' must be positive This happens also when I set start = list(shape = 1.5) – GiulioGCantone Feb 11 '22 at 17:49
What do you see if you add `trace=TRUE` to the `mle2()` call? – Ben Bolker Feb 11 '22 at 19:26
1.5 303055.6 1.501 303447.8 1.499 302664.4 Error in dtruncpareto(x = c(2273L, 2148L, 1838L, 1702L, 1672L, 1559L, : argument 'shape' must be positive It seems that it starts, than it fails. – GiulioGCantone Feb 11 '22 at 19:46
how big is your data set? Can you post it or post a link to it? The only thing I can think of is that `lshape` is going strongly negative so that `lshape` underflows to 0 (and that the tracing is only happening *after* the likelihood evaluation, so we're not seeing when it fails). What happens if you try `shape = 1e-8 + exp(lshape)` instead of `shape = exp(lshape)` ? – Ben Bolker Feb 11 '22 at 19:56
Right now I cannot share the database; I could extract just the vector later. When I add `1e-8` the algorithm converges into `exp(1.1) = 2,77`; when I apply `shape = lshape`, it converges into `2.77`. This result it's a bit off of my "prior", given results obtained with `mge`. – GiulioGCantone Feb 11 '22 at 20:16
Not sure what "when I apply `shape = lshape` means ... not surprising that two different objective functions (MLE vs MGE ['maximum goodness of fit', whatever that is precisely, I'm not sure]) don't give the same answer, not sure how to judge how different they "should" be, especially without seeing the data ... – Ben Bolker Feb 11 '22 at 20:26
However, I filtered `k_u > x`, with `x` being integers and I checked that the `alpha` is coherent with `mge` for all integers. The shift is only for `x = 0`, when `k_u` includes `k_u == 1`, which leads me to think that the vector is a truncated pareto with 1-inflation, and the 1-inflation is responsible for the mis-estimation. – GiulioGCantone Feb 11 '22 at 20:26

Shape estimation in truncated Pareto in R

2 Answers2