I want to calculate the cumulative probabilities for my sample within each factor level and save them to a data frame. However, the calculated probabilities don't reach 1.0 in every group; some stop at, e.g., 0.7, which cannot be right. Only one group ever reaches 1.0.

Here is a reproducible example:

library(datasets)
library(dplyr)  # provides %>%, group_by() and reframe() (dplyr >= 1.1.0)

ecdf_fun <- ecdf(iris$Sepal.Width)

dset <- iris %>% group_by(Species) %>%
  reframe(ecdval = ecdf_fun(Sepal.Width))

Which delivers:

    Species    ecdval
1   setosa     1.000000000
2   setosa     0.993333333
...
51  versicolor 0.833333333
52  versicolor 0.753333333
...
101 virginica  0.960000000
102 virginica  0.960000000

ADD-ON: Ideally, I would like to retrieve the cumulative probabilities together with their respective x values (Sepal.Width).

    Species    ecdval       Sepal.Width
1   setosa     1.000000000  0.6
2   setosa     0.993333333  ...
...
51  versicolor 0.833333333  1.8
52  versicolor 0.753333333  ...
...
101 virginica  0.960000000  2.5
102 virginica  0.960000000  ...
Joschi Nin
    You are calculating the `ecdf` on the whole dataset but then grouping it to create `dset`, so it is not surprising that only one group contains the maximum value (i.e. the ecdf reaches 1.0) – Andrew Gustar May 05 '23 at 15:13

1 Answer

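As Andrew Gustar says, the ecdf() needs to be computed within each group rather than on the whole dataset. Using mutate() instead of reframe() keeps the original columns (including Sepal.Width) alongside the ECDF values: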

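library(dplyr)
library(ggplot2)

# Build the ECDF from each species' own values, then evaluate it at those values
dset <- iris %>% group_by(Species) %>%
  mutate(ecdval = ecdf(Sepal.Width)(Sepal.Width))

ggplot(dset, aes(Sepal.Width, ecdval, col=Species)) + geom_point() + geom_line()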
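To check the difference between the pooled and the grouped ECDF without plotting, here is a minimal sketch in base R. The names `pooled_fun`, `pooled`, `grouped`, `max_pooled` and `max_grouped` are only illustrative, and `ave()` stands in for the grouped `mutate()` so that no packages are needed:

```r
# ECDF built once from all 150 rows, as in the question.
pooled_fun <- ecdf(iris$Sepal.Width)
pooled <- pooled_fun(iris$Sepal.Width)

# ECDF built separately within each species and evaluated at that
# species' own values (same idea as the grouped mutate()).
grouped <- ave(iris$Sepal.Width, iris$Species,
               FUN = function(x) ecdf(x)(x))

# Largest cumulative probability reached in each species.
max_pooled  <- tapply(pooled,  iris$Species, max)
max_grouped <- tapply(grouped, iris$Species, max)

print(max_pooled)   # only the species holding the overall maximum reaches 1
print(max_grouped)  # every species reaches 1
```

With the pooled version, a species can only reach 1.0 if it contains the largest Sepal.Width of the whole dataset, which is why the question's output tops out below 1 for two of the three groups.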

George Savva
  • Great, this works! Thank you, this is exactly what I needed. Could you please explain why there are two pairs of parentheses in `ecdf(Sepal.Width)(Sepal.Width)`? I am not familiar with this syntax and couldn't find an explanation online; it would be useful for the future. – Joschi Nin May 08 '23 at 07:12
    `ecdf(x)` returns a function (as in your code in the question) so `ecdf(x)(x)` returns the value of that function at each value of `x` – George Savva May 08 '23 at 14:46
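
The double parentheses can be unpacked into two steps. A small sketch with a made-up vector `x` (the names `F_x`, `two_step` and `one_step` are just for illustration):

```r
x <- c(2, 4, 4, 7)

# Step 1: ecdf() does not return numbers; it returns a function
# built from the data.
F_x <- ecdf(x)
is.function(F_x)   # TRUE

# Step 2: call that function like any other to get cumulative probabilities.
two_step <- F_x(x)

# Same thing in one expression: build the function and call it immediately.
one_step <- ecdf(x)(x)

identical(two_step, one_step)   # TRUE
print(one_step)                 # 0.25 0.75 0.75 1.00
```

Inside a grouped `mutate()`, the one-expression form is convenient because a fresh ECDF is built from each group's values before being evaluated on them.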