
Using ggplot2's stat_ecdf() function, I have plotted an empirical cumulative distribution function (ECDF). I need to shade the area under the curve between two x-axis values and then convert the plot to plotly output. I have replicated the scenario with the iris dataset using the following code:

library(ggplot2)
library(plotly)  # provides ggplotly()

iris <- datasets::iris

iris <- iris[order(iris$Sepal.Length),]

(plot_1 <- ggplot(iris, aes(Sepal.Length)) + 
    stat_ecdf() +
    scale_x_reverse())

plot_1_plotly <- ggplotly(plot_1)
plot_1_plotly

(plot_2 <- ggplot(iris, aes(Sepal.Length)) + 
    # after_stat(y) replaces the deprecated ..y.. notation (ggplot2 >= 3.3.0)
    stat_ecdf(aes(ymin = 0, ymax = after_stat(y)), geom = "ribbon",
              alpha = 0.2, fill = "blue") +
    stat_ecdf(geom="step") +
    scale_x_reverse())

plot_2_plotly <- ggplotly(plot_2)
plot_2_plotly
  • plot_1 produces this output, which is a normal CDF curve (not shaded)
  • plot_1_plotly produces this output, which is the plotly version (not shaded)
  • plot_2 produces this output, which is my attempt at getting the area under the curve shaded (with help from the answer to this question)
  • plot_2_plotly produces this output, which is the plotly version of plot_2

Question 1: In plot_2 output, how do I restrict the shaded area between two x-axis values (say x = 6 and x = 7)?

Question 2: When I convert plot_2 to plotly output (i.e., plot_2_plotly), why does the shaded area get distorted as shown in the output? How can I restore the shading so it matches plot_2?

Siddharth Gosalia

1 Answer


I was running into a similar issue trying to shade a region of the CDF curve for an exponential survival function. Using geom_polygon I was able to find a solution for a line plot of the CDF.

library(dplyr)
library(ggplot2)

set.seed(1)  # optional: makes the simulated data reproducible

# simulate Poisson-distributed counts (mean 15), then compute the
# cumulative count and cumulative proportion for each value
cumulative_frequencies <- data.frame(person_id = 1:10000,
                                     num_active_days = rpois(10000, lambda = 15)) %>%
  group_by(num_active_days) %>%
  summarise(num_people = n()) %>%
  arrange(num_active_days) %>%
  mutate(cum_frequency = cumsum(num_people),
         rel_cumfreq = cum_frequency / sum(num_people))


# create cdf curve
p <- ggplot(cumulative_frequencies, aes(x=num_active_days, y=rel_cumfreq)) +  
   geom_line() + 
   xlab("Time") + 
   ylab("Cumulative Density")  + theme_classic()
p


Then shading in the desired area under the curve using geom_polygon:

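# x-range of the region to shade under the curve
x_start <- 15
x_end   <- 20

# Subset the curve to that range, keeping only the plotted columns
shade_region <- as.data.frame(subset(cumulative_frequencies,
                                     num_active_days >= x_start & num_active_days <= x_end,
                                     select = c(num_active_days, rel_cumfreq)))

# Add a point at y = 0 on each side so the polygon closes along the x-axis
shade <- rbind(data.frame(num_active_days = x_start, rel_cumfreq = 0),
               shade_region,
               data.frame(num_active_days = x_end, rel_cumfreq = 0))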

# add shading to cdf curve
p + geom_polygon(data = shade, aes(num_active_days, rel_cumfreq))
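
The same idea can be carried over to the iris ECDF from the question. What follows is only a sketch, not a tested answer to the plotly part: it assumes the limits x = 6 and x = 7 from Question 1, evaluates base R's ecdf() on a fine grid between them, and draws the result with geom_ribbon instead of geom_polygon. The names ecdf_fun, shade_iris, and p_iris are made up for this illustration.

```r
library(ggplot2)

# Empirical CDF of the variable plotted in the question
ecdf_fun <- ecdf(iris$Sepal.Length)

# Shading limits (assumed here; taken from Question 1)
x_start <- 6
x_end   <- 7

# Evaluate the ECDF on a fine grid inside the limits so the ribbon
# follows the steps of the curve closely
shade_iris <- data.frame(x = seq(x_start, x_end, length.out = 501))
shade_iris$y <- ecdf_fun(shade_iris$x)

# Ribbon from y = 0 up to the ECDF, drawn only between the limits,
# with the full step curve on top
p_iris <- ggplot(iris, aes(Sepal.Length)) +
  geom_ribbon(data = shade_iris, aes(x = x, ymin = 0, ymax = y),
              inherit.aes = FALSE, fill = "blue", alpha = 0.2) +
  stat_ecdf(geom = "step") +
  scale_x_reverse()
p_iris
```

Because the shaded region is built from ordinary precomputed data rather than inside stat_ecdf(), changing x_start and x_end is enough to move it. Whether ggplotly(p_iris) renders the ribbon correctly has not been checked here.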


daszlosek