0

It might be a simple question but I have been trying to fix this issue for a few days so any help is really appreciated. If there is something wrong about my question, please leave a comment so I know what I am doing wrong.

I have the following code:

library(ggplot2)
library(rlang)
library(MASS)

normalize_btw_zero_and_one <- function(x){(x-min(x)+.00000001)/(max(x)-min(x)+.00000002)}

get_histogram <- function(data_set, column_name, bin_width, attribute_name) {
  colname     <- as_label(enquo(column_name))
  gamma_par   <- MASS :: fitdistr(data_set[[colname]], "gamma")
  weibull_par <- MASS :: fitdistr(data_set[[colname]], "weibull")
  beta_par    <- MASS :: fitdistr(normalize_btw_zero_and_one(data_set[[colname]]), dbeta, start = list(shape1 = 1, shape2 = 1))
  ggplot(data_set, aes(x= {{ column_name }})) +
    geom_histogram(aes(y= after_stat(density)), binwidth = bin_width, fill= "lightblue", colour="black") +
    xlab(paste0(attribute_name)) +
    stat_function(fun = dnorm , args= list(mean= mean(data_set[[colname]]), sd= sd(data_set[[colname]])), 
                  mapping = aes(colour = "Normal"))+
    stat_function(fun = dlnorm, args = list(meanlog= mean(log(data_set[[colname]])), sdlog= sd(log(data_set[[colname]]))),
                  mapping = aes(colour = "LogNormal")) + 
    stat_function(fun = dgamma, args= list(shape= gamma_par$estimate[[1]]  , rate=gamma_par$estimate[[2]] ),
                  mapping = aes(colour= "Gamma"))+
    stat_function(fun = dweibull, args= list(shape= weibull_par$estimate[[1]]  , scale=weibull_par$estimate[[2]] ),
                  mapping = aes(colour= "Weibull"))+
    stat_function(fun = dexp, mapping = aes(colour = "Exponential")) +
    stat_function(fun = dbeta, args = list(shape1= beta_par$estimate[[1]], shape2= beta_par$estimate[[2]] ),
                  mapping = aes(color = "Beta")) +
    scale_colour_manual("Distribution", values = c("red", "blue", "lightgreen","pink", "purple" , "yellow"))
}

set.seed(30333)
test_dt <- rnorm(10000, 30, 1)
df <- data.frame(test_dt)
rm(test_dt)

get_histogram(df, test_dt, .3, "Test ")

Issue: I want to create a histogram and overlay multiple density functions on it. Everything works fine except for the Beta distribution. The issue lies in the fact that I scaled the data to (0,1) for the calculation of the density function, but the histogram still displays the original data.

Question 1: How can I fix the overlay of the Beta distribution function?

Question 2: If I want to have frequencies on the y-axis instead of densities, what steps should I take?

Joe the Second
  • 339
  • 1
  • 11

1 Answers1

1

I'm not convinced about attempting to fit the beta distribution here, but if you want to try it, you will need to define a density function that normalizes the vector internally, gets the dbeta and scales it according to the range of the new support:

get_histogram <- function(data_set, column_name, bin_width, attribute_name) {
  
  colname <- as_label(enquo(column_name))
  x <- data_set[[colname]]
  shift <- min(x)
  scale <- diff(range(x))
  norm_it <- function(x) (x - shift + 1e-8) / (diff(range(x)) + 2e-8)
  dbeta_norm <- function(x, shape1, shape2) {
    dbeta(norm_it(x), shape1, shape2)/scale
  }
  
  gamma_par   <- fitdistr(x, "gamma")$estimate
  weibull_par <- fitdistr(x, "weibull")$estimate
  beta_par    <- fitdistr(norm_it(x), dbeta, 
                          start = list(shape1 = 1, shape2 = 1))$estimate
  
  ggplot(data_set, aes(x = {{ column_name }})) +
    geom_histogram(aes(y = after_stat(density)), binwidth = bin_width, 
                   fill = "lightblue", colour="black") +
    xlab(paste0(attribute_name)) +
    stat_function(fun = dnorm, args = list(mean = mean(x), sd = sd(x)), 
                  mapping = aes(colour = "Normal")) +
    stat_function(fun = dlnorm, 
                  args = list(meanlog = mean(log(x)), sdlog = sd(log(x))),
                  mapping = aes(colour = "LogNormal")) + 
    stat_function(fun = dgamma, 
                  args = list(shape = gamma_par[1], rate = gamma_par[2]),
                  mapping = aes(colour = "Gamma")) +
    stat_function(fun = dweibull, 
                  args = list(shape = weibull_par[1], scale = weibull_par[2]),
                  mapping = aes(colour= "Weibull"))+
    stat_function(fun = dexp, mapping = aes(colour = "Exponential")) +
    stat_function(fun = dbeta_norm, 
                  args = list(shape1 = beta_par[1], shape2 = beta_par[2] ),
                  mapping = aes(color = "Beta")) +
    scale_colour_manual("Distribution", 
                        values = c("red", "blue", "lightgreen",
                                   "pink", "purple" , "yellow"))
}

Resulting in:

get_histogram(df, test_dt, .3, "Test ")

enter image description here

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • Thanks for responding. I'm working with various datasets, some Beta, some not. Need to visually identify the best-fit distribution. Your code works on simulated data but produces puzzling graphs on my data. In some cases, the histogram is flat overlaying on y-axis, and the Beta line doesn't fit. I wonder what is the reason. so, why dbeta_norm() has a numerator of "scale"? – Joe the Second Aug 22 '23 at 16:18
  • I mean the denominator should be (scale + shift).? – Joe the Second Aug 22 '23 at 20:33