I have a large dataset with original and imputed values and the proportional differences between the two. The quantiles of the proportional differences are:

> quantile(p$prdif, probs=c(0, 0.1, 0.2, 0.3, .4, .5,0.6, 0.7, 0.8, 0.9, 1))
         0%         10%         20%         30%         40%         50%         60%         70%         80% 
-0.99269227 -0.43367924 -0.22983182 -0.07498240  0.06285345  0.20829226  0.39253900  0.65837197  1.18619469 
        90%        100% 
11.25010211         Inf

I plot the histogram of the proportional differences with the following command:

library(dplyr)
library(ggplot2)

# Calculate means
mu <- p %>% filter(orig != 0) %>%
  summarise(mu1 = mean(orig), mu2 = mean(imp), mu3 = mean(dif), mu4 = mean(prdif) )
ggplot(p %>% filter(orig != 0), aes(x= prdif) )+
  geom_histogram(aes(y=..density..), position="identity", alpha=0.4, fill = 'blue')+
  geom_density(alpha=0.6, size = 2)+
  geom_vline(data=mu, aes(xintercept=mu4), color= "red",  # constant colour belongs outside aes()
             linetype="dashed", size = 1.5)+
  labs(title="Differences between imputed and original values",x="Proportional Difference", y = "Density")

Which results in the following plot:

histogram without coord_cartesian clause

To focus on the range where most of the values lie, i.e. from -1 to 2, I use the coord_cartesian function as follows:

ggplot(p %>% filter(orig != 0), aes(x= prdif) )+
  geom_histogram(aes(y=..density..), position="identity", alpha=0.4, fill = 'blue')+
  geom_density(alpha=0.6, size = 2)+
  geom_vline(data=mu, aes(xintercept=mu4), color= "red",  # constant colour belongs outside aes()
             linetype="dashed", size = 1.5)+
  labs(title="Differences between imputed and original values",x="Proportional Difference", y = "Density") +
  coord_cartesian(xlim = c(-1, 2))

Which results in this plot:

histogram with the coord_cartesian clause

I cannot understand why the plot is empty. According to the quantiles above, at least 80% of the values fall between -1 and 2.

Edit:

Following the comments below, I changed the code to filter out values of 2 or more and to increase the number of bins to 300. The code and output are the following:

ggplot(p %>% filter(orig != 0 & prdif < 2), aes(x= prdif) )+
  geom_histogram(aes(y=..density..), position="identity", alpha=0.4, fill = 'blue', bins = 300)+
  geom_density(alpha=0.6, size = 2, color = "yellow")+
  geom_vline(data=mu, aes(xintercept=mu4), color= "red" ,
             linetype="dashed", size = 1.5)+ 
  labs(title="Differences between imputed and original values",x="Proportional Difference", y = "Density") +
  coord_cartesian(xlim = c(-1, 2))

[image: resulting plot]

Again, the output looks odd to me. I would expect to see something like this:

[image: expected plot]

Your advice will be appreciated.

rf7
  • The plot is not empty. But the bin width is wider than 3.0 `prdif` units, probably because the number of bins defaults to 30 with `geom_histogram`. I think you just need to filter out values of `prdif` > 2 _before_ you plot. You might also try plotting `x=log(prdif)` because your data looks like it might be exponentially or log-normally distributed. – bdemarest Apr 15 '17 at 07:34
  • Something like `test_data = data.frame(x = rlnorm(n = 10000, sdlog = 4))` may work as a good sample data set to make the question more reproducible. – bdemarest Apr 15 '17 at 07:38
  • I usually get this kind of plot because of a few very large, far-away values. I use the 95th percentile to remove those extreme, low-density values. – gonzalez.ivan90 Apr 15 '17 at 15:59
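
The suggestions in the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the `test_data` stand-in proposed above rather than the asker's `p`, the seed and the object names (`default_width`, `visible`, `trimmed`, `p_filtered`, `p_trimmed`, `p_log`) are made up for the example, and `after_stat(density)` assumes ggplot2 3.3.0 or later (it is the newer spelling of `..density..`). The point is that `coord_cartesian()` only zooms: the bins are still computed over the full range of the data, so a handful of huge values makes each of the 30 default bins far wider than the visible window.

```r
library(ggplot2)

set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# Stand-in data from the comments (heavy right tail); NOT the asker's data
test_data <- data.frame(x = rlnorm(n = 10000, sdlog = 4))

# With 30 bins spread over the full range, each bin is roughly range / 30 wide,
# which here is far wider than the 3-unit window c(-1, 2)
default_width <- diff(range(test_data$x)) / 30

# Suggestion 1: filter BEFORE plotting, so bins are computed on the visible range
visible <- subset(test_data, x < 2)
p_filtered <- ggplot(visible, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 alpha = 0.4, fill = "blue")

# Suggestion 2: drop everything above the 95th percentile instead of a fixed cut-off
cutoff  <- quantile(test_data$x, probs = 0.95)
trimmed <- subset(test_data, x <= cutoff)
p_trimmed <- ggplot(trimmed, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 alpha = 0.4, fill = "blue")

# Suggestion 3: keep all rows but bin on a log scale (needs strictly positive x)
p_log <- ggplot(test_data, aes(x = x)) +
  geom_histogram(bins = 30, alpha = 0.4, fill = "blue") +
  scale_x_log10()

# print(p_filtered), print(p_trimmed) or print(p_log) draws the respective plot
```

Note that with heavy-tailed data like this, even the 95th-percentile cut-off can still leave a wide range, so a fixed cut-off may be needed as well. For the data in the question, `prdif` goes down to about -0.99, so a plain log would fail on the negative values; something like `log1p(prdif)` might work instead, but that is an assumption on my part and not something tested against the original data.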
