
I want to plot a histogram of the counts of a variable with ggplot. However, I want each bar to show the relative fraction of a second (categorical) variable.

For example, the four variables always sum to 1. I want to plot a histogram based on the count variable.

library(reshape)
library(ggplot2)

values = replicate(4, diff(c(0, sort(runif(92)), 1)))
colnames(values) = c("A","B","C","D")
counts = sample(1:100, 93, replace=TRUE)
df = data.frame(cbind(values,"count"=counts))
mdf = melt(df,id="count")

ggplot(mdf, aes(count,fill=variable)) +
  geom_histogram(alpha=0.3,
    position="identity",lwd=0.2,binwidth=5,boundary=0)

I want each bar of the histogram to be coloured according to the relative fractions of the columns (A, B, C, D), so each bin should show all four categories.

Elin
user3978632
  • @ Jimbou I used library "reshape" and "ggplot2". – user3978632 Apr 23 '18 at 13:21
  • Add `+ facet_grid(~variable)` to your plot. Then you will see that your code is working but all bars have the same height. – Roman Apr 23 '18 at 13:22
  • I don't think what you want is a histogram. You want a stacked bar chart. If you search for that you will find lots of answers. I had to read this multiple times to understand what you were asking, which does not match the title at all. – Elin Apr 23 '18 at 13:23
  • @Elin I need a histogram, not a bar plot. I just need each bin of the histogram coloured in 4 different colours based on the relative values of the columns (A,B,C,D). Is it clear now? – user3978632 Apr 23 '18 at 13:30
  • No you really don't. Histograms are for displaying the probability distributions of single quantitative (continuous) variables. Bar charts are for discrete (including categorical) variables displaying statistics (e.g. counts). – Elin Apr 23 '18 at 13:32
  • You need to formulate your question in a better way, this is just too vague. – Amir Apr 23 '18 at 13:38
  • @Elin I got your point, but I don't think I made myself clear. I need something like this [link](https://www.google.nl/search?biw=2133&bih=1000&tbm=isch&sa=1&ei=KFbbWsbhK43SwAKI-ZfgCg&q=histogram+categorical+data+in+R&oq=histogram+categorical+data+in+R&gs_l=psy-ab.3..0i24k1.1388.1986.0.2130.5.2.0.3.3.0.63.109.2.2.0....0...1c.1.64.psy-ab..0.5.145....0.lRpBlm_d0sY#imgrc=4JFWI7mXmS2XCM:), but there they used categorical variables (as a factor) for each bin; here I want to plot them based on the columns of df. – user3978632 Apr 23 '18 at 13:39
  • You have `position = "identity"` in your `geom_histogram`, which means the bars are placed over each other and you're only seeing one color. Remove that bit. That's also how they did it in the example you linked to, so I'm not sure why you added `position = "identity"` – camille Apr 23 '18 at 13:45
  • The x axis is count and the total height is the sum of the original 4 columns, is that correct? – Elin Apr 23 '18 at 14:03
  • @ Elin yes. x axis is the count and y is the sum of 4 variables. – user3978632 Apr 23 '18 at 14:12
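
A minimal sketch of what camille describes, assuming the question's data with an arbitrary seed and a base-R reshape standing in for `melt` (both are assumptions, not the asker's exact code). Dropping `position = "identity"` falls back to the default stacking, but because every row appears once per variable, each bar splits into four equal segments:

```r
library(ggplot2)

set.seed(1)  # assumed seed, only for reproducibility
values <- replicate(4, diff(c(0, sort(runif(92)), 1)))
colnames(values) <- c("A", "B", "C", "D")
counts <- sample(1:100, 93, replace = TRUE)

# Long format: one row per (observation, variable) pair
mdf <- data.frame(
  count    = rep(counts, times = 4),
  variable = rep(colnames(values), each = nrow(values)),
  value    = as.vector(values)
)

# No position argument, so geom_histogram uses its default, "stack"
p_stack <- ggplot(mdf, aes(count, fill = variable)) +
  geom_histogram(binwidth = 5, boundary = 0)
```

Printing `p_stack` shows all four colours, but the split carries no information about `value`, which is the point raised later in the thread.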

2 Answers


I think this is what you want (I used dplyr package as well):

library(reshape2)
library(ggplot2)
library(dplyr)

set.seed(2)
values = replicate(4, diff(c(0, sort(runif(92)), 1)))
colnames(values) = c("A","B","C","D")
counts = sample(1:100, 93, replace=TRUE)
df = data.frame(cbind(values,"count"=counts))
mdf = melt(df,id="count")

# share of each variable within each 5-wide bin of count
mdf = mdf %>%
  mutate(binCounts = cut(count, breaks = seq(0, 100, by = 5))) %>%
  group_by(binCounts) %>%
  mutate(sumVal = sum(value)) %>%   # total of value over the whole bin
  ungroup() %>%
  group_by(binCounts, variable) %>%
  summarise(prct = sum(value)/mean(sumVal))

# 100% stacked bars: one bar per bin, split by variable
p = ggplot(mdf) +
  geom_bar(aes(x=binCounts, y=prct, fill=variable), stat="identity") +
  theme(axis.text.x=element_text(angle = 90, hjust=1))

print(p)


LetEpsilonBeLessThanZero
  • @ LetEpsilonBeLessThanZero I need a similar graph, but in your graph each bin is divided equally, i.e. 0.25 (as you are using the count). I want the count on the Y axis, but I want each bin to be divided based on the sum of A, B, C and D that falls under the binCount category. – user3978632 Apr 23 '18 at 14:36
  • When you say "sum of A B C and D", do you mean the sum of the "value" column? Or the sum of the "count" column? – LetEpsilonBeLessThanZero Apr 23 '18 at 14:40
  • Yes, the sum of the value column. For example, for the first bar in your plot the value is around 8, and let's say A = 0.25, B = 0.65, C = 0.05, D = 0.05; then I want the 4 colours to be in those proportions. 65% (blue colour) should be C and so on. – user3978632 Apr 23 '18 at 14:44
  • Hmm, okay, I think I understand now. I've edited my post. If I've understood you correctly, then I feel I have to say that this isn't a histogram. It is a 100% stacked bar chart. I think it also would've helped us get your answer sooner if you had specified that the y-axis is concerned about the "value" column. You never said that. – LetEpsilonBeLessThanZero Apr 23 '18 at 15:12
  • @ LetEpsilonBeLessThanZero Thanks for your help and time. This is not what I wanted, though I did not make myself clear. I want something like this [link](https://stackoverflow.com/questions/49984624/colour-bins-of-histogram-in-r). I want the Y axis to be the count for each bin, but the colours should be in proportion to the sum of the variables (A,B,C,D). – user3978632 Apr 23 '18 at 15:29
  • Okay, bro, you're on your own. You're making no sense. You keep repeating "sum of the variable(A,B,C,D)" but you can't sum categorical variables. You have two numeric variables in your dataframe and they are "count" and "value". You can't be wanting to sum "count", because every bin has equal number of A's, B's, C's, and D's. Any graph plotting anything related to counts on the y-axis would be pointless for this reason. That means you have to be wanting to sum the "value" column, and you said you wanted each bin to be proportional such that it summed to 1. That's exactly what I provided you. – LetEpsilonBeLessThanZero Apr 23 '18 at 15:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/169614/discussion-between-user3978632-and-letepsilonbelessthanzero). – user3978632 Apr 23 '18 at 18:16
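
For comparison, the 100% stacked chart from this answer can also be sketched with a weighted histogram. This is an alternative under assumptions, not the answerer's code: the data is rebuilt in base R, and `weight = value` with `position = "fill"` replaces the dplyr summary.

```r
library(ggplot2)

set.seed(2)
values <- replicate(4, diff(c(0, sort(runif(92)), 1)))
colnames(values) <- c("A", "B", "C", "D")
counts <- sample(1:100, 93, replace = TRUE)

# Long format built by hand instead of melt()
mdf <- data.frame(
  count    = rep(counts, times = 4),
  variable = rep(colnames(values), each = nrow(values)),
  value    = as.vector(values)
)

# weight = value sums `value` per bin and variable;
# position = "fill" rescales each bin's stack to a total height of 1
p_fill <- ggplot(mdf, aes(count, weight = value, fill = variable)) +
  geom_histogram(binwidth = 5, boundary = 0, position = "fill")
```

Bins that contain no observations have nothing to rescale, so ggplot2 may warn about removed rows for them.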

I found the answer with the help of others in this post. I wanted each bar of the plot to show the fractions of the variables (A, B, C, D). The code is not elegant, but it might be helpful for someone!

library(reshape2)
library(ggplot2)
library(dplyr)

set.seed(2)

## generate random values that sum to 1 within each row
values <- matrix(runif(100*4),nrow=100)
values <- values/rowSums(values)
colnames(values) = c("A","B","C","D")
counts = sample(1:100, 100, replace=TRUE)

## frequency of the data in bins of width 5
binCounts = hist(counts,breaks=seq(0, 100, by = 5),plot=FALSE)$counts

## create a dataframe
df = data.frame(cbind(values,"count"=counts))

## column sums of A-D over all rows with count below each break
## (note: this is cumulative, not restricted to the rows inside each bin)
breaks = seq(5, 100, by = 5)
newdf = do.call("rbind",lapply(breaks, function(x) colSums(df[df$count < x, 1:4])))
## turn each row into proportions, then scale by the bin frequency
newdf = melt(sweep(newdf, 1, rowSums(newdf), FUN="/") * binCounts)
colnames(newdf) = c("bins","variable","value")
ggplot(newdf) +
  geom_bar(aes(x=bins, y=value, fill=variable), stat="identity") +
  theme(axis.text.x=element_text(angle = 90, hjust=1))
user3978632
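
A shorter route to a similar picture, sketched under the assumption that the A-D values of each row sum to 1 (as in the data generated above): pass `value` as the `weight` aesthetic of a stacked histogram. Each bar's total height is then the number of observations in the bin, and its segments are the summed shares of the rows inside that bin only. The long data frame is built in base R here; the variable names are illustrative.

```r
library(ggplot2)

set.seed(2)
# Row-normalised shares: A + B + C + D = 1 for every row
values <- matrix(runif(100 * 4), nrow = 100)
values <- values / rowSums(values)
colnames(values) <- c("A", "B", "C", "D")
counts <- sample(1:100, 100, replace = TRUE)

# Long format: one row per (observation, variable) pair
long <- data.frame(
  count    = rep(counts, times = 4),
  variable = rep(colnames(values), each = nrow(values)),
  value    = as.vector(values)
)

# Default stacking; the weights of one observation add up to 1,
# so each stacked bar is as tall as the plain histogram of counts
p_weighted <- ggplot(long, aes(count, weight = value, fill = variable)) +
  geom_histogram(binwidth = 5, boundary = 0)
```

Unlike the `count < x` selection above, this uses only the rows that fall inside each bin, so the two plots will not match exactly.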