Let's say I have a data frame that looks like this:
library(dplyr)    # for %>%, group_by(), summarise(), mutate()
library(ggplot2)  # for the plots below
groups <- floor(runif(1000, min = 1, max = 5))
activity <- rep(c("A1", "A2", "A3", "A4"), times = 250)
endorsement <- floor(runif(1000, min = 0, max = 2))
value1 <- runif(1000, min = 1, max = 10)
area <- rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
df <- data.frame(groups, activity, endorsement, value1, area)
Printed:
> head(df)
  groups activity endorsement   value1 area
1      1       A1           0 7.443375    A
2      1       A2           0 4.342376    A
3      1       A3           0 4.810690    A
4      4       A4           0 3.494974    A
5      3       A1           1 6.442354    B
6      1       A2           0 9.794138    C
I want to calculate some descriptive statistics and create some bar charts, but if you look at the area variable, A is very well represented, whereas B and E are not. I'm not interested in the area variable itself, but the stats and plots will be driven by the areas that have high representation in the dataset. I need to weight the data, but I'm not sure of the correct way to do it in each of the following situations.
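To make the imbalance concrete, here is a small sketch that counts rows per area. It only rebuilds the area column from the example above; the name area_counts is just for illustration.

```r
# Rebuild the area column exactly as in the example data
area <- rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)

# Rows per area: A = 400, B = 100, C = 200, D = 200, E = 100
area_counts <- table(area)
area_counts

# Same thing as shares of the whole dataset
prop.table(area_counts)
```

So A contributes four times as many rows as B or E, which is the imbalance I would like to remove.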
Mean and SD
I'm calculating the mean and SD of value1 as follows:
df %>% group_by(groups) %>% summarise(mean=mean(value1), sd=sd(value1))
What's the correct way to calculate a weighted mean/SD to compensate for differences in sample size between areas (i.e. I want to give each area equal weight)?
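For what it's worth, this is roughly what I have in mind, as a minimal sketch and not necessarily the correct approach. It assumes a hypothetical weight column w equal to 1 divided by the number of rows in that area, and it uses one of several possible weighted-SD definitions (no small-sample correction). The seed is added only so the sketch is reproducible.

```r
library(dplyr)

set.seed(1)  # my addition, only for reproducibility
df <- data.frame(
  groups = floor(runif(1000, min = 1, max = 5)),
  value1 = runif(1000, min = 1, max = 10),
  area   = rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
)

# Hypothetical weight column `w`: inverse of the area's row count,
# so the weights inside every area sum to 1
df_w <- df %>%
  group_by(area) %>%
  mutate(w = 1 / n()) %>%
  ungroup()

weighted_stats <- df_w %>%
  group_by(groups) %>%
  summarise(
    mean_w = weighted.mean(value1, w),
    # weighted SD around the weighted mean; other definitions exist
    sd_w = sqrt(sum(w * (value1 - weighted.mean(value1, w))^2) / sum(w))
  )
weighted_stats
```

Note that these weights balance the areas over the whole dataset. If each area should count equally within every group, the weights would presumably need to be computed with group_by(groups, area) instead.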
Stacked bar chart
ggplot(df, aes(groups)) +
  geom_bar(aes(fill = activity), position = position_fill(reverse = FALSE))
The bars represent the proportions of how often each activity occurred in each group. Again, this is driven mostly by respondents from area A. Is there a way to balance this and calculate the proportions as if every area had equal representation?
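One thing I considered, as a sketch only: geom_bar() accepts a weight aesthetic, so the bars could sum a weight column instead of counting rows. The column w (1 divided by the area's row count), the names df_w and weighted_props, and the seed are my own assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # my addition, only for reproducibility
df <- data.frame(
  groups   = floor(runif(1000, min = 1, max = 5)),
  activity = rep(c("A1", "A2", "A3", "A4"), times = 250),
  area     = rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
)

# Hypothetical weight column `w`: 1 / (rows in that area)
df_w <- df %>% group_by(area) %>% mutate(w = 1 / n()) %>% ungroup()

# Weighted proportions as a table: sum the weights instead of counting rows
weighted_props <- df_w %>%
  count(groups, activity, wt = w, name = "w_sum") %>%
  group_by(groups) %>%
  mutate(prop = w_sum / sum(w_sum)) %>%
  ungroup()

# Same idea in the plot: each bar segment is the sum of `w`, rescaled to 1 per group
p <- ggplot(df_w, aes(groups)) +
  geom_bar(aes(fill = activity, weight = w), position = position_fill(reverse = FALSE))
p
```

The table and the plot should show the same proportions, which makes it easy to check the chart against numbers.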
Grouped means
ggplot(aes(x = activity, y = value1, fill = factor(groups)), data = df) +
  # fun.y is deprecated since ggplot2 3.3.0; fun is its replacement
  geom_bar(position = "dodge", stat = "summary", fun = "mean") +
  guides(fill = guide_legend(reverse = FALSE, title = "group"))
The bars represent the average of value1 for each group and activity combination. Again, these averages are weighted in favour of area A, and representation is not equal.
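A sketch of how I imagine the weighted version could look: compute the weighted means first, then plot the summary with geom_col(). As before, the weight column w (1 divided by the area's row count), the object names, and the seed are assumptions of mine.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # my addition, only for reproducibility
df <- data.frame(
  groups   = floor(runif(1000, min = 1, max = 5)),
  activity = rep(c("A1", "A2", "A3", "A4"), times = 250),
  value1   = runif(1000, min = 1, max = 10),
  area     = rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
)

# Hypothetical weight column `w`: 1 / (rows in that area)
df_w <- df %>% group_by(area) %>% mutate(w = 1 / n()) %>% ungroup()

# Weighted mean of value1 per group/activity combination
weighted_means <- df_w %>%
  group_by(groups, activity) %>%
  summarise(mean_w = weighted.mean(value1, w), .groups = "drop")

# Plot the precomputed means; geom_col() draws the values as given
p <- ggplot(weighted_means, aes(x = activity, y = mean_w, fill = factor(groups))) +
  geom_col(position = "dodge") +
  guides(fill = guide_legend(reverse = FALSE, title = "group"))
p
```

Summarising before plotting avoids relying on stat = "summary", which as far as I can tell has no obvious way to take the weights into account.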
Grouped count proportions
summary_df <- df %>%
  group_by(groups, activity) %>%
  summarise(n = n(), count = sum(endorsement)) %>%
  mutate(prop = (count / n) * 100)
ggplot(aes(x = activity, y = prop, fill = factor(groups)), data = summary_df) +
  geom_bar(width = 0.8, position = position_dodge(width = 0.8), stat = "identity") +
  guides(fill = guide_legend(reverse = FALSE, title = "group"))
For each group and activity combination, I'm counting the number of people who endorsed the item (responded 1) and calculating that count as a proportion of the total number of people in the subgroup.
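My guess at a weighted version of this proportion, as a sketch: replace the row counts with sums of a weight column, so that the proportion becomes sum(w * endorsement) / sum(w). The column w (1 divided by the area's row count), the name summary_w, and the seed are my own assumptions.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # my addition, only for reproducibility
df <- data.frame(
  groups      = floor(runif(1000, min = 1, max = 5)),
  activity    = rep(c("A1", "A2", "A3", "A4"), times = 250),
  endorsement = floor(runif(1000, min = 0, max = 2)),
  area        = rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
)

# Hypothetical weight column `w`: 1 / (rows in that area)
df_w <- df %>% group_by(area) %>% mutate(w = 1 / n()) %>% ungroup()

# Weighted percentage of endorsements per group/activity combination
summary_w <- df_w %>%
  group_by(groups, activity) %>%
  summarise(prop = 100 * sum(w * endorsement) / sum(w), .groups = "drop")

p <- ggplot(summary_w, aes(x = activity, y = prop, fill = factor(groups))) +
  geom_col(width = 0.8, position = position_dodge(width = 0.8)) +
  guides(fill = guide_legend(reverse = FALSE, title = "group"))
p
```

With all weights equal this reduces to the unweighted count / n calculation above, which is at least a sanity check on the formula.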
The four cases above all stem from the same underlying issue, and all need to be weighted by area to create equal representation. However, the visualizations are created differently and show different things (means, stacked bars, grouped means, count proportions), and I'm not sure of the correct way to account for sample size differences in each case. Is there a single fix that will propagate to each of the graph examples?