R - Weighting by group size in bar charts

Question

Let's say I have a data frame that looks like this:

groups <- floor(runif(1000, min=1, max=5))
activity <- rep(c("A1", "A2", "A3", "A4"), times= 250)
endorsement <- floor(runif(1000, min=0, max=2))
value1 <- runif(1000, min=1, max=10)
area <- rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)

df <- data.frame(groups, activity, endorsement, value1, area)

printed:

> head(df)
  groups activity endorsement   value1 area
1      1       A1           0 7.443375    A
2      1       A2           0 4.342376    A
3      1       A3           0 4.810690    A
4      4       A4           0 3.494974    A
5      3       A1           1 6.442354    B
6      1       A2           0 9.794138    C

I want to calculate some descriptive statistics and create some bar charts, but if you look at the area variable, A is very well represented, whereas B and E are not.

I'm not interested in the area variable itself, but the stats/plots will be driven by the areas that have high representation in the dataset, so I need to weight the data. I'm not sure of the correct way to do it in the following situations:

Mean and SD

I'm calculating the mean and SD of value1 as follows:

df %>% group_by(groups) %>% summarise(mean=mean(value1), sd=sd(value1))

What's the correct way to calculate a weighted mean/SD to compensate for differences in sample size for each area (i.e. I want to give each area equal weight)?

Stacked bar chart

ggplot(df, aes(groups)) +
  geom_bar(aes(fill = activity), position = position_fill(reverse = F))

The bars represent the proportions of how often each activity occurred in each group. Again, this is driven mostly by respondents from area A. Is there a way to balance this and calculate the proportions as if each area had equal representation?

Grouped means

ggplot(aes(x = activity, y = value1, fill=factor(groups)), data=df) +
  geom_bar(position="dodge", stat="summary", fun="mean")+
  guides(fill = guide_legend(reverse=F, title="group"))

The bars represent the average of value1 for each group and activity combination. Again, these averages are weighted in favour of area A, and representation is not equal.

Grouped count proportions

summary_df <- df %>% group_by(groups, activity) %>%
    summarise(n=n(), count=sum(endorsement)) %>% mutate(prop=(count/n)*100)

ggplot(aes(x = activity, y = prop, fill = factor(groups)), data=summary_df) +
  geom_bar(width=0.8, position = position_dodge(width=0.8), stat="identity") +
  guides(fill = guide_legend(reverse=F, title="group"))

For each group and activity combination, I'm counting the number of people who endorsed the item (responded 1) and calculating that count as a proportion of the total people in the subgroup.

The four problems above all stem from the same issue, and all need to be weighted by area to create equal representation. However, the visualizations are all created differently and show different things (means, stacked bars, grouped means, count proportions), and I'm not sure of the correct way to account for sample size differences in each case. Is there a single fix that will propagate to each of the graph examples?

mtoto · Accepted Answer · 2017-03-24T21:36:42.570

One strategy would be to down- or up-sample your dataframe so that each area has the same number of observations. We can use the convenience functions downSample() or upSample() from the caret package, which according to the documentation:

"Simple random sampling is used to down-sample for the majority class(es). Note that the minority class data are left intact..."

To illustrate:

library(dplyr)
library(caret)
# Before
df %>% group_by(area) %>% summarise(n())
#   area `n()`
#1      A   400
#2      B   100
#3      C   200
#4      D   200
#5      E   100

# After
set.seed(123)
# downSample() needs a factor; data.frame() keeps area as character in R >= 4.0
test_down <- downSample(df, factor(df$area))
test_down %>% group_by(area) %>% summarise(n())
#    area `n()`
#1      A   100
#2      B   100
#3      C   100
#4      D   100
#5      E   100

# upSample() also needs a factor as the class variable
test_up <- upSample(df, factor(df$area))
test_up %>% group_by(area) %>% summarise(n())
#    area `n()`
#1      A   400
#2      B   400
#3      C   400
#4      D   400
#5      E   400

So then your first graph becomes:

library(ggplot2)
ggplot(test_down, aes(groups)) +
        geom_bar(aes(fill = activity), 
                 position = position_fill(reverse = F))

Note that because we use random sampling, we have no control over which observations get omitted when using downSample(). Hence, the results might look slightly different at each run without set.seed().
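The same balanced data frame can feed the other summaries from the question. The following is a minimal sketch, assuming the example data from the question and the down-sampling approach shown above; the names balanced_stats and balanced_props are only illustrative.

```r
library(dplyr)
library(caret)

set.seed(123)

# Rebuild the example data from the question
df <- data.frame(
  groups      = floor(runif(1000, min = 1, max = 5)),
  activity    = rep(c("A1", "A2", "A3", "A4"), times = 250),
  endorsement = floor(runif(1000, min = 0, max = 2)),
  value1      = runif(1000, min = 1, max = 10),
  area        = rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
)

# Down-sample so every area contributes the same number of rows
# (the class variable is passed as a factor)
test_down <- downSample(df, factor(df$area))

# Mean and SD of value1 per group, computed on the balanced data
balanced_stats <- test_down %>%
  group_by(groups) %>%
  summarise(mean = mean(value1), sd = sd(value1))

# Endorsement percentage per group/activity combination on the balanced data
balanced_props <- test_down %>%
  group_by(groups, activity) %>%
  summarise(n = n(), count = sum(endorsement), .groups = "drop") %>%
  mutate(prop = count / n * 100)
```

Both summaries can then be passed to the ggplot() calls from the question in place of df and summary_df.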

thanks. Is there another option besides downsampling? Because the distribution of `area` is very unbalanced - some areas have almost 100x more data than others, so downsampling means we'd lose quite a lot from the overrepresented areas — Simon, Mar 24 '17 at 21:27
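One possible alternative that keeps every row is to weight each observation by the inverse of its area's sample size. This is not part of the accepted answer; it is a sketch under the assumption that "equal weight per area" means each area's weights should sum to the same total. The column w and the objects df_w and weighted_stats are hypothetical names, and the SD shown is the uncorrected weighted (population-style) version.

```r
library(dplyr)
library(ggplot2)

set.seed(123)

# Rebuild the example data from the question
df <- data.frame(
  groups      = floor(runif(1000, min = 1, max = 5)),
  activity    = rep(c("A1", "A2", "A3", "A4"), times = 250),
  endorsement = floor(runif(1000, min = 0, max = 2)),
  value1      = runif(1000, min = 1, max = 10),
  area        = rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
)

# Hypothetical weight column: 1 / (rows in the area), so each area's weights sum to 1
df_w <- df %>%
  group_by(area) %>%
  mutate(w = 1 / n()) %>%
  ungroup()

# Weighted mean and (uncorrected) weighted SD of value1 per group
weighted_stats <- df_w %>%
  group_by(groups) %>%
  summarise(
    mean = weighted.mean(value1, w),
    sd   = sqrt(sum(w * (value1 - weighted.mean(value1, w))^2) / sum(w))
  )

# Stacked proportions: geom_bar() sums the weight aesthetic instead of counting rows
p <- ggplot(df_w, aes(groups)) +
  geom_bar(aes(fill = activity, weight = w),
           position = position_fill(reverse = FALSE))
```

With very unbalanced areas, the small areas receive large weights, so their noise is amplified in the weighted summaries; that is the trade-off against discarding rows by down-sampling.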
