Automatically assign weights to variables based on factor level

Question

I am having some trouble wording my issue, so I am using the mtcars dataset as an example.

Imagine I am a student of social sciences in the Pixar Cars(TM) universe. For a small school project on statistical methods, I am running a survey among my peers. My target is to collect data on a sample of 30 cars, half of them automatic and the other half manual. After my online survey is closed, and I have cleaned up my data, it looks like the mtcars dataset.

data(mtcars)
str(mtcars)
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual") # because anthropomorphic cars prefer factors with levels over binary code

If I use table(mtcars$am), I find out that there were 19 automatic and 13 manual transmission cars in the dataset. Looks like I didn't hit the target of an equal number of manual and automatic cars :(! Luckily, as a car-sociologist, I can fix this by weighting my dataset. I divide the target count by the collected count to get the weight of each observation. Thus, all automatic cars should get a weight of 0.7895 (15/19) and manual cars a weight of 1.1538 (15/13). Assigning the correct weight to each observation is fairly straightforward:

mtcars$weight <- ifelse(mtcars$am == "automatic", 0.7894737, 1.153846)

You can imagine that this method becomes a bit cumbersome with larger datasets with more weight-categories. Is there a way to automate the process of assigning the weights to each observation?

As a car and self-taught R user who mainly cobbles things together as needed, I don't really know where to start. I've been using the method above, but with a growing number of target groups it's not really sustainable anymore.

I did of course try to find the answer elsewhere on the web, but unfortunately without much success. The following question seemed promising, but doesn't provide a solution for me:

R: new variable values based on factor levels of another variable

Seth · Answer 1 · 2023-03-29T15:44:03.853

In this example, we add a count of each transmission group, then create a weight variable by dividing your expected group size (half the total) by the observed group size (the count).

library(dplyr)

mtcars %>%
  mutate(am = factor(am, labels = c('automatic','manual'))) %>%
  add_count(am) %>%
  mutate(weight = (n()/2)/n)
#>     mpg cyl  disp  hp drat    wt  qsec vs        am gear carb  n    weight
#> 1  21.0   6 160.0 110 3.90 2.620 16.46  0    manual    4    4 13 1.2307692
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0    manual    4    4 13 1.2307692
#> 3  22.8   4 108.0  93 3.85 2.320 18.61  1    manual    4    1 13 1.2307692
#> 4  21.4   6 258.0 110 3.08 3.215 19.44  1 automatic    3    1 19 0.8421053
#> 5  18.7   8 360.0 175 3.15 3.440 17.02  0 automatic    3    2 19 0.8421053
#> 6  18.1   6 225.0 105 2.76 3.460 20.22  1 automatic    3    1 19 0.8421053
#> 7  14.3   8 360.0 245 3.21 3.570 15.84  0 automatic    3    4 19 0.8421053
#> 8  24.4   4 146.7  62 3.69 3.190 20.00  1 automatic    4    2 19 0.8421053
#> 9  22.8   4 140.8  95 3.92 3.150 22.90  1 automatic    4    2 19 0.8421053
#> 10 19.2   6 167.6 123 3.92 3.440 18.30  1 automatic    4    4 19 0.8421053
#> 11 17.8   6 167.6 123 3.92 3.440 18.90  1 automatic    4    4 19 0.8421053
#> 12 16.4   8 275.8 180 3.07 4.070 17.40  0 automatic    3    3 19 0.8421053
#> 13 17.3   8 275.8 180 3.07 3.730 17.60  0 automatic    3    3 19 0.8421053
#> 14 15.2   8 275.8 180 3.07 3.780 18.00  0 automatic    3    3 19 0.8421053
#> 15 10.4   8 472.0 205 2.93 5.250 17.98  0 automatic    3    4 19 0.8421053
#> 16 10.4   8 460.0 215 3.00 5.424 17.82  0 automatic    3    4 19 0.8421053
#> 17 14.7   8 440.0 230 3.23 5.345 17.42  0 automatic    3    4 19 0.8421053
#> 18 32.4   4  78.7  66 4.08 2.200 19.47  1    manual    4    1 13 1.2307692
#> 19 30.4   4  75.7  52 4.93 1.615 18.52  1    manual    4    2 13 1.2307692
#> 20 33.9   4  71.1  65 4.22 1.835 19.90  1    manual    4    1 13 1.2307692
#> 21 21.5   4 120.1  97 3.70 2.465 20.01  1 automatic    3    1 19 0.8421053
#> 22 15.5   8 318.0 150 2.76 3.520 16.87  0 automatic    3    2 19 0.8421053
#> 23 15.2   8 304.0 150 3.15 3.435 17.30  0 automatic    3    2 19 0.8421053
#> 24 13.3   8 350.0 245 3.73 3.840 15.41  0 automatic    3    4 19 0.8421053
#> 25 19.2   8 400.0 175 3.08 3.845 17.05  0 automatic    3    2 19 0.8421053
#> 26 27.3   4  79.0  66 4.08 1.935 18.90  1    manual    4    1 13 1.2307692
#> 27 26.0   4 120.3  91 4.43 2.140 16.70  0    manual    5    2 13 1.2307692
#> 28 30.4   4  95.1 113 3.77 1.513 16.90  1    manual    5    2 13 1.2307692
#> 29 15.8   8 351.0 264 4.22 3.170 14.50  0    manual    5    4 13 1.2307692
#> 30 19.7   6 145.0 175 3.62 2.770 15.50  0    manual    5    6 13 1.2307692
#> 31 15.0   8 301.0 335 3.54 3.570 14.60  0    manual    5    8 13 1.2307692
#> 32 21.4   4 121.0 109 4.11 2.780 18.60  1    manual    4    2 13 1.2307692

Created on 2023-03-29 with reprex v2.0.2
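
The n()/2 in the pipeline above is tied to exactly two groups. As a rough base R sketch of the same idea, assuming every level should get an equal share of the total, the divisor can be taken from the number of factor levels instead. The names d, group_n and target_n are only illustrative and not part of the answer above.

```r
# Sketch: equal-share weights for any number of factor levels (base R only).
d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))

group_n  <- table(d$am)                 # observed count per level
target_n <- nrow(d) / nlevels(d$am)     # assumed target: equal count per level

# Look up each row's group count by level name, then divide target by observed
d$weight <- as.numeric(target_n / group_n[as.character(d$am)])

# Sum of weights per level; each should equal target_n
tapply(d$weight, d$am, sum)
```

With unequal targets this shortcut no longer applies, and a lookup of target proportions per level is needed instead.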

jay.sf · Accepted Answer · 2023-03-30T03:52:58.770

Generally, you have sample proportions that exceed or fall short of the expected population proportions. So you want to weight the sample proportions to bring them in line with population proportions. You can get the weights by dividing the population proportions by the sample proportions.

Let's demonstrate this by the number of carburetors provided in mtcars. Say the known/expected proportion is:

carb_pop <- c(.25, .28, .1, .28, .05, .04) |> setNames(c(1:4, 6, 8))
carb_pop
#    1    2    3    4    6    8 
# 0.25 0.28 0.10 0.28 0.05 0.04

However, in the sample we have:

carb_smp <- table(mtcars$carb)
proportions(carb_smp)
#       1       2       3       4       6       8 
# 0.21875 0.31250 0.09375 0.31250 0.03125 0.03125

Now we can create a named vector w with weights:

w <- carb_pop/proportions(carb_smp)
w
#        1        2        3        4        6        8 
# 1.142857 0.896000 1.066667 0.896000 1.600000 1.280000

that brings the proportions in line,

all(carb_pop == w*proportions(carb_smp))
# [1] TRUE
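
Comparing doubles with == can be fragile, since products and quotients may differ in the last bits. A possible, more tolerant variant of the same check uses all.equal(). This is a sketch that rebuilds the objects so it runs on its own, and the name ok is only illustrative.

```r
# Rebuild the objects from this answer so the snippet is self-contained
carb_pop <- setNames(c(.25, .28, .1, .28, .05, .04), c(1:4, 6, 8))
carb_smp <- table(mtcars$carb)
w <- carb_pop / proportions(carb_smp)

# all.equal() compares within a small numeric tolerance instead of bit-exactly;
# as.numeric()/unname() drop the table and names attributes before comparing
ok <- isTRUE(all.equal(as.numeric(w * proportions(carb_smp)), unname(carb_pop)))
ok
```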

We can now use the named vector to assign a weight to each observation, with a match approach similar to the one in your linked question.

mtcars$weights <- w[match(mtcars$carb, names(w))]

Gives

head(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  weights
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.896000
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.896000
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1.142857
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1.142857
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.896000
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1.142857
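
To apply the same approach to the am example from the question, the steps could be wrapped in a small function. The sketch below is one possible way to do that; make_weights is a hypothetical helper and not from any package, and the 50/50 target is taken from the question.

```r
# Hypothetical helper: weight per observation = target proportion of its level
# divided by the sample proportion of its level.
make_weights <- function(x, target) {
  smp <- proportions(table(x))
  # every observed level needs a target, and vice versa
  stopifnot(setequal(names(target), names(smp)))
  w <- target[names(smp)] / as.numeric(smp)
  unname(w[as.character(x)])
}

d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))
d$weight <- make_weights(d$am, c(automatic = 0.5, manual = 0.5))

# Weighted share per level; these should reproduce the targets
proportions(xtabs(weight ~ am, data = d))

# The weights can then be passed on, e.g. to a weighted mean
weighted.mean(d$mpg, d$weight)
```

The same call should work for the carburetor example as make_weights(mtcars$carb, carb_pop). Note that an unused factor level would produce a zero sample proportion and therefore an infinite weight, which this sketch does not guard against.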
