2

I'm trying to produce a weighted sum per factor level. I have four columns of data:

col1 = surface area
col 2 = dominant
col 3 = codominant
col 4 = sub

1    2    3     4
125  A    NA    NA
130  A    NA    B
150  C    B     NA
160  B    NA    NA
90   B    A     NA
180  C    A     B
  • If only column 2 is filled, the value gets the full amount of column 1.
  • If cols 2 and 3 are filled, the value in col 1 gets split in half.
  • If cols 2, 3 and 4 are filled, the value in col 1 gets split in three.
  • If col 2 and 4 are filled, the value in col 1 gets divided as 75/25.

So, for the above example output, my new dataframe would be:

1    2
A    326.9
B    331.4
C    134.4

I fiddled around with ifelse and came op with something like (for two columns for this example):

     df1 <- df %>% 
            mutate(weighted_dominant = ifelse(!is.na(dominant) & is.na(codominant), Surface_Area, 
            Surface_Area/2),
                   weighted_codominant = ifelse(!is.na(codominant), Surface_Area/2, NA )

Now i isolate the columns of intereset:

df2 <- df1 %>% select(dominant, weighted_dominant) %>% 
               group by (dominant) %>%
               summarise (sum = sum(weighted_dominant) 

also perform this for the codominant column, bind the rows of the two new dataframes and do the summarise function again.

This gets the job done, but also takes like 50 lines of code and is, in my opinion, not very clean.

My question: Are there better (tidyverse) ways to do this kind of weighted summarisation?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Stevestingray
  • 399
  • 2
  • 12
  • I'm not sure I understand the case of "If col 1 and 4 are filled, the value in col 1 gets divided as 75/25." What gets 75 and what gets 25? Isn't there only one value (col 4) in this situation? Could you clarify further? – Ben Nov 20 '20 at 15:43
  • Hi Ben, it was a sloppy mistake from my part. It should be col2 instead of col1. So of col 2 and 4 are filled, col 2 gets `col1 * 0.75` and col 4 gets `col1 * 0.25`. – Stevestingray Nov 20 '20 at 15:53
  • I also edited my previous code, using `!is.na()` instead of `== ""` – Stevestingray Nov 20 '20 at 15:54

2 Answers2

3

Is this what you're looking for:

# your data
df <- read.table(text = "
125  A    NA    NA
130  A    NA    B
150  C    B     NA
160  B    NA    NA
90   B    A     NA
180  C    A     B", header = FALSE)
names(df) <- c("surface_area", "dominant", "codominant", "sub")


# make a matrix out of the last 3 columns
m <- as.matrix(df[2:4])


# get a logical matrix of non-NA
x <- !is.na(m)


# calculate as follow:
# T  NA  NA  ->  1    0    0
# T   T  NA  ->  1/2  1/2  0
# T  NA   T  ->  1/2  0    1/2
# T   T   T  ->  1/3  1/3  1/3
x <- x * (1/rowSums(x))


# correct exception
# T  NA   T  ->  0.75  0  0.25
exception  <- c(dominant=0.5 , codominant=0, sub=0.5 )
correction <- c(dominant=0.75, codominant=0, sub=0.25)
x[apply(x, 1, identical, exception), ] <- correction


# multiply by surface_area
x <- x * df$surface_area


# sum by each letter
stack(tapply(x, m, sum))
#>   values ind
#> 1  327.5   A
#> 2  372.5   B
#> 3  135.0   C

The final numbers are not the same, did I understand wrong your explanation? It is not really clear to me.

Edo
  • 7,567
  • 2
  • 9
  • 19
  • This looks great! Thanks for the help, this did answer my question in a Base R way, but before i accept i also want to check out Ben his answer in detail. – Stevestingray Nov 20 '20 at 16:17
3

With tidyverse you could consider the following approach.

Include row numbers as a separate column, so you can make evaluations within each row. The pivot_longer will put your data into long format.

After grouping by row number, you can determine values for A, B, and C depending on which columns are missing. This assumes there is always a "dominant" column (otherwise, you could adjust the logic here).

Then, remove your NA, and total up the weighted values for A, B, and C.

df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = c(dominant, codominant, sub)) %>%
  group_by(rn) %>%
  mutate(weight = case_when(
    is.na(value[name == "codominant"]) & is.na(value[name == "sub"]) ~ as.numeric(Surface_Area),
    is.na(value[name == "codominant"]) & name == "dominant" ~ Surface_Area * .75,
    is.na(value[name == "codominant"]) & name == "sub" ~ Surface_Area * .25,
    is.na(value[name == "sub"]) ~ Surface_Area / 2,
    TRUE ~ Surface_Area / 3
  )) %>%
  drop_na() %>%
  group_by(value) %>%
  summarise(total = sum(weight))

Output

  value total
  <chr> <dbl>
1 A      328.
2 B      372.
3 C      135 
Ben
  • 28,684
  • 5
  • 23
  • 45
  • Thank you very much, this was the answer i was looking for. I need to tweek it a little since my data frame is a shapefile, but this will work. Thanks for showing me the tidy way! – Stevestingray Nov 20 '20 at 16:19