0

I have a data frame with observations of 6 different variables and a sample weight for each one. I am trying to use ddply with summarize to create a new data frame, in which one of the six variables becomes the first column and the others are the weighted sums of the others.

For example, my data frame looks something like this:

Location  A   B   C   D   SampleWeight
x1        2   1   4   3   .1
x1        4   3   4   2   .3
x2        2   3   7   3   .7
x2        4   3   4   3   .8
.....

I want to create a data frame that has only one value for each Location, and then sums the SampleWeights for each condition in each column as well as the total. It would look like this:

Location   Total   A2   A4   B1   B3    C4   C7  ...
x1         0.4     .1   .3   .1   .3    .4   0.0
x2         1.5     .7   .8   0.0   1.5   .8   .7

This is how I was trying to go about it, and I have gotten the total column to work, but I am not sure how to make a filter such that sum will only look at the rows with a particular value in the other column. I have included some of the things I have tried:

newdf <- ddply(mydf, ~ `Location`, summarize, 
                total = sum(`SampleWeight`),
                A2 = sum(within(`SampleWeight`, A == "2")),
                A4 = sum(filter(mydf$SampleWeight, A == "4")),
                B1 = sum((mydf$B=="1")$sample_weight)
                ...
                )

Are there easier ways to go about doing this? Is there a good way to make this kind of filer?

Thank you so much!!

1 Answers1

0

Here's a solution using dplyr and tidyr. Generate data:

library(dplyr)
library(tidyr)

df <- tibble(location     = c("x1", "x1", "x2", "x2"),
             A            = c(2, 4, 2, 4),
             B            = c(1, 3, 3, 3),
             C            = c(4, 4, 7, 4),
             D            = c(3, 2, 3, 3),
             SampleWeight = c(.1, .3, .7, .8))

Then:

res <- df %>%
  group_by(location) %>%
  mutate(total = sum(SampleWeight)) %>%
  gather(key = "letter", value = "number", A, B, C, D) %>%
  mutate(subgroup = paste0(letter, number)) %>%
  group_by(location, subgroup, total) %>%
  summarize(subgroup_total = sum(SampleWeight)) %>%
  spread(key = subgroup, value = subgroup_total) %>%
  mutate_all(function (x) ifelse(is.na(x), 0, x))

This can be broken down into four parts:

  1. group_by location and calculate the total SampleWeight
  2. gather the data into long format, and concatenate using paste0 to construct the desired subgroups (which will become columns)
  3. group_by subgroup and calculate the sum of SampleWeight, then spread the data to wide format
  4. replace NA with 0

The result:

res
  location total    A2    A4    B1    B3    C4    C7    D2    D3
1       x1   0.4   0.1   0.3   0.1   0.3   0.4   0.0   0.3   0.1
2       x2   1.5   0.7   0.8   0.0   1.5   0.8   0.7   0.0   1.5
George Wood
  • 1,914
  • 17
  • 18