2

I have a data frame that looks like this:

x   y   value   weight
10  1   red 1
3.4 5   blue    2
10  10  blue    0.5
3   8   blue    0.5
3   8   red 4
10  1   blue    1
3   8   blue    2
3   8   blue    0.25

I want to recast it so that each row is a unique combination of "x" and "y", and the columns are the sums of the "weight"s for each different value of "value". If possible, I'd also like to have columns for raw counts of the values of "value". So for these data that would be:

x   y   red_count   blue_count  red_sum_of_weights  blue_sum_of_weights
10  1   1   1   1   1
3.4 5   0   1   0   2
10  10  0   0   1   0.5
3   8   1   3   4   2.75

Is there a way to do this with reshape or reshape2? I can get it to do counts of the values with

dcast(data,x+y~value)

But I haven't been able to figure out how to get it to use the weights in the way I want it to. I need this to work with an arbitrary number of possible levels for value and an arbitrary different number of rows per x*y combination in the original dataset. I've written my own code just using for loops to do this, but it takes an extremely long time to run - so far has taken 6 hours to do the first 15% of a 600k row dataset, which isn't very practical! But I'm sure there must be a way to do this with existing functions?

Thanks very much in advance for any help!

tzirtzi
  • 75
  • 11
  • For counts, add `fun.aggregate=length` (although that's the default, so it should do that anyway if you have more than one value in at least one cell). – eipi10 Jul 21 '16 at 15:03
  • Thanks very much for your comment! Yes, I realise that's the default, which is why, as I mentioned, I've been able to use dcast(data,x+y~value) to get raw counts. What I can't work out is how to get the sums of values from "weight" for each distinct value of "value". – tzirtzi Jul 21 '16 at 15:21
  • If you were just using `dcast`, it would be `fun.aggregate=sum`, but in your case you need a combination of split-apply-combine functions. See my answer for details. – eipi10 Jul 21 '16 at 15:26

4 Answers4

2

You can do this with a combination of dplyr and reshape2 functions. First, group by x, y and value (we change the name of the latter to color only to avoid duplicate column names after melting) and then calculate the count and sum for each sub-group. Then meltthe result to put the newly calculated summaries in "long" format. Finally, dcast to get the "wide" format you asked for.

library(reshape2)
library(dplyr)

df %>% group_by(x,y,color=value) %>% 
  summarise(count=n(), sum=sum(weight)) %>%
  melt(id.var=c("x","y","color")) %>%
  dcast(x + y ~ variable + color)
     x  y count_blue count_red sum_blue sum_red
1  3.0  8          3         1     2.75       4
2  3.4  5          1        NA     2.00      NA
3 10.0  1          1         1     1.00       1
4 10.0 10          1        NA     0.50      NA
eipi10
  • 91,525
  • 24
  • 209
  • 285
2

Another option:

df %>% 
  group_by(x, y, value) %>% 
  summarise(count = n(), sum = sum(weight)) %>%
  gather(key, val, -(x:value)) %>%
  unite(newkey, value, key) %>%
  spread(newkey, val)

Which gives:

#Source: local data frame [4 x 6]
#Groups: x, y [4]
#
#      x     y blue_count blue_sum red_count red_sum
#* <dbl> <int>      <dbl>    <dbl>     <dbl>   <dbl>
#1   3.0     8          3     2.75         1       4
#2   3.4     5          1     2.00        NA      NA
#3  10.0     1          1     1.00         1       1
#4  10.0    10          1     0.50        NA      NA
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
1

Here's a straightforward solution using data.table::dcast:

require(data.table)
dcast(dt, x + y ~ value, value.var = "weight", fun.aggregate = list(length, sum))
#       x  y weight_length_blue weight_length_red weight_sum_blue weight_sum_red
# 1:  3.0  8                  3                 1            2.75              4
# 2:  3.4  5                  1                 0            2.00              0
# 3: 10.0  1                  1                 1            1.00              1
# 4: 10.0 10                  1                 0            0.50              0

where,

dt = fread('x   y   value   weight
           10  1   red 1
           3.4 5   blue    2
           10  10  blue    0.5
           3   8   blue    0.5
           3   8   red 4
           10  1   blue    1
           3   8   blue    2
           3   8   blue    0.25
           ')
Arun
  • 116,683
  • 26
  • 284
  • 387
0

Another approach with the data table :

require(data.table)

count=dcast(df,x+y~paste(value,"_count",sep=""))
weights=dcast(df,x+y~paste(value,"_sum_of_weights",sep=""),value.var = "weight",fun.aggregate = sum)
result=merge(count,weights,by=c("x","y"))

Result:

    x  y blue_count red_count blue_sum_of_weights red_sum_of_weights
 10.0  1          1         1                1.00                  1
 10.0 10          1         0                0.50                  0
  3.0  8          3         1                2.75                  4
  3.4  5          1         0                2.00                  0

Although I have to admit it would have been much simpler if we could use two functions in dcast. Supposedly you can but I keep getting an error... After some digging I found a very nice answer in this thread reshape2: multiple results of aggregation function? which defines a wrapper function for dcast as follows:

dcastMult <- function(data, formula, value.var = "value", 
                   funs = list("min" = min, "max" = max)) {
  require(reshape2)
  if (is.null(names(funs)) | any(names(funs) == "")) stop("funs must be named")
  Form <- formula(formula)
  LHS <- as.character(Form[[2]])
  if (length(LHS) > 1) LHS <- LHS[-1]
  temp <- lapply(seq_along(funs), function(Z) {
    T1 <- dcast(data, Form, value.var = value.var, 
                fun.aggregate=match.fun(funs[[Z]]), fill = 0)
    Names <- !names(T1) %in% LHS
    names(T1)[Names] <- paste(names(T1)[Names], names(funs)[[Z]], sep = "_")
    T1
  })
  Reduce(function(x, y) merge(x, y), temp)
}

Using this lovely function we get the result as follows:

result=dcastMult(df,x+y~value,funs = list("count"=length,"sum_of_weights"=sum),value.var = "weight")
Community
  • 1
  • 1
  • 1
    `data.table::dcast` allows multiple `value.var` and `fun.aggregate`.. I've posted an answer. – Arun Jul 22 '16 at 14:18
  • Yes you are right! I just had not updated the version of data table on that machine. Side question: Can you set the names in the result data table somehow? –  Jul 22 '16 at 14:29