Use function like cumulative sum by group or by each list element in R

Question

I have the following data:

col1 = c(rep("a",4),rep("b",8),rep("c",6), rep("d",2))
col2 = sample(-100:250, 20)
col3 = cumsum(col2)
data = data.table(col1, col2, col3)

and data.table:

    col1 col2 col3
 1:    a   56   56
 2:    a   90  146
 3:    a   85  231
 4:    a  214  445
 5:    b  -39  406
 6:    b  116  522
 7:    b   42  564
 8:    b  131  695
 9:    b  161  856
10:    b   54  910
11:    b   15  925
12:    b  229 1154
13:    c  166 1320
14:    c  224 1544
15:    c  -53 1491
16:    c   87 1578
17:    c -100 1478
18:    c  -11 1467
19:    d   28 1495
20:    d  143 1638

As you see it's just grouped by col1. I'd like to make some calculation (like cumsum, count if, etc) based on groups in col1.

In the end I'd would like to have:

col1 colsum countif>0  countif<0
a    445    4          0
b    709    7          1
c    313    3          3
d    171    2          0

@commentators Guys! Please ... I did two solutions, the first very unsightly (no sense to put it here, but is based on making a list and loop with calculation for each element of list) and second this is:

a1 = aggregate (col2 ~ col1, sum, date = date)
a2 = aggregate (col2> 0 ~ col1, sum, date = date)
a3 = aggregate (col2 <0 ~ col1, sum, date = date)
cbind (a1, a2 counfif_1 = [2], counfif_2 = a3 [2])

I'm looking just for something nice and cool.

Okay, and after reading the package vignettes where exactly are you stuck? — Roland, Jun 11 '14 at 10:44
Your question amounts to saying 'do this for me'. That is not what this site is for. — asb, Jun 11 '14 at 10:52
Please see "[What types of questions should I avoid asking?](http://stackoverflow.com/help/dont-ask)". — hrbrmstr, Jun 11 '14 at 11:04
Guys! Please ... I did two solutions, the first very unsightly and second this: a1 = aggregate (col2 ~ col1, sum, date = date) a2 = aggregate (col2> 0 ~ col1, sum, date = date) a3 = aggregate (col2 <0 ~ col1, sum, date = date) cbind (a1, a2 counfif_1 = [2], counfif_2 = a3 [2]) I'm looking just for something nice and cool. — MarekW, Jun 11 '14 at 11:13

score 1 · Answer 1 · answered Jun 11 '14 at 11:04

You can use dplyr to achieve something similar

library(dplyr)

set.seed(1)
col1 <- c(rep("a", 4), rep("b", 8), rep("c", 6), rep("d",2))
col2 <- sample(-100:250, 20)
data <- tbl_df(data.frame(col1, col2))
str(data)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    20 obs. of  3 variables:
##  $ col1: Factor w/ 4 levels "a","b","c","d": 1 1 1 1 2 2 2 2 2 2 ...
##  $ col2: int  -7 30 99 216 -31 210 225 127 115 -79 ...


data %>%
    group_by(col1) %>%
        summarise(colsum = sum(col2),
                  countifpos = sum(col2 > 0),
                  countifneg = sum(col2 < 0))
## Source: local data frame [4 x 4]

##   col1 colsum countifpos countifneg
## 1    a    338          3          1
## 2    b    497          4          4
## 3    c    758          6          0
## 4    d    184          2          0

David Arenburg · Accepted Answer · 2014-06-11T12:09:50.393

1

data[, list(colsum = sum(col2),
            `countif>0` = sum(col2 > 0),
            `countif<0` = sum(col2 < 0)), by = col1]

##     col1 colsum countif>0 countif<0
## 1:    a    445         4          0
## 2:    b    709         7          1
## 3:    c    313         3          3
## 4:    d    171         2          0

edited Jun 11 '14 at 12:09

answered Jun 11 '14 at 11:14

David Arenburg

91,361
17
137
196

You are right! This is data.table, that's why I changed my acceptance. – MarekW Jun 11 '14 at 11:26
@MarekW Just for the records, `dplyr` works with `data.table` too! – dickoa Jun 11 '14 at 11:34
@DavidArenburg Nice `data.table` solution but I think that `sum(col2 >0)` will be probably faster than `length(col2[col2 > 0])` – dickoa Jun 11 '14 at 11:36
@dickoa, your soulution returned an error. Have you tried it? – David Arenburg Jun 11 '14 at 11:41
@dickoa, Ok, you convinced me – David Arenburg Jun 11 '14 at 12:10

shecode · Answer 3 · 2014-06-11T15:10:30.887

0

You can use tapply to get summaries by group for instance:

this is where you define the metrics you are calculating

metrics =  function(x) { c(sum(x), length(x[x<0]) , length(x[x>0]) )}

the you use the metrics function to calculate your metrics by group via a tapply function

tapply (data$col2, data$col1, metrics)
$a
[1] 241 -50 291

$b
[1] 526 -86 612

$c
[1] 483 -94 577

$d
[1] -88 -88   0

You can then convert this output into a data frame as requested

edited Jun 11 '14 at 15:10

answered Jun 11 '14 at 11:22

shecode

1,716
6
32
50

1

Sure, but data.table is more efficient for large datasets. – Roland Jun 11 '14 at 12:20
1

I don't think that even returns the desired result. You should probably change the last two `sum`s to `length` maybe? – David Arenburg Jun 11 '14 at 14:25

Use function like cumulative sum by group or by each list element in R

3 Answers3