Group/bin/bucket data in R and get count per bucket and sum of values per bucket

Question

I wish to bucket/group/bin data :

C1             C2       C3
49488.01172    0.0512   54000
268221.1563    0.0128   34399
34775.96094    0.0128   54444
13046.98047    0.07241  61000
2121699.75     0.00453  78921
71155.09375    0.0181   13794
1369809.875    0.00453  12312
750            0.2048   43451
44943.82813    0.0362   49871
85585.04688    0.0362   18947
31090.10938    0.0362   13401
68550.40625    0.0181   14345

I want to bucket it by C2 values but I wish to define the buckets e.g. <=0.005, <=.010, <=.014 etc. As you can see, the bucketing will be uneven intervals. I want the count of C1 per bucket as well as the total sum of C1 for every bucket.

I don't know where to begin as I am fairly new a user of R. Is there anyone willing to help me figure out the code or direct to me to an example that will work for my needs?

EDIT: added another column C3. I need sum of C3 per bucket as well at the same time as sum and count of C1 per bucket

Typically when you cross post it's considered polite to let people know you've done so http://www.talkstats.com/showthread.php/59052-Group-bin-bucket-data-in-R-and-get-count-per-bucket-and-sum-of-values-per-bucket — Tyler Rinker, Jan 04 '15 at 05:29
Thanks Tyler, these are two different sites so I wouldn't have thought of it being neccessary — Freewill, Jan 04 '15 at 05:34
@user3007275 DO you stilll want to use `C2` as the grouping variable? — akrun, Jan 05 '15 at 03:34
thanks akrun, interesting how R is processing the do.call statement overall. Now that you have two variables C1 and C3 being grouped, the FUN is still using only one operator X instead of two -one each for C1 and C3. What is the logic here? — Freewill, Jan 05 '15 at 03:41
@user3007275 I think you have to look at `?aggregate.formula` ie. `formula: a formula, such as ‘y ~ x’ or ‘cbind(y1, y2) ~ x1 + x2’, where the ‘y’ variables are numeric data to be split into groups according to the grouping ‘x’ variables (usually factors).` — akrun, Jan 05 '15 at 03:45
thanks Akrun, i think i understand it a little better now.much appreciated. — Freewill, Jan 05 '15 at 03:49

akrun · Accepted Answer · 2015-01-05T03:36:31.197

17

From the comments, "C2" seems to be "character" column with % as suffix. Before, creating a group, remove the % using sub, convert to "numeric" (as.numeric). The variable "group" is created (transform(df,...)) by using the function cut with breaks (group buckets/intervals) and labels (for the desired group labels) arguments. Once the group variable is created, the sum of the "C1" by "group" and the "count" of elements within "group" can be done using aggregate from "base R"

df1 <-  transform(df, group=cut(as.numeric(sub('[%]', '', C2)), 
    breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
      labels=c('<0.005', 0.005, 0.01, 0.014)))

 res <- do.call(data.frame,aggregate(C1~group, df1, 
        FUN=function(x) c(Count=length(x), Sum=sum(x))))

 dNew <- data.frame(group=levels(df1$group))
 merge(res, dNew, all=TRUE)
 #   group C1.Count    C1.Sum
 #1 <0.005        2 3491509.6
 #2  0.005       NA        NA
 #3   0.01        2  302997.1
 #4  0.014        8  364609.5

or you can use data.table. setDT converts the data.frame to data.table. Specify the "grouping" variable with by= and summarize/create the two variables "Count" and "Sum" within the list(. .N gives the count of elements within each "group".

 library(data.table)
  setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]

Or using dplyr. The %>% connect the LHS with RHS arguments and chains them together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get summary count and sum of the concerned column. summarise_each would be useful if there are more than one column.

 library(dplyr)
 df1 %>%
      group_by(group) %>% 
      summarise_each(funs(n(), Sum=sum(.)), C1)

Update

Using the new dataset df

df1 <- transform(df, group=cut(C2,  breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
                             labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1, 
       FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
#  group C1.Count    C1.Sum C3.Count C3.Sum
#1 <0.005        2 3491509.6        2  91233
#2   0.01        2  302997.1        2  88843
#3  0.014        8  364609.5        8 268809

and you can do the merge as detailed above.

The dplyr approach would be the same except specifying the additional variable

 df1%>%
      group_by(group) %>%
       summarise_each(funs(n(), Sum=sum(.)), C1, C3)
 #Source: local data frame [3 x 5]

 #  group C1_n C3_n    C1_Sum C3_Sum
 #1 <0.005    2    2 3491509.6  91233
 #2   0.01    2    2  302997.1  88843
 #3  0.014    8    8  364609.5 268809

data

df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813, 
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%", 
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%", 
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1", 
"C2"), row.names = c(NA, -12L), class = "data.frame")

edited Jan 05 '15 at 03:36

answered Jan 04 '15 at 05:16

akrun

874,273
37
540
662

Thanks akrun, I'm also trying to understand the logic/meaning of the code you provided. I found the first one more simpler than the other two to understand. Is there a resource (book/website) that I can follow through on these 3 suggestions to understand what is happening and perhaps see an example or two be applied? – Freewill Jan 04 '15 at 05:26
Thanks, I just tried the first suggestion. C2 is actually in percentage, so when I put % sign in breaks=c(-Inf,0.005%,0.10%,0.014%,Inf), I get an error. It works if % sign isn't used. Also, after binning I need to plot this data for a histogram. How do I represent the categories e.g. the one with <=0.05 will be 0.05, <= .10 will be .10 etc. – Freewill Jan 04 '15 at 05:44
akrun, thanks. I tried the first code and it seemed to have worked. I will double check the output again. One thing I noticed it does is that if there are no values for a given bin, it omits that bin from the output. I'd like to have all the bins irrespective of whether there are any values or not. – Freewill Jan 04 '15 at 07:53
Akrun, I just noticed that your code takes C2 as a character column. Actually its a numeric value (in %). I just want it labelled as percentage but the data be in its form as is. I used the first code you had posted. Why was there a need to create a new dataset and then merge it with the old one? what is dNew doing? (the logic behind that part of the code) – Freewill Jan 04 '15 at 08:05
@user3007275 But, you said that the data had `%` as suffix (or I might have misunderstand). If it is already a numeric column, you can use my previous code. `dNew` has all the levels, while the original dataset may not have based on the `groupings`. So, if there is not a particular group, the aggregate will not return that group. By merging with the `dNew`, it is ensured to get that combination as well. – akrun Jan 04 '15 at 08:06
@user3007275 If you want to `label` as `%`, use `paste`, ie. `paste0(df1$group, '%')` – akrun Jan 04 '15 at 08:10
Great, I tried the dNew and merge and it ends up creating a new dataset with more number of obversations than the original dataset with NAs for those where the data does not exist. Also, when I plot the graphs, it doesn't come out accurately/clean and R removes some rows citing "missing values". Is there another way to get those bins in the first attempt ? – Freewill Jan 04 '15 at 08:29
@user3007275 I am not sure what values you need for those `missing combinations`. If it is `0`, just replace the `NA` with that. – akrun Jan 04 '15 at 09:57
Akrun, I'm trying to understand the "do.call" usage in the code. I understand, rather superficially, that do.call is calling a function i.e. aggregate. Aggregate is combining C1 by variable 'group' in the dataframe df1. But what is the purpose of including "data.frame" in do.call function and why do we have FUN defined within aggregate. I looked up the structure of aggregate function online , but i don't understand how do.call and aggregate are used individually and how they are being used together here. Could you direct me to a good resource to understand this? or provide some explanation? – Freewill Jan 04 '15 at 19:39
Also, with the function FUN, how come there are no curly brackets used and what is the variable 'x' ? we never call that variable or define it so how is R counting the length and summing that variable without knowing which variable we are talking about explicitly. I mean, we never say that x = C1 explicitly. What if I wanted to group C1 by variable 'group' and then wanted a sum and count of C1 as well as another variable lets say C3. – Freewill Jan 04 '15 at 19:51
@user3007275 Regarding your first comment, because we used `c(Count=length(x), Sum=sum(x))`, the result should be a matrix for the 3rd column. Check this. ` res <- aggregate(C1~group, df1, FUN=function(x) c(Count=length(x), Sum=sum(x))); str(res)`. The `do.call(data.frame,..` converts the matrix to data.frame and you will get 4 regular columns instead of 3 columns. – akrun Jan 05 '15 at 03:19
@user3007275 It is in the formula `aggregate(C1~group....`If You want both `C1 and C3`, perhaps `aggregate(cbind(C1,C3)~group,..` The expected result is not clear for me. You may have to post an example dataset. Also, if the question is not related to your original post, please do post as a separate one. – akrun Jan 05 '15 at 03:24
Akrun, I added the third column C3 in the dataset. I need the sum of C3 per bucket as well in addition to sum AND count of C1 per bucket which your code is already generating. – Freewill Jan 05 '15 at 03:33
akrun, I just checked "res<-aggregate(C1~group,df1,FUN=function(x) c(Count=length(x), Sum=sum(x))); str(res) and it shows res as a data frame and not a matrix. – Freewill Jan 05 '15 at 04:19
@user3007275 Check the `str(res)` You will see that the `C1` column is `matrix`.`is.matrix(res$C1)#[1] TRUE` – akrun Jan 05 '15 at 04:20

Group/bin/bucket data in R and get count per bucket and sum of values per bucket

1 Answers1

Update

data

Linked

Related