1

For a data frame populated from a SQL query which looks like this:

Company        Month         Total_Count
ABC            2012-03         10
ABC            2009-01         1
DEF            2011-01         29
GHI            2001-09         10
GHI            2008-01         9
GHI            2010-09         100

I would like to come up with a summary report that would output this:

Company          Total         Months
ABC              11            10         1
DEF              29            29
GHI              119           10      9    100

where column Total would be the sum across all the months
and column Months would be the individual counts for each month encountered for that org. I would love to see if it is possible to somehow squeeze in the actual Month also. Something like:

Company          Total         Months
                               2009-01    2012-03
ABC              11            10      1
                               2011-01
DEF              29            29
                               2001-09   2008-01   2010-09
GHI              119           10        9         100

So far, I have only been able to get the first 2 columns of the desired result above by doing:

df1 <- ddply(resultSet, .(company_name), summarize, sum(as.numeric(total_count)))

But I am not sure how I can append information for individual months

name_masked
  • 9,544
  • 41
  • 118
  • 172
  • Those are probably not Date or Date-time columns in R. Youare probably aggregating on 'factor' or 'character' classed variables. – IRTFM Sep 14 '12 at 00:13

2 Answers2

1
library(reshape2)
res <- ddply(resultSet, .(Company), summarize, Total = sum(Total_Count))
# if value.var is not specified, dcast uses the last column. see ?guess_value
months <- dcast(resultSet, Company ~ Month, value.var = "Total_Count", sum)
join(res, months, by = "Company")
  Company Total 2001-09 2008-01 2009-01 2010-09 2011-01 2012-03
1     ABC    11      NA      NA       1      NA      NA      10
2     DEF    29      NA      NA      NA      NA      29      NA
3     GHI   119      10       9      NA     100      NA      NA

it was important in context of the question to have the aggregrate function.

name_masked
  • 9,544
  • 41
  • 118
  • 172
Maiasaura
  • 32,226
  • 27
  • 104
  • 108
  • I did not understand how did the function `dcast` use `Total_Count` column to generate that table. What if tomorrow I have another Count column denoting something else. Is it possible to specify which column can be used to spread out the table ? – name_masked Sep 12 '12 at 22:15
  • It automatically guessed `Total_Count` since that was the last (and in this case, only) column. You can explicitly specify this using `value.var`. By default it guesses that the last column is the value column. See `?guess_value` for more information. – Maiasaura Sep 12 '12 at 22:24
  • So, I added another count column in the data frame. Even after mentioning `value.var` to the new column, the `dcast` seems to be referring to the old column when creating the new table. I am not sure if there is a bug in the code or if I am not doing something right. Can you please help me? – name_masked Sep 13 '12 at 16:50
  • The reason for the miscalculation in my results was because we did not have aggrgrate function name i.e. `month <- dcast(resultSet, Company ~ Month, value.var="Total_Count", sum)`. That would be the complete solution. – name_masked Sep 13 '12 at 21:36
0

Since you gave it no name I chose "cdf"

agg1 <- aggregate(namcounts, cdf['Company'], function(x) sumC=sum(x) )
aggC <- lapply( split(namcounts, cdf['Company']), function(x)  listC= list(x) ) 
agg1$clist <- sapply(aggC, "[[" ,1)
agg1
  Company   x      clist
1     ABC  11      10, 1
2     DEF  29         29
3     GHI 119 10, 9, 100

The other way to record this would be to paste the month-counts with a separator. At the moment these are stored as a list.

IRTFM
  • 258,963
  • 21
  • 364
  • 487