
I have a table which looks like

value (0 < v < 1),  # of events
---------------   -----------
0.1,              1000
0.5,              20000
0.7,              3000000
0.1,              400000000
0.5,              50000000000
0.9,              6000000000000
...,              ...

The value can take any number from 0 to 1, possibly with repetition, and the number of events is so large that it is inefficient to transform this into the usual vector form, like

0.1,0.1,...,0.1, 0.5,0.5,0.5, ...

When I try to apply a function, e.g. plot(), to this table, R does not merge rows with the same value but treats them separately. What would be a good way of doing some statistics with this kind of table, as if we had the following table?

value,  # of events
0.1,    400001000
0.5,    ...
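For a toy version with tiny counts, the equivalence between the two representations can be checked directly with rep() (a sketch only; the real counts are far too large to expand this way):

```r
# toy table: same shape as above, but counts small enough to expand
tab <- data.frame(value = c(0.1, 0.5, 0.1), events = c(3, 2, 4))

# expanded vector form (only feasible because the counts are tiny)
expanded <- rep(tab$value, tab$events)

# condensed form: one total per unique value, no expansion needed
condensed <- tapply(tab$events, tab$value, sum)
```

Both forms give the same answers, e.g. mean(expanded) equals weighted.mean(tab$value, tab$events).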
Michael Petrotta
HBS

3 Answers


Your question is a bit unclear, but I think you just want to sum events by each unique value? If so, there are multitudes of answers to this and related questions. Here's one approach:

#fake data
set.seed(1)
x <- data.frame(value = 1:3, events = sample(1:10, 9, TRUE))

#Option 1

> aggregate(events ~ ., data = x, FUN = "sum")
  value events
1     1     23
2     2     14
3     3     22

#Option 2
> tapply(x$events, x$value, FUN = "sum")
 1  2  3 
23 14 22 

#Option 3
> library(plyr)
> ddply(x, "value", summarize, sum = sum(events))
  value sum
1     1  23
2     2  14
3     3  22

#Option 4
> library(data.table)
> x <- data.table(x)
> x[, sum(events), by = value]
     value V1
[1,]     1 23
[2,]     2 14
[3,]     3 22
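Two more base-R possibilities in the same spirit (sketches on the same x; no extra packages needed, and both still work after x has been converted to a data.table):

```r
#Option 5: contingency-table style
> xtabs(events ~ value, data = x)

#Option 6: returns a one-column matrix of sums
> rowsum(x$events, x$value)
```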

These solutions (and others) scale differently as your data grows. I gave a pretty comprehensive answer comparing timings and methods here

Chase

As a first step, here's how to convert your first table into the second form.

Construct data:

dd <- setNames(as.data.frame(matrix(c(0.1,1000,
                                      0.5,20000,
                                      0.7,3000000,
                                      0.1,400000000,
                                      0.5,50000000000,
                                      0.9,6000000000000),
                                    ncol=2,byrow=TRUE)),
                             c("value","count"))

Use tapply to condense the data

dd2 <- tapply(dd$count,dd$value,sum)

Then use melt to get the data into a (possibly) more useful format:

library(reshape2)
(dd3 <- melt(dd2,varnames="value",value.name="count"))
##   value        count
## 1   0.1 4.000010e+08
## 2   0.5 5.000002e+10
## 3   0.7 3.000000e+06
## 4   0.9 6.000000e+12

You may want to be careful when adding very small and very large numbers: doubles carry only about 16 significant digits, so small terms can be swallowed entirely by rounding.
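A quick illustration of that caveat: 2^53 is the point where consecutive whole numbers stop being exactly representable in double precision, so a small addend can vanish outright:

```r
2^53 + 1 == 2^53    # TRUE: the +1 is silently lost
(2^53 + 2) - 2^53   # 2: even increments at this magnitude still survive
```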

For the "what kind of statistics should I do?" part of the question -- sorry, that's too vague. What do you want to find out ... ???

Ben Bolker
  • Sorry for being vague. I was wondering how do I apply basic functions like mean or median to this table to get a sense of dealing with data in R. As you already noticed, I'm new to R but I have to learn this as quickly as possible :-) – HBS Sep 08 '12 at 19:37

If you want a weighted mean:

 weighted.mean(dd$value, dd$count)
[1] 0.8966414
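The same number falls out of the definition directly, which makes a handy base-R sanity check:

```r
# weighted mean by hand: sum(w * x) / sum(w)
sum(dd$value * dd$count) / sum(dd$count)
## [1] 0.8966414
```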

Weighted median: (and there are several other 'weighted' functions in Hmisc)

 library(Hmisc)
 wtd.quantile(dd$value, dd$count, .5)
#50% 
#0.9 

To plot, just use barplot:

 barplot(dd$count)
 barplot(dd$count, log="y")   # log scale helps with the huge range of counts
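Note that plotting dd$count directly draws one bar per row, so repeated values stay separate. Condensing first (with tapply, for example) gives one bar per unique value, labelled by value:

```r
# one bar per unique value, value labels on the x-axis
cnt <- tapply(dd$count, dd$value, sum)
barplot(cnt, log = "y", xlab = "value", ylab = "# of events")
```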
IRTFM