How to count the number of occurrences in a large dataset

Question

I'm trying to count the number of occurrences of each "scenario" that I have (0 to 9) in a data frame over 25 years. Basically, I have 10 000 simulations of scenarios named 0 to 9, each scenario having its own probability of occurrence.

My dataframe is too big to paste in here but here's a preview:

library(data.table)  # assumption: transpose() here is data.table::transpose

simulation=as.data.frame(replicate(10000,sample(c(0:9),size=25,replace=TRUE,prob=prob)))

simulation2=transpose(simulation)

Note: prob is a vector with the probability of observing each scenario.

   v1 v2 v3 v4 v5 v6 ... v25
1   0  0  4  0  2  0      9
2   1  0  0  2  3  0      6
3   0  4  6  2  0  0      0
4
...
10000

This is what I have tried so far:

for (i in c(1:25)){
  for (j in c(0:9)){
    f=sum(simulation2[,i]==j);
    vect_f=c(vect_f,f)
  }
  vect_f=as.data.frame(vect_f)
}

If I omit the outer "for (i in c(1:25))" loop, this returns the correct first column of the desired output. Now I am trying to replicate this over 25 years, but when I add the second 'for' loop I do not get the desired output.

The output should look like this :

      (Year) 1  2  3  4  5  6   ... 25
(Scenario)
   0         649
   1         239
   ...
   9          11

Here 649 is the number of times 'scenario 0' is observed in the first year across my 10 000 simulations.

Thanks for your help

score 1 · Accepted Answer · answered Jun 27 '19 at 04:55

We can use table on each column:

sapply(simulation2, table)

#    V1   V2   V3   V4   V5 .....
#0 1023 1050  994 1016 1022 .....
#1 1050  968  950 1001  981 .....
#2  997  969 1004  999  949 .....
#3 1031  977 1001  993 1009 .....
#4 1017 1054 1020 1003  985 .....
#......

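If some values are missing from a column, the per-column tables have different lengths and sapply returns a list instead of a matrix. Converting each column to a factor with all levels 0 to 9 keeps every scenario in the result, with a count of 0 where it does not occur: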

sapply(simulation2, function(x) table(factor(x, levels = 0:9)))
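
A minimal, self-contained sketch of the approach above, for anyone who wants to try it without the original data. The `prob` values, the seed, and the smaller number of simulations are made up for illustration, and base R's `t()` stands in for `transpose()`. It also includes a working version of the nested loop from the question, which preallocates a matrix and fills one cell per scenario and year, so the two results can be compared.

```r
set.seed(42)  # for reproducibility only

n_sim <- 1000   # fewer simulations than in the question, to keep it quick
n_years <- 25
# Hypothetical probabilities for scenarios 0..9 (not from the original post)
prob <- c(0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.03, 0.015, 0.005)

# One row per simulation, one column per year (columns are named V1..V25)
simulation2 <- as.data.frame(
  t(replicate(n_sim, sample(0:9, size = n_years, replace = TRUE, prob = prob)))
)

# Vectorised version: 10 x 25 matrix of counts (rows = scenarios, columns = years)
counts <- sapply(simulation2, function(x) table(factor(x, levels = 0:9)))

# Loop version: preallocate the result, then fill one cell per (scenario, year)
counts_loop <- matrix(0L, nrow = 10, ncol = n_years,
                      dimnames = list(as.character(0:9), names(simulation2)))
for (i in seq_len(n_years)) {
  for (j in 0:9) {
    counts_loop[j + 1, i] <- sum(simulation2[, i] == j)
  }
}

dim(counts)                  # 10 rows (scenarios) by 25 columns (years)
all(counts == counts_loop)   # both approaches give the same counts
```

Each column of `counts` sums to `n_sim`, since every simulation contributes exactly one scenario per year.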

Ronak Shah

What if this time I want to simulate the scenarios (0 to 9) for 1 year first, count how many of each scenario I would have over 25 years, and then replicate that 10 000 times? – Jng Jun 27 '19 at 13:15
@Jng Sorry, I don't understand what you mean. If you want to calculate it for only one column you can do `table(simulation2[1])` – Ronak Shah Jun 27 '19 at 14:09
Well, right now I'm simulating 10 000 times the number of scenarios (0:9) over 25 years. What I want is to simulate the occurrence of scenarios (0:9) over the span of 25 years and then repeat that 10 000 times. My output would be a 10x1 dataframe: 10 being the number of scenarios (0 to 9), and the only column being the number of occurrences over 25 years. – Jng Jun 27 '19 at 14:40
@Jng wouldn't it be just adding `rowSums` to above solution? `rowSums(sapply(simulation2, function(x) table(factor(x, levels = 0:9))))` then? It will add up occurrence of 0, 1, 2... for all the 25 years. – Ronak Shah Jun 27 '19 at 14:54
No, because the sum of each row would be more than 10 000 – Jng Jun 27 '19 at 15:49
@Jng It would be based on the `prob` you define. If you don't define anything then ideally the sum for each number should be 10000 but as this is simulation it could be more or less than that and never exact. – Ronak Shah Jun 28 '19 at 03:18
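
A rough sketch of the arithmetic discussed in these comments, using a made-up `prob` and a smaller number of simulations. Each year contributes one observation per simulation, so the `rowSums` totals add up to the number of simulations times 25. One possible reading of Jng's request (an assumption, since the thread never settles it) is to count scenarios within each 25-year simulation and then average over simulations, which gives a 10x1 result whose entries sum to 25.

```r
set.seed(1)  # for reproducibility only

n_sim <- 1000
n_years <- 25
# Hypothetical probabilities for scenarios 0..9 (not from the original post)
prob <- c(0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.03, 0.015, 0.005)

# Matrix with one row per simulation and one column per year
sims <- t(replicate(n_sim, sample(0:9, size = n_years, replace = TRUE, prob = prob)))

count_scenarios <- function(x) table(factor(x, levels = 0:9))

# Counts per year (10 x 25), then summed over the 25 years
per_year <- apply(sims, 2, count_scenarios)
totals <- rowSums(per_year)
sum(totals)        # equals n_sim * n_years

# Counts per simulation (10 x n_sim), then averaged over simulations
per_sim <- apply(sims, 1, count_scenarios)
avg_per_sim <- rowMeans(per_sim)
sum(avg_per_sim)   # equals n_years

# 10x1 data frame: average occurrences of each scenario in one 25-year run
data.frame(occurrences = avg_per_sim)
```

The two summaries differ only by a factor of `n_sim`: `avg_per_sim` is `totals / n_sim`.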

Overlytic · Answer 2 · 2019-06-27T14:00:39.367

The base R answer from Ronak works well, but I think he meant to use simulation instead of simulation2.

sapply(simulation, function(x) table(factor(x, levels = 0:9)))

I tried to do the same thing using dplyr, since I find the tidyverse code more readable.


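library(dplyr)   # %>% and count()
library(tidyr)   # gather() and spread()
library(tibble)  # rownames_to_column()

simulation %>%
  rownames_to_column("i") %>%
  gather(year, scenario, -i) %>%
  count(year, scenario) %>%
  spread(year, n, fill = 0)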

However, do note that this last option is a bit slower than the base R code (roughly twice as slow on my machine using your 10 000 row example).
