0

I would like to count the number of successive occurrences by two groups while ignoring missing values. Suppose I have a dataset with 3 identifiers, one identifies the country, another variable that identifies the person, another refers to time, and a dummy variable that identifies the occurences. Something like this:

df <- data.frame(country = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                 person = c("a", "a", "a", "b", "b", "a", "a", "b", "b", "b","b"),
                 time = c(1,2,3,1,2,1,2,1,2,3,4),
                 ocurrences = c(1,1,NA,1,0,0,NA,1,1,0,1))

Which will look like:

   country person time ocurrences
1        A      a    1          1
2        A      a    2          1
3        A      a    3         NA
4        A      b    1          1
5        A      b    2          0
6        B      a    1          0
7        B      a    2         NA
8        B      b    1          1
9        B      b    2          1
10       B      b    3          0
11       B      b    4          1

I would like to generate another column with the number of consecutive occurrences by these two groups (country, person) across time. At this point, I am not sure what is the best way to deal with missing values, whether to account them as 0 or leave it as missings. For now, let's just say I will leave them as missing values. So the column should begin counting the number of successive ocurrences from the first 0 by person in a country. If the person has a 0 in between, it should stop computing and begin again on the next occurrence. So the output should look like this:

   country person time ocurrences count
1        A      a    1          1     1
2        A      a    2          1     2
3        A      a    3         NA    NA
4        A      b    1          1     1
5        A      b    2          0     0
6        B      a    1          0     0
7        B      a    2         NA    NA
8        B      b    1          1     1
9        B      b    2          1     2
10       B      b    3          0     0
11       B      b    4          1     1

I edited and added another row as an example. Notice that rows 8 and 9 the column counts the number of successive occurrences, but it stops in row 10 because there was no occurrence. Then it starts computing again in row 11. What is the best way to accomplish it?

Victor Shin
  • 183
  • 9
  • Please clarify: When you have zero occurrences count is zero, but when you have 1, count accumulates. Is this okay?. Can you have more than 1 occurence? – Ric Oct 17 '22 at 21:12
  • Yes, maybe I was not clear with my question. The column I am trying to create should compute successive occurrences. I think I will add another column that corresponds to time, maybe it will be clearer. Say the occurrence happened twice in a row. In the first row it should compute 1, in the second it should compute 2 because it was the second time in a row. However, if there is a 0 in between for the same person, then I would like the column to compute again starting in the next occurrence. I will edit the question and add a time column, I hope my explanation was clear. – Victor Shin Oct 17 '22 at 21:19
  • I am still a beginner in R. Can you point me where can I find resources to write this function @stefan_aus_hannover? I wouldn't know how to do it. – Victor Shin Oct 17 '22 at 21:39

2 Answers2

1

Well, in order to group and cumsum with your conditions, i must create first an auxiliary variable rleid (note that a function with the same name that do this can be found in package data.table) Then group by country,person, relid, then cumsum ocurrences. NA's are propagated naturally.

library(dplyr)
df <- data.frame(country = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                 person = c("a", "a", "a", "b", "b", "a", "a", "b", "b", "b","b"),
                 time = c(1,2,3,1,2,1,2,1,2,3,4),
                 ocurrences = c(1,1,NA,1,0,0,NA,1,1,0,1))
df$rleid = with(rle(ifelse(df$ocurrences==0,NA,df$ocurrences)), rep(seq_along(values), lengths))
df <- mutate(group_by(df, country, person, rleid), count=if_else(ocurrences>0,cumsum(ocurrences),0))
df <- ungroup(df)
df$rleid<-NULL
df
#> # A tibble: 11 × 5
#>    country person  time ocurrences count
#>    <chr>   <chr>  <dbl>      <dbl> <dbl>
#>  1 A       a          1          1     1
#>  2 A       a          2          1     2
#>  3 A       a          3         NA    NA
#>  4 A       b          1          1     1
#>  5 A       b          2          0     0
#>  6 B       a          1          0     0
#>  7 B       a          2         NA    NA
#>  8 B       b          1          1     1
#>  9 B       b          2          1     2
#> 10 B       b          3          0     0
#> 11 B       b          4          1     1

See this for an explanation of rleid, which basically allows to create groups when data changes with respect to previous row.

Ric
  • 5,362
  • 1
  • 10
  • 23
1

This oneliner will do the job

library(data.table)

setDT(df)

df[, count := cumsum(ocurrences), .(country, person, rleid(ocurrences))]

results

df

    country person time ocurrences count
 1:       A      a    1          1     1
 2:       A      a    2          1     2
 3:       A      a    3         NA    NA
 4:       A      b    1          1     1
 5:       A      b    2          0     0
 6:       B      a    1          0     0
 7:       B      a    2         NA    NA
 8:       B      b    1          1     1
 9:       B      b    2          1     2
10:       B      b    3          0     0
11:       B      b    4          1     1
Merijn van Tilborg
  • 5,452
  • 1
  • 7
  • 22