R - Filtering rows - Keep data between two repeated values

Question

I would like to keep only rows before and after a specific values of a column in my data frame. My data frame below has a structure, where you can see that I have some sort of blocks if I can say so. For example, the data I am interested with always (or most of the time to be specific) starts at group and ends at section, anything else outside I would like to remove.

# Sample Data
df <- data.frame(
  Name = c ("x1","NA","group","Jean","Philippe","Celine","Dion","section","NA",
            "y2","z1","NA","group","Rosa","Albert","Stromae","section","NA","abc","something",
            "group","Sam","Liz"), 
  value = as.character(seq(1:23))
)

df
        Name value
1         x1     1
2         NA     2
3      group     3
4       Jean     4
5   Philippe     5
6     Celine     6
7       Dion     7
8    section     8
9         NA     9
10        y2    10
11        z1    11
12        NA    12
13     group    13
14      Rosa    14
15    Albert    15
16   Stromae    16
17   section    17
18        NA    18
19       abc    19
20 something    20
21     group    21
22       Sam    22
23       Liz    23

Since the block group:section does not always have the same information, I don't know how can I tell R to keep rows between group andsection even if they are repeated. I only came up with this, which just keeps the rows the first time R sees group andsection.

df[which(df$Name=="group")[1]:which(df$Name=="section")[1],]
      Name value
3    group     3
4     Jean     4
5 Philippe     5
6   Celine     6
7     Dion     7
8  section     8

Update: Also, sometimes in my data I will have a block that starts with group but does not have an ending section. I would like to keep this information too. Based on your solutions, I added a row with section everytime I don't have it, then apply what you proposed. I don't know if there is another way to take into account this case without adding a new row to the data.

The desired output would be


4       Jean     4
5   Philippe     5
6     Celine     6
7       Dion     7
14      Rosa    14
15    Albert    15
16   Stromae    16
22       Sam    22
23       Liz    23

Thank you guys in advance for your help.

Maël · Answer 1 · 2022-03-03T16:44:13.317

3

You can use cumsum like so:

df %>% 
  mutate(cum = lag(cumsum(Name == "group")) + cumsum(Name == "section")) %>% 
  filter(cum %% 2 == 1)

Or, using sequence in base R:

start <- which(df$Name == "group")
end <- which(df$Name == "section")

df[sequence(end-start-1, start+1),]

output

# A tibble: 7 x 3
  Name     value   cum
  <chr>    <chr> <int>
1 Jean     4         1
2 Philippe 5         1
3 Celine   6         1
4 Dion     7         1
5 Rosa     14        3
6 Albert   15        3
7 Stromae  16        3

edited Mar 03 '22 at 16:44

answered Mar 03 '22 at 16:38

Maël

45,206
3
29
67

1

Thank you @Maël !! both alternatives worked on my actual data set. I have chosen `sequence` because I understood it better. – Janet Mar 03 '22 at 17:04

score 3 · Answer 2 · answered Mar 03 '22 at 16:40

3

There are multiple instances of those cases. So, we may need to loop to get the : as : is not vectorized

i1 <- which(df$Name %in% 'group')
i2 <- which(df$Name %in% 'section')
df[unlist(Map(`:`, i1+1, i2-1)),]

-output

       Name value
4      Jean     4
5  Philippe     5
6    Celine     6
7      Dion     7
14     Rosa    14
15   Albert    15
16  Stromae    16

answered Mar 03 '22 at 16:40

akrun

874,273
37
540
662

Thank you @akrun! I do have a question though (please feel free to tell me If I have to post it): If for some reason in my data set i do have a block that strats with `group` but does not end with `section`, in this case `i2` would not work, what would be the alternative. Thanks in advance – Janet Mar 03 '22 at 17:13
@Janet I should have mentioned that this assumes equal number of group/section. If you don't have equal number of group, section, are you ignoring those cases? – akrun Mar 03 '22 at 18:27
Yes @akrun I would still want to keep theses rows. Sometimes I would have at the bottom of my data in column `Name` , `Section` followed by few rows. At the moment I just tell `R` to check , with what you suggested, if we dont have equal number of group/section, then add a row that contains `section`, and then run your code. I know it is not clever and it is lazy but I didn't know how I could handle this case. – Janet Mar 04 '22 at 02:33
@Janet Can you please update your post with new example or post as a new question because in the current question, most of the answers didn't consider those cases – akrun Mar 04 '22 at 15:27
1

Yes @akrun I will do it – Janet Mar 04 '22 at 16:16

score 2 · Answer 3 · answered Mar 03 '22 at 16:54

Another possible solution:

library(tidyverse)

df %>% 
  mutate(aux1 = if_else(Name == "group", 1, NA_real_),
         aux2 = if_else(Name == "section", 2, NA_real_),
         aux = coalesce(aux1, aux2)) %>% 
  fill(aux) %>% 
  filter(aux == 1 & Name != "group") %>% 
  select(Name, value)

#>       Name value
#> 1     Jean     4
#> 2 Philippe     5
#> 3   Celine     6
#> 4     Dion     7
#> 5     Rosa    14
#> 6   Albert    15
#> 7  Stromae    16

Stefano Barbi · Answer 4 · 2022-03-03T18:04:05.177

The following use a sort of state machine where the last state is found in the .x term within the accumulate cycle and the input is in .y:

library(dplyr)
library(purrr)

df |>
  mutate(state = accumulate(tail(Name, -1), ~{
    if (.y == "section")
      "end"
    else
      if (.y == "group")
        "start"
    else
      if (.x == "start")
        "within"
    else
      if (.x == "end")
        "outside"
    else
      .x
  }, .init = "outside"))

##>        Name value   state
##>1         x1     1 outside
##>2         NA     2 outside
##>3      group     3   start
##>4       Jean     4  within
##>5   Philippe     5  within
##>6     Celine     6  within
##>7       Dion     7  within
##>8    section     8     end
##>9         NA     9 outside
##>10        y2    10 outside
##>11        z1    11 outside
##>12        NA    12 outside
##>13     group    13   start
##>14      Rosa    14  within
##>15    Albert    15  within
##>16   Stromae    16  within
##>17   section    17     end
##>18        NA    18 outside
##>19       abc    19 outside
##>20 something    20 outside

Then you can filter by state == "within" to obtain:

#>       Name value 
#> 1     Jean     4 
#> 2 Philippe     5 
#> 3   Celine     6 
#> 4     Dion     7 
#> 5     Rosa    14 
#> 6   Albert    15 
#> 7  Stromae    16

R - Filtering rows - Keep data between two repeated values

4 Answers4