Select rows from df that make subgroups (one by one) based on their value

Question

I still have a hurdle with my data. Here is a reproducible df:

signal1 <- c(rep(1:6))
signal2 <- c(rep(7:12))
signal3 <- c(rep(13:18))
signal4 <- c(rep(19:24))
val <- c(2.5,3.2,2.9,0.1,0.4,4.1)
tag <- c('str1','str2','str3','str4','str5','str6')
gene <- c('ABC','ABC','ABC','DEF','DEF','DEF')
df <- data.frame(signal1,signal2,signal3,signal4,gene,val)

  signal1 signal2 signal3 signal4 gene val
1       1       7      13      19  ABC 2.5
2       2       8      14      20  ABC 3.2
3       3       9      15      21  ABC 2.9
4       4      10      16      22  DEF 0.1
5       5      11      17      23  DEF 0.4
6       6      12      18      24  DEF 4.1

Example I

I'd like to select rows that form a streak (a series of 2 or more consecutive rows) with val greater than or equal to, say, 2.5, within each gene group. The rows must be adjacent, one right after another, so the desired output is:

  signal1 signal2 signal3 signal4 gene val
1       1       7      13      19  ABC 2.5
2       2       8      14      20  ABC 3.2
3       3       9      15      21  ABC 2.9

Three rows from group ABC fulfil the conditions: they are consecutive, the series has length 3, and all of them have val >= 2.5.

Example II

For dataset:

  signal1 signal2 signal3 signal4 gene val
1       1       7      13      19  ABC 2.5
2       2       8      14      20  ABC 0.2
3       3       9      15      21  ABC 2.9
4       4      10      16      22  DEF 0.1
5       5      11      17      23  DEF 0.4
6       6      12      18      24  DEF 4.1

The result is an empty df, because no rows in either group form a streak.

Example III

  signal1 signal2 signal3 signal4 gene val
1       1       7      13      19  ABC 0.5
2       2       8      14      20  ABC 3.2
3       3       9      15      21  ABC 2.9
4       4      10      16      22  DEF 7.1
5       5      11      17      23  DEF 4.4
6       6      12      18      24  DEF 2.1

Output:

  signal1 signal2 signal3 signal4 gene val
2       2       8      14      20  ABC 3.2
3       3       9      15      21  ABC 2.9
4       4      10      16      22  DEF 7.1
5       5      11      17      23  DEF 4.4

Two streaks of consecutive rows with val >= 2.5, one in each group.

Example IV

Let's take a bigger dataset:

   signal1 signal2 signal3 signal4 gene  val
1        1      11      21      31  ABC  0.5
2        2      12      22      32  ABC  3.2
3        3      13      23      33  ABC  2.9
4        4      14      24      34  ABC  7.1
5        5      15      25      35  ABC  0.4
6        6      16      26      36  DEF  4.1
7        7      17      27      37  DEF  6.2
8        8      18      28      38  DEF  0.2
9        9      19      29      39  DEF  3.2
10      10      20      30      40  DEF 12.1

And the output:

   signal1 signal2 signal3 signal4 gene  val
2        2      12      22      32  ABC  3.2
3        3      13      23      33  ABC  2.9
4        4      14      24      34  ABC  7.1
6        6      16      26      36  DEF  4.1
7        7      17      27      37  DEF  6.2
9        9      19      29      39  DEF  3.2
10      10      20      30      40  DEF 12.1

I hope you can see what I'm looking for.

I tried to do something with dplyr:

df %>%
  group_by(gene) %>%
  group_by(val >= 2.5)

And result for data from Example II:

# A tibble: 6 x 7
# Groups:   FC >= 2.5 [2]
  signal1 signal2 signal3 signal4 gene     FC `FC >= 2.5`
    <int>   <int>   <int>   <int> <fct> <dbl> <lgl>      
1       1       7      13      19 ABC   2.50  T          
2       2       8      14      20 ABC   2.40  F          
3       3       9      15      21 ABC   2.90  T          
4       4      10      16      22 DEF   0.100 F          
5       5      11      17      23 DEF   0.400 F          
6       6      12      18      24 DEF   4.10  T

Now I need to select the rows where T occurs in at least two consecutive rows within a group. In this case there is no such situation...

I'll be very grateful for the help.

EDIT:

Answer proposed by akrun does the trick: For dataset:

   signal1 signal2 signal3 signal4 gene  val
1        1      11      21      31  ABC  0.5
2        2      12      22      32  ABC  3.2
3        3      13      23      33  ABC  0.9
4        4      14      24      34  ABC  7.1
5        5      15      25      35  ABC  0.4
6        6      16      26      36  DEF  4.1
7        7      17      27      37  DEF  6.2
8        8      18      28      38  DEF  0.2
9        9      19      29      39  DEF  0.2
10      10      20      30      40  DEF 12.1

I'd like to get only the two DEF rows, numbers 6 and 7.

And we have:

# A tibble: 2 x 6
  signal1 signal2 signal3 signal4 gene    val
    <int>   <int>   <int>   <int> <fct> <dbl>
1       6      16      26      36 DEF    4.10
2       7      17      27      37 DEF    6.20

Works great!

EDIT #2:

Unfortunately I found a small bug:

For data:

   signal1 signal2 signal3 signal4 gene  val
1        1      11      21      31  ABC  0.5
2        2      12      22      32  ABC  3.2
3        3      13      23      33  ABC  7.9
4        4      14      24      34  DEF  8.1
5        5      15      25      35  DEF  0.4
6        6      16      26      36  DEF  4.1
7        7      17      27      37  GHI  6.0
8        8      18      28      38  GHI  0.2
9        9      19      29      39  GHI  8.2
10      10      20      30      40  JKL 12.1

Only rows 2 and 3 should be returned, but after:

f1(df, gene, val)

We have:

# A tibble: 6 x 6
  signal1 signal2 signal3 signal4 gene    val
    <int>   <int>   <int>   <int> <fct> <dbl>
1       2      12      22      32 ABC    3.20
2       3      13      23      33 ABC    7.90
3       4      14      24      34 DEF    8.10
4       6      16      26      36 DEF    4.10
5       7      17      27      37 GHI    6.00
6       9      19      29      39 GHI    8.20

However, your first code:

df %>% 
  group_by(gene, grp = rleid(val >= 2.5)) %>%
  filter(val >= 2.5, n() > 1) %>%
  ungroup %>%
  select(-grp)

Returned:

# A tibble: 2 x 6
  signal1 signal2 signal3 signal4 gene    val
    <int>   <int>   <int>   <int> <fct> <dbl>
1       2      12      22      32 ABC    3.20
2       3      13      23      33 ABC    7.90

I think that tidyverse masked dplyr functions. After restarting the R session:

Dataset:

signal1 <- c(rep(1:10))
signal2 <- c(rep(11:20))
signal3 <- c(rep(21:30))
signal4 <- c(rep(31:40))
val <- c(0.5,3.2,7.9,8.1,4.4,0.1,6.0,0.2,8.2,12.1)
tag <- c('str1','str2','str3','str4','str5','str6','str7','str8','str9','str10')
gene <- c('ABC','ABC','ABC','DEF','DEF','DEF','GHI','GHI','GHI','JKL')
df <- data.frame(signal1,signal2,signal3,signal4,gene,val)
df
   signal1 signal2 signal3 signal4 gene  val
1        1      11      21      31  ABC  0.5
2        2      12      22      32  ABC  3.2
3        3      13      23      33  ABC  7.9
4        4      14      24      34  DEF  8.1
5        5      15      25      35  DEF  4.4
6        6      16      26      36  DEF  0.1
7        7      17      27      37  GHI  6.0
8        8      18      28      38  GHI  0.2
9        9      19      29      39  GHI  8.2
10      10      20      30      40  JKL 12.1

Result obtained with:

df %>% 
  group_by(gene, grp = rleid(val >= 2.5)) %>%
  filter(val >= 2.5, n() > 1) %>%
  ungroup %>%
  select(-grp)

CORRECT

# A tibble: 4 x 6
  signal1 signal2 signal3 signal4 gene    val
    <int>   <int>   <int>   <int> <fct> <dbl>
1       2      12      22      32 ABC    3.20
2       3      13      23      33 ABC    7.90
3       4      14      24      34 DEF    8.10
4       5      15      25      35 DEF    4.40

Result obtained with function:

f1 <- function(dat, grp1, grp2) {
  grp1 <- dplyr::enquo(grp1)
  grp2 <- dplyr::enquo(grp2)
  dat %>%
    dplyr::group_by(!! grp1) %>%
    dplyr::group_by(grp = data.table::rleid(!!(grp2) >= 2.5), add = TRUE) %>%
    dplyr::filter(val >= 2.5, n() > 1) %>%
    ungroup %>%
    dplyr::select(-grp)
}

# A tibble: 6 x 6
  signal1 signal2 signal3 signal4 gene    val
    <int>   <int>   <int>   <int> <fct> <dbl>
1       2      12      22      32 ABC    3.20
2       3      13      23      33 ABC    7.90
3       4      14      24      34 DEF    8.10
4       5      15      25      35 DEF    4.40
5       7      17      27      37 GHI    6.00
6       9      19      29      39 GHI    8.20

Unfortunately it isn't correct: there is no streak of consecutive rows in GHI...

Could you check the packages loaded in your environment? I think there might be another package that has `filter`. One option is to specify `dplyr::filter` explicitly — akrun, Mar 26 '18 at 07:31
I have 0.1.6.9003. So yours should be fine. However, one thing I would try is on a fresh session load only `library(dplyr);library(data.table)` and then run it. The reason is that I haven't loaded `library(tidyverse)` — akrun, Mar 26 '18 at 08:52
Yes, I even removed the whole `tidyverse`. R has been restarted and only the `dplyr` and `data.table` libraries have been loaded. Unfortunately no change. Nevertheless, your first code, the one without the function, did the trick I think. — Adamm, Mar 26 '18 at 09:10
If I load `tidyverse` fresh on a system `library(tidyverse)# Loading tidyverse: ggplot2 Loading tidyverse: tibble Loading tidyverse: tidyr Loading tidyverse: readr Loading tidyverse: purrr Loading tidyverse: dplyr Conflicts with tidy packages --------------------------------------------------- filter(): dplyr, stats### lag(): dplyr, stats` It is having some masking on `filter` — akrun, Mar 26 '18 at 09:11
In the `filter` step, I changed it to `dplyr::filter(val >= 2.5, n() >1)` and now the `df` output is `f1(df, gene, val)# # A tibble: 4 x 6 signal1 signal2 signal3 signal4 gene val 1 2 12 22 32 ABC 3.20 2 3 13 23 33 ABC 7.90 3 4 14 24 34 DEF 8.10 4 5 15 25 35 DEF 4.40` — akrun, Mar 26 '18 at 09:14
Dear Akrun, I reinstalled `dplyr` and `data.table` and I also removed `tidyverse`. I restarted R and finally the function worked! You are the god of R! I'm more than impressed — Adamm, Mar 26 '18 at 09:16
I don't usually load `tidyverse` because of this problem. Glad to know that the function worked well for you. I learn things based on questions from you guys. Nothing special about me. — akrun, Mar 26 '18 at 09:17
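
A note on the masking discussed in these comments: one way to check which package a bare function name resolves to in the current session is sketched below. It uses only base R. The helper name `where_from` is hypothetical, introduced just for this illustration.

```r
# Report which package (namespace) a function name currently resolves to.
# where_from() is an illustrative helper, not part of any package.
where_from <- function(name) {
  f <- get(name, mode = "function")
  environmentName(environment(f))
}

# Typically "stats" when dplyr is not attached; after library(dplyr)
# the same call would be expected to report "dplyr" instead.
where_from("filter")

# conflicts() lists names that exist in more than one attached package.
conflicts(detail = FALSE)
```

If the reported package is not the one you expect, calling the function with an explicit prefix such as `dplyr::filter` sidesteps the search path entirely.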

akrun · Accepted Answer · 2018-03-26T07:09:32.280

Based on the examples, we create a function to do the filtering

library(data.table)
library(dplyr)

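f1 <- function(dat, grp1, grp2) {
     # capture the unquoted column names passed by the caller
     grp1 <- enquo(grp1)
     grp2 <- enquo(grp2)
     dat %>%
        group_by(!! grp1) %>%
        # run id of the condition within each group
        # (in dplyr >= 1.0.0 the argument is spelled .add)
        group_by(grp = rleid(!!(grp2) >= 2.5), add = TRUE) %>%
        # dplyr::filter guards against a masked filter();
        # keep runs that meet the threshold and have more than one row
        dplyr::filter(!!(grp2) >= 2.5, n() > 1) %>%
        ungroup %>%
        select(-grp)
   }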

-example I

f1(df1, gene, val)
# A tibble: 3 x 6
#  signal1 signal2 signal3 signal4 gene    val
#    <int>   <int>   <int>   <int> <chr> <dbl>
#1       1       7      13      19 ABC    2.50
#2       2       8      14      20 ABC    3.20
#3       3       9      15      21 ABC    2.90

-example II

f1(df2, gene, val)
# A tibble: 0 x 6
# ... with 6 variables: signal1 <int>, signal2 <int>, signal3 <int>, signal4 <int>, gene <chr>, val <dbl>

-example III

f1(df3, gene, val)
# A tibble: 4 x 6
#  signal1 signal2 signal3 signal4 gene    val
#    <int>   <int>   <int>   <int> <chr> <dbl>
#1       2       8      14      20 ABC    3.20
#2       3       9      15      21 ABC    2.90
#3       4      10      16      22 DEF    7.10
#4       5      11      17      23 DEF    4.40

-example IV

f1(df4, gene, val)
# A tibble: 7 x 6
# Groups: gene [2]
#  signal1 signal2 signal3 signal4 gene    val
#    <int>   <int>   <int>   <int> <chr> <dbl>
#1       2      12      22      32 ABC    3.20
#2       3      13      23      33 ABC    2.90
#3       4      14      24      34 ABC    7.10
#4       6      16      26      36 DEF    4.10
#5       7      17      27      37 DEF    6.20
#6       9      19      29      39 DEF    3.20
#7      10      20      30      40 DEF   12.1

-example V

f1(df5, gene, val)
# A tibble: 2 x 6
#  signal1 signal2 signal3 signal4 gene    val
#    <int>   <int>   <int>   <int> <chr> <dbl>
#1       6      16      26      36 DEF    4.10
#2       7      17      27      37 DEF    6.20

-example VI

f1(df6, gene, val)
# A tibble: 2 x 6
#  signal1 signal2 signal3 signal4 gene    val
#    <int>   <int>   <int>   <int> <chr> <dbl>
#1       2      12      22      32 ABC    3.20
#2       3      13      23      33 ABC    7.90
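
To see what the run-id grouping does without data.table or dplyr, here is a minimal base R sketch of the same idea. It is not the function from this answer: `keep_streaks`, `threshold`, `min_len` and `dat_demo` are illustrative names, and the data is a reduced copy (three columns only) of the ten-row dataset from the question's EDIT #2.

```r
# Base R sketch: keep rows that belong to a run of at least `min_len`
# consecutive rows, within one group, whose value is >= `threshold`.
keep_streaks <- function(dat, group_col, value_col,
                         threshold = 2.5, min_len = 2) {
  hit <- dat[[value_col]] >= threshold
  # ave() applies FUN to the slice of `hit` belonging to each group,
  # so a run never crosses a group boundary.
  keep <- ave(hit, dat[[group_col]], FUN = function(x) {
    r <- rle(x)                                   # runs of TRUE / FALSE
    rep(r$values & r$lengths >= min_len, r$lengths)
  })
  dat[as.logical(keep), , drop = FALSE]
}

# Reduced copy of the question's last dataset (assumed columns only).
dat_demo <- data.frame(
  signal1 = 1:10,
  gene    = c('ABC','ABC','ABC','DEF','DEF','DEF','GHI','GHI','GHI','JKL'),
  val     = c(0.5, 3.2, 7.9, 8.1, 4.4, 0.1, 6.0, 0.2, 8.2, 12.1)
)

res <- keep_streaks(dat_demo, "gene", "val")
res$signal1   # rows 2, 3, 4 and 5: the ABC and DEF streaks
```

GHI contributes nothing because its two qualifying rows are separated by a row below the threshold, and JKL has a single row, so neither reaches a run length of 2.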

However, will it work when we have several groups, not only `ABC` and `DEF` but more (`GHI`, `JKL` etc.)? — Adamm, Mar 26 '18 at 06:45
@Adamm We are grouping by 'gene', so it shouldn't matter how many levels there are in the 'gene' — akrun, Mar 26 '18 at 06:46
@Adamm Just to avoid any bugs, I created the 'grp' after grouping by 'gene'. — akrun, Mar 26 '18 at 06:53
My apologies, I wasn't clear. The second is correct. For the last dataset, only two rows, because there is a streak of rows only in the case of `ABC` — Adamm, Mar 26 '18 at 07:07
@Adamm If you run the updated code for 'f1', you will get only 2 rows — akrun, Mar 26 '18 at 07:07
If I run your updated function I still get 6 rows instead of two. I think that some package's functions mask the `data.table` and `dplyr` functions. — Adamm, Mar 26 '18 at 07:26
@Adamm what is your dplyr version? I use 0.7.4. Can you run on a fresh session — akrun, Mar 26 '18 at 07:28
