I still have a hurdle with my data. Here is a reproducible df:
signal1 <- c(rep(1:6))
signal2 <- c(rep(7:12))
signal3 <- c(rep(13:18))
signal4 <- c(rep(19:24))
val <- c(2.5,3.2,2.9,0.1,0.4,4.1)
tag <- c('str1','str2','str3','str4','str5','str6')
gene <- c('ABC','ABC','ABC','DEF','DEF','DEF')
df <- data.frame(signal1,signal2,signal3,signal4,gene,val)
signal1 signal2 signal3 signal4 gene val
1 1 7 13 19 ABC 2.5
2 2 8 14 20 ABC 3.2
3 3 9 15 21 ABC 2.9
4 4 10 16 22 DEF 0.1
5 5 11 17 23 DEF 0.4
6 6 12 18 24 DEF 4.1
Example I
I'd like to select rows that form a streak (a series of 2 or more consecutive rows) within each gene group, where every row in the streak has val of at least 2.5. The rows must be adjacent to each other, so the desired output is:
signal1 signal2 signal3 signal4 gene val
1 1 7 13 19 ABC 2.5
2 2 8 14 20 ABC 3.2
3 3 9 15 21 ABC 2.9
Three rows from group ABC fulfil the conditions: the series has length 3, the rows are consecutive, and all of them have val >= 2.5.
Example II
For the dataset:
signal1 signal2 signal3 signal4 gene val
1 1 7 13 19 ABC 2.5
2 2 8 14 20 ABC 0.2
3 3 9 15 21 ABC 2.9
4 4 10 16 22 DEF 0.1
5 5 11 17 23 DEF 0.4
6 6 12 18 24 DEF 4.1
The result is an empty df, because no rows within either group form a streak.
Example III
signal1 signal2 signal3 signal4 gene val
1 1 7 13 19 ABC 0.5
2 2 8 14 20 ABC 3.2
3 3 9 15 21 ABC 2.9
4 4 10 16 22 DEF 7.1
5 5 11 17 23 DEF 4.4
6 6 12 18 24 DEF 2.1
Output:
signal1 signal2 signal3 signal4 gene val
2 2 8 14 20 ABC 3.2
3 3 9 15 21 ABC 2.9
4 4 10 16 22 DEF 7.1
5 5 11 17 23 DEF 4.4
Two streaks of consecutive rows with val >= 2.5, one in each gene group.
Example IV
Let's take a bigger dataset:
signal1 signal2 signal3 signal4 gene val
1 1 11 21 31 ABC 0.5
2 2 12 22 32 ABC 3.2
3 3 13 23 33 ABC 2.9
4 4 14 24 34 ABC 7.1
5 5 15 25 35 ABC 0.4
6 6 16 26 36 DEF 4.1
7 7 17 27 37 DEF 6.2
8 8 18 28 38 DEF 0.2
9 9 19 29 39 DEF 3.2
10 10 20 30 40 DEF 12.1
And the output:
signal1 signal2 signal3 signal4 gene val
2 2 12 22 32 ABC 3.2
3 3 13 23 33 ABC 2.9
4 4 14 24 34 ABC 7.1
6 6 16 26 36 DEF 4.1
7 7 17 27 37 DEF 6.2
9 9 19 29 39 DEF 3.2
10 10 20 30 40 DEF 12.1
I hope you can see what I'm looking for.
I tried to do something with dplyr:
df %>%
group_by(gene) %>%
group_by(val >= 2.5)
And the result for the data from Example II:
# A tibble: 6 x 7
# Groups: FC >= 2.5 [2]
signal1 signal2 signal3 signal4 gene FC `FC >= 2.5`
<int> <int> <int> <int> <fct> <dbl> <lgl>
1 1 7 13 19 ABC 2.50 T
2 2 8 14 20 ABC 2.40 F
3 3 9 15 21 ABC 2.90 T
4 4 10 16 22 DEF 0.100 F
5 5 11 17 23 DEF 0.400 F
6 6 12 18 24 DEF 4.10 T
Now I need to select the rows where T occurs in at least two consecutive rows within a group. In this case, there is no such run...
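To make the rule concrete, here is a minimal base R sketch of what I am after, using run-length encoding within each gene. The helper name keep_streaks and its threshold and min_len parameters are my own placeholders, not from any package, and the data frame holds only the columns the rule needs (Example IV values).

```r
# Example IV data, reduced to the columns the rule needs
df4 <- data.frame(
  signal1 = 1:10,
  gene    = rep(c("ABC", "DEF"), each = 5),
  val     = c(0.5, 3.2, 2.9, 7.1, 0.4, 4.1, 6.2, 0.2, 3.2, 12.1)
)

# keep_streaks() is a hypothetical helper name (an assumption for illustration)
keep_streaks <- function(dat, threshold = 2.5, min_len = 2) {
  above <- dat$val >= threshold
  # Within each gene, mark rows that sit inside a run of TRUEs
  # whose length is at least min_len
  in_streak <- ave(above, dat$gene, FUN = function(flag) {
    runs <- rle(flag)
    rep(runs$values & runs$lengths >= min_len, times = runs$lengths)
  })
  dat[as.logical(in_streak), , drop = FALSE]
}

keep_streaks(df4)
```

Like the grouped dplyr attempt, this treats all rows of one gene as a single sequence, so it assumes the rows of each gene are already adjacent and in the intended order.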
I'll be very grateful for the help.
EDIT:
The answer proposed by akrun does the trick. For the dataset:
signal1 signal2 signal3 signal4 gene val
1 1 11 21 31 ABC 0.5
2 2 12 22 32 ABC 3.2
3 3 13 23 33 ABC 0.9
4 4 14 24 34 ABC 7.1
5 5 15 25 35 ABC 0.4
6 6 16 26 36 DEF 4.1
7 7 17 27 37 DEF 6.2
8 8 18 28 38 DEF 0.2
9 9 19 29 39 DEF 0.2
10 10 20 30 40 DEF 12.1
I'd like to get only the two DEF rows, numbers 6 and 7.
And we get:
# A tibble: 2 x 6
signal1 signal2 signal3 signal4 gene val
<int> <int> <int> <int> <fct> <dbl>
1 6 16 26 36 DEF 4.10
2 7 17 27 37 DEF 6.20
Works great!
EDIT #2:
Unfortunately, I found a small bug.
For the data:
signal1 signal2 signal3 signal4 gene val
1 1 11 21 31 ABC 0.5
2 2 12 22 32 ABC 3.2
3 3 13 23 33 ABC 7.9
4 4 14 24 34 DEF 8.1
5 5 15 25 35 DEF 0.4
6 6 16 26 36 DEF 4.1
7 7 17 27 37 GHI 6.0
8 8 18 28 38 GHI 0.2
9 9 19 29 39 GHI 8.2
10 10 20 30 40 JKL 12.1
Only rows 2 and 3 should be returned, but after calling:
f1(df, gene, val)
we get:
# A tibble: 6 x 6
signal1 signal2 signal3 signal4 gene val
<int> <int> <int> <int> <fct> <dbl>
1 2 12 22 32 ABC 3.20
2 3 13 23 33 ABC 7.90
3 4 14 24 34 DEF 8.10
4 6 16 26 36 DEF 4.10
5 7 17 27 37 GHI 6.00
6 9 19 29 39 GHI 8.20
However, your first code:
df %>%
group_by(gene, grp = rleid(val >= 2.5)) %>%
filter(val >= 2.5, n() > 1) %>%
ungroup %>%
select(-grp)
Returned:
# A tibble: 2 x 6
signal1 signal2 signal3 signal4 gene val
<int> <int> <int> <int> <fct> <dbl>
1 2 12 22 32 ABC 3.20
2 3 13 23 33 ABC 7.90
I thought that tidyverse might have masked some dplyr functions, so I restarted the R session and tried again:
Dataset:
signal1 <- c(rep(1:10))
signal2 <- c(rep(11:20))
signal3 <- c(rep(21:30))
signal4 <- c(rep(31:40))
val <- c(0.5,3.2,7.9,8.1,4.4,0.1,6.0,0.2,8.2,12.1)
tag <- c('str1','str2','str3','str4','str5','str6','str7','str8','str9','str10')
gene <- c('ABC','ABC','ABC','DEF','DEF','DEF','GHI','GHI','GHI','JKL')
df <- data.frame(signal1,signal2,signal3,signal4,gene,val)
df
signal1 signal2 signal3 signal4 gene val
1 1 11 21 31 ABC 0.5
2 2 12 22 32 ABC 3.2
3 3 13 23 33 ABC 7.9
4 4 14 24 34 DEF 8.1
5 5 15 25 35 DEF 4.4
6 6 16 26 36 DEF 0.1
7 7 17 27 37 GHI 6.0
8 8 18 28 38 GHI 0.2
9 9 19 29 39 GHI 8.2
10 10 20 30 40 JKL 12.1
Result obtained with:
df %>%
group_by(gene, grp = rleid(val >= 2.5)) %>%
filter(val >= 2.5, n() > 1) %>%
ungroup %>%
select(-grp)
This result is correct:
# A tibble: 4 x 6
signal1 signal2 signal3 signal4 gene val
<int> <int> <int> <int> <fct> <dbl>
1 2 12 22 32 ABC 3.20
2 3 13 23 33 ABC 7.90
3 4 14 24 34 DEF 8.10
4 5 15 25 35 DEF 4.40
Result obtained with the function:
f1 <- function(dat, grp1, grp2) {
grp1 <- dplyr::enquo(grp1)
grp2 <- dplyr::enquo(grp2)
dat %>%
dplyr::group_by(!! grp1) %>%
dplyr::group_by(grp = data.table::rleid(!!(grp2) >= 2.5), add = TRUE) %>%
dplyr::filter(val >= 2.5, n() > 1) %>%
ungroup %>%
dplyr::select(-grp)
}
# A tibble: 6 x 6
signal1 signal2 signal3 signal4 gene val
<int> <int> <int> <int> <fct> <dbl>
1 2 12 22 32 ABC 3.20
2 3 13 23 33 ABC 7.90
3 4 14 24 34 DEF 8.10
4 5 15 25 35 DEF 4.40
5 7 17 27 37 GHI 6.00
6 9 19 29 39 GHI 8.20
Unfortunately, this isn't correct: there is no streak of consecutive rows in GHI
...
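Comparing the two versions, f1 hard-codes val inside filter() and unquotes grp2 inside a comparison, while the inline code does neither. Below is a sketch of a variant that passes the column through consistently. It assumes dplyr >= 1.0 (for the embrace syntax and the .add argument); the name f2, the threshold argument and the temporary columns .above and .run are my own placeholders. It computes run ids with cumsum() instead of data.table::rleid(), and I have not confirmed that these differences are what caused the GHI rows to appear.

```r
library(dplyr)

# Same values as the dataset above, reduced to the columns the rule needs
df <- data.frame(
  signal1 = 1:10,
  gene    = c("ABC", "ABC", "ABC", "DEF", "DEF", "DEF", "GHI", "GHI", "GHI", "JKL"),
  val     = c(0.5, 3.2, 7.9, 8.1, 4.4, 0.1, 6.0, 0.2, 8.2, 12.1)
)

# f2() is a hypothetical rewrite of f1(), not akrun's original code
f2 <- function(dat, grp_col, val_col, threshold = 2.5) {
  dat %>%
    group_by({{ grp_col }}) %>%
    mutate(
      # flag rows at or above the threshold
      .above = {{ val_col }} >= threshold,
      # run id within each group: increases whenever the flag changes
      .run = cumsum(.above != dplyr::lag(.above, default = dplyr::first(.above)))
    ) %>%
    group_by(.run, .add = TRUE) %>%
    # keep only runs of flagged rows that contain more than one row
    filter(.above, n() > 1) %>%
    ungroup() %>%
    select(-.above, -.run)
}

res <- f2(df, gene, val)
res
```

On this dataset the sketch should keep rows 2 and 3 (ABC) and rows 4 and 5 (DEF), and drop GHI and JKL.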