1

Currently I'm working with some Fastq sequencing data. I have a dataframe with three columns and hundreds of rows. The first column contains the raw sequencing reads and the others contain information about those reads. I want to return a row with the string "FALSE" in the 3rd column, plus the row directly above this, and two rows directly below it. I think it is similar to grep -A -B in shell.

I've looked around and my question is very similar to this one:

Returning above and below rows of specific rows in r dataframe

However, the answers here are based on row-names and not strings within the rows. My row names are just numbers in numerical order.

    Fastq Output    BARCODE     Dulplicated
1   ReadName1       NA          NA
2   ReadSeq1        TGTG TTAT   FALSE
3   +               NA          NA
4   Ascii_score1    NA          NA
5   ReadName2       NA          NA
6   ReadSeq2        TGCT TTAT   FALSE
7   +               NA          NA
8   Ascii_score2    NA          NA
9   ReadName3       NA          NA
10  ReadSeq3        TGCT TTAT   TRUE
11  +               NA          NA
12  Ascii_score3    NA          NA

2 Answers2

3

If the duplicated column has character values. You can do

inds <- which(df$Dulplicated == "FALSE")
df[sort(unique(c(inds, inds - 1, inds + 1, inds + 2))), ]

#   FastqOutput  BARCODE Dulplicated
#1    ReadName1     <NA>          NA
#2     ReadSeq1 TGTGTTAT       FALSE
#3            +     <NA>          NA
#4 Ascii_score1     <NA>          NA
#5    ReadName2     <NA>          NA
#6     ReadSeq2 TGCTTTAT       FALSE
#7            +     <NA>          NA
#8 Ascii_score2     <NA>          NA

Or similarly using dplyr::slice

library(dplyr)
df %>% slice(sort(unique(c(inds, inds - 1, inds + 1, inds + 2))))

data

df <- structure(list(FastqOutput = structure(c(5L, 8L, 1L, 2L, 6L, 
9L, 1L, 3L, 7L, 10L, 1L, 4L), .Label = c("+", "Ascii_score1", 
"Ascii_score2", "Ascii_score3", "ReadName1", "ReadName2", "ReadName3", 
"ReadSeq1", "ReadSeq2", "ReadSeq3"), class = "factor"), BARCODE = 
structure(c(NA, 2L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA), .Label = c("TGCTTTAT", 
"TGTGTTAT"), class = "factor"), Dulplicated = c(NA, FALSE, NA, 
NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)), class = "data.frame", 
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • 1
    Worth noting, for generality, you could do something like `offsets = -1:2`, `sort(unique(c(outer(inds, offsets, "+"))))`, in case there are more than a few offsets and you don't want to type out `-1`, `+1`, `+2`, etc. – Gregor Thomas Jul 29 '19 at 03:08
1

We can use data.table

library(data.table)
setDT(df)[df[, {i1 <-.I[which(!as.logical(Dulplicated))]
             sort(unique(i1+ rep((-2:2), length(i1)))) }]]
#    FastqOutput  BARCODE Dulplicated
#1:    ReadName1     <NA>          NA
#2:     ReadSeq1 TGTGTTAT       FALSE
#3:            +     <NA>          NA
#4: Ascii_score1     <NA>          NA
#5:    ReadName2     <NA>          NA
#6:     ReadSeq2 TGCTTTAT       FALSE
#7:            +     <NA>          NA
#8: Ascii_score2     <NA>          NA

Or it can bee written more compactly

setDT(df)[df[, Reduce(`|`, shift(!as.logical(Dulplicated), n = -2:2))]]

data

df <- structure(list(FastqOutput = structure(c(5L, 8L, 1L, 2L, 6L, 
9L, 1L, 3L, 7L, 10L, 1L, 4L), .Label = c("+", "Ascii_score1", 
"Ascii_score2", "Ascii_score3", "ReadName1", "ReadName2", "ReadName3", 
"ReadSeq1", "ReadSeq2", "ReadSeq3"), class = "factor"), BARCODE = 
structure(c(NA, 2L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA), .Label = c("TGCTTTAT", 
"TGTGTTAT"), class = "factor"), Dulplicated = c(NA, FALSE, NA, 
NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)), class = "data.frame", 
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
akrun
  • 874,273
  • 37
  • 540
  • 662