-3

I would like to extract only reads that have a coverage above 2 and length above 504. This is all stored in each header of FASTQ file. However I can't workout a one-liner that would filter based on these qualities. See an example of what two of the lines of the input FASTQ looks like.

Thank you for your help.

>NODE_303303_length_504_cov_30.000000
CAGGATGTTGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTCGATCTCGT
>NODE_303603_length_56_cov_1.000000
CAGGATGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTCGATCTCGT
  • 1
    Look for lines that start with ">" and, using underscore as a field separator, the 4th and 6th field have the values you want to examine. When a line matches, print it then get and print the next line. – glenn jackman Oct 25 '19 at 14:06
  • Is the header on a line by itself, followed by one or more lines of payload? – gboffi Oct 25 '19 at 15:08

1 Answers1

2

It is recommended that you provide an input file and an output file to express more clearly what you are trying to accomplish. Also, include any code that you have attempted.

Let me give it a go:

Let's assume that each input line looks like this:

>NODE_<node>_length_<length>_cov_<cov> <data>
<data1>
<data2>...
>NODE_<node>_length_<length>_cov_<cov> <data>

We can then parse the input, using the underscores and spaces as field separators. Here's an awk program that may work for you:

awk -F'[_ ]' '
  $1 == ">NODE" { p = 0 } 
  $1 == ">NODE" && $4 > 504 && $6 > 2 { p=1 } 
  p == 1 { print } 
' FASTQ_file

Using your example as input, there is no output. But here's another sample input file:

>NODE_303603_length_560_cov_2.000000 CAGGATGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTCGATCTCGT
  more data - don't expect to see this output
>NODE_303603_length_505_cov_2.000000 CAGGATGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTCGATCTCGT
  more data - don't expect to see this output
>NODE_303603_length_505_cov_2.000001 CAGGATGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTCGATCTCGT
  more data
  this is the data we expect to see
>NODE_303303_length_504_cov_30.000000 CAGGATGTTGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTCGATCTCGT
  more data - don't expect to see this output

And here's the output when we put it all together:

 awk -F'[_ ]' '
  $1 == ">NODE" { p = 0 } 
  $1 == ">NODE" && $4 > 504 && $6 > 2 { p=1 } 
  p == 1 { print } 
' FASTQ_file

>NODE_303603_length_505_cov_2.000001 CAGGATGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTCGATCTCGT
  more data
  this is the data we expect to see
Mark
  • 4,249
  • 1
  • 18
  • 27
  • Thank you, sorry I wasn't clear with the input file but, if its format is. >NODE__length__cov_ . Or the data can be the following lines before the next header. – bio_bob_blob Oct 25 '19 at 14:28
  • Ok @bio_bob_blob, I updated the code to deal with data after the header line. – Mark Oct 25 '19 at 16:15