I would like to use Bash to apply the same edit to many similar lines of a Fastq file.

In Fastq files, a sequence read appears on line 2 and then on every fourth line after that (i.e. lines 2, 6, 10, 14, ...).

I would like to create an edited text file that is identical to a Fastq file except the first 6 characters of the sequencing reads are trimmed off.

Unedited Fastq:

@M03017:21:000000000
GAGAGATCTCTCTCTCTCTCT
+
111>>B1FDFFF

Edited Fastq:

@M03017:21:000000000
TCTCTCTCTCTCTCT
+
111>>B1FDFFF
user207421
The Nightman

2 Answers


I guess awk is perfect for this:

$ awk 'NR%4==2 {gsub(/^.{6}/,"")} 1' file
@M03017:21:000000000
TCTCTCTCTCTCTCT
+
111>>B1FDFFF

This removes the first 6 characters from every line at position 4k+2 (lines 2, 6, 10, ...).

Explanation

  • NR%4==2 {} runs the action only when the record number (line number) has the form 4k+2.
  • gsub(/^.{6}/,"") replaces the first 6 characters of the line with an empty string.
  • 1 is a condition that always evaluates to true, so awk performs its default action and prints the line.
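
Some awk implementations do not honor interval expressions such as {6}, which is one reason a regex-based trim can silently do nothing. A minimal sketch that avoids the regex by using substr instead; the file name sample.fastq is only an illustrative assumption, and the sample data is copied from the question:

```shell
# Build a small sample file matching the question's example
# (sample.fastq is a hypothetical name used only for this sketch).
printf '%s\n' '@M03017:21:000000000' 'GAGAGATCTCTCTCTCTCTCT' '+' '111>>B1FDFFF' > sample.fastq

# On every 4k+2 line, keep everything from the 7th character onward;
# the trailing 1 prints every line, modified or not.
trimmed=$(awk 'NR % 4 == 2 { $0 = substr($0, 7) } 1' sample.fastq)
printf '%s\n' "$trimmed"
```

Because substr works on character positions, it does not depend on which regex features the installed awk supports.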
fedorqui
  • Strange, that code makes sense but when I use the above line of code and then pipe to more or output to a file I get the exact text that I started with, no trimming or errors. Any ideas why? – The Nightman Feb 16 '15 at 16:09
  • FWIW I just piped it to a new file and it is the same as to `stdout`. – user3439894 Feb 16 '15 at 16:15
  • Strange. Try debugging with something like `awk 'NR%4==2' file` and check whether it prints the lines you want to change. Maybe you have a header... – fedorqui Feb 16 '15 at 16:24
  • Running that does point to the correct lines of the file it appears. – The Nightman Feb 16 '15 at 16:26
  • @TheNightman you could try replacing `^.{6}` with `^......` in case the interval expression is not understood by your `awk`. – fedorqui Feb 16 '15 at 16:28
  • That seems to do it. Using `^......` and then outputting to a file gives me what I'm looking for. Thanks. – The Nightman Feb 16 '15 at 16:34
  • @TheNightman nice to read that! Also, mind your finger, you are accepting and unaccepting continuously :) http://stackoverflow.com/posts/28545286/timeline – fedorqui Feb 16 '15 at 16:40
  • Yeah, I didn't know that two answers couldn't be selected. I guess if I vote the answer as useful, the person gets the credit regardless of whether or not I mark it as 'the' answer? – The Nightman Feb 16 '15 at 17:45
  • @TheNightman Yep! You can vote, accept or both. When I ask, I tend to mark as accepted the answer that helped me the most and upvote those that I consider useful and well explained (this normally includes the accepted one). The thing about the accepted answer is that it is likely to get way more attention (and hence upvotes) from future visitors, since it shows first in the list. – fedorqui Feb 17 '15 at 10:17

GNU sed can do that:

sed -i~ '2~4s/^.\{6\}//' file

The address 2~4 is a GNU extension that means "start on line 2, then match every 4th line after it".

s is the substitute command, ^ matches the beginning of the line, . matches any character, and \{6\} is a quantifier that repeats the preceding . exactly 6 times. The replacement string is empty (//).

-i~ replaces the file in place, leaving a backup with the ~ appended to the filename.
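
To see the in-place edit and the backup side by side, here is a small sketch run in a scratch directory. It assumes GNU sed (the 2~4 address and the -i~ suffix form are not available in every sed), and the file name sample.fastq is a hypothetical placeholder:

```shell
# Work in a scratch directory so no real data is overwritten.
workdir=$(mktemp -d)
printf '%s\n' '@M03017:21:000000000' 'GAGAGATCTCTCTCTCTCTCT' '+' '111>>B1FDFFF' > "$workdir/sample.fastq"

# GNU sed only: edit in place, keeping the original as sample.fastq~
sed -i~ '2~4s/^.\{6\}//' "$workdir/sample.fastq"

# Compare the read line in the edited file and in the backup.
edited_read=$(sed -n '2p' "$workdir/sample.fastq")
backup_read=$(sed -n '2p' "$workdir/sample.fastq~")
printf 'edited: %s\nbackup: %s\n' "$edited_read" "$backup_read"
```

Dropping -i~ makes sed write the result to standard output instead, which is a safer way to preview the change before touching the file.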

choroba