I am trying to convert a FASTQ file into a tab-delimited file using Python 3. Here is the input (lines 1-4 form one record, which I need to print in tab-separated format). I am trying to read each record into a list object:

@SEQ_ID
GATTTGGGGTT
+
!''*((((***
@SEQ_ID
GATTTGGGGTT
+
!''*((((***

using this:

data = open('sample3.fq')
fq_record = data.read().replace('@', ',@').split(',')
for item in fq_record:
        print(item.replace('\n', '\t').split('\t'))

Output is:

['']
['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '']
['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***", '', '']

I am getting a blank line at the beginning of the output, and I do not understand why. I am aware that this can be done in many other ways, but I need to figure out the reason, as I am learning Python. Thanks.

Leandro Papasidero

3 Answers

When you replace @ with ,@, you put a comma at the beginning of the string (since it starts with @). Then when you split on commas, there is nothing before the first comma, so this gives you an empty string in the split. What happens is basically like this:

>>> print(',x'.split(','))
['', 'x']

If you know your data always begins with @, you can just skip the empty record in your loop. Just do for item in fq_record[1:].
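As a minimal sketch of the above, assuming the file contents are held in a string (the `text` variable is a stand-in for the file and is not part of the original code), the leading empty string and the `[1:]` slice can be seen like this:

```python
# Sample text standing in for the contents of the FASTQ file.
text = "@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***\n@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***\n"

# The text starts with '@', so after the replacement it starts with ','.
marked = text.replace('@', ',@')
fq_record = marked.split(',')

# The first element is the empty string that sat before the leading comma.
print(repr(fq_record[0]))   # ''
print(len(fq_record))       # 3: one empty string plus two records

# Skipping index 0 leaves only the real records.
for item in fq_record[1:]:
    print(item.split('\n')[:4])
```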

BrenBarn
  • A simpler way is to just leave the empty string there and skip it in your loop. See my edited answer. – BrenBarn Jun 08 '13 at 19:26
  • Even better is to skip empty strings, wherever they are, by filtering your results: `fq_record = [ x for x in fq_record if x ]`. That way you don't risk throwing away a non-empty string by mistake. – alexis Jun 08 '13 at 22:49
  • @alexis: That particular approach won't work here, as the record isn't empty: it's a list containing an empty string. You could check for that, but if the data can be large it'd probably be better to check inside the loop, to avoid looping over the data repeatedly. – BrenBarn Jun 08 '13 at 23:06
  • Oops, I misunderstood the format of `fq_record`. Yes, the comprehension I provided must be applied to the list generated by `split()`, which will contain some empty elements. – alexis Jun 08 '13 at 23:49
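The two filtering options discussed in these comments might look roughly like the sketch below. The `text` string stands in for the file contents, and the names `non_empty` and `records` are illustrative only:

```python
text = "@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***\n@SEQ_ID\nGATTTGGGGTT\n+\n!''*((((***\n"
fq_record = text.replace('@', ',@').split(',')

# Option 1: filter the strings returned by split() before looping.
non_empty = [x for x in fq_record if x]

# Option 2: test inside the loop, so the data is traversed only once.
records = []
for item in fq_record:
    if not item:          # skip empty strings wherever they occur
        continue
    records.append(item.rstrip('\n').split('\n'))

print(len(non_empty))
print(records)
```

Option 2 avoids building a second list, which may matter for large files.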

You can also go line-by-line without all the replacing:

import io

fobj = io.StringIO("""@SEQ_ID
GATTTGGGGTT
+
!''*((((***
@SEQ_ID
GATTTGGGGTT
+
!''*((((***""")

data = []
entry = []
for raw_line in fobj:
    line = raw_line.strip()
    if line.startswith('@'):
        if entry:
            data.append(entry)
        entry = []
    entry.append(line)
data.append(entry)

data looks like this:

[['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"],
 ['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"]]
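To get from this list of records to the tab-delimited output the question asks for, one possible follow-up step is to join each entry with tabs. This is only a sketch: `data` is written out literally here for the sample input so the snippet runs on its own, and `lines` is an illustrative name.

```python
# `data` as the loop would build it for the sample input, written out literally.
data = [['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"],
        ['@SEQ_ID', 'GATTTGGGGTT', '+', "!''*((((***"]]

# One record per line, fields separated by tabs.
lines = ['\t'.join(entry) for entry in data]
for line in lines:
    print(line)
```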
Mike Müller

Thank you all for your answers. As a beginner, my main problem was the blank entry produced by .split(','), which I now understand conceptually. So here is my first useful program in Python:

# this script converts a .fastq file into .fasta format

import sys 
# Usage statement:
print('\nUsage: fq2fasta.py input-file output-file\n=========================================\n\n')

# define a function for FASTA formatting
def format_fasta(name, sequence):
    fasta_string = '>' + name + '\n' + sequence + '\n'
    return fasta_string

# open the file for reading
data = open(sys.argv[1])
# open the file for writing
fasta = open(sys.argv[2], 'wt')
# read all FASTQ records into a list
fq_records = data.read().replace('@', ',@').split(',')
data.close()

# iterate through the records
for item in fq_records[1:]:  # skip the empty first element produced by .split(',')
    line = item.replace('\n', '\t').split('\t')
    name = line[0][1:]  # drop the leading '@' so the FASTA header reads '>SEQ_ID'
    sequence = line[1]
    fasta.write(format_fasta(name, sequence))
fasta.close()

The other suggestions in the answers will become clearer to me as I learn more. Thanks again.
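One caveat with splitting on '@' and ',': in Phred+33 encoded FASTQ files both characters can also appear in the quality line, so real data may be split in the wrong places. A rough alternative sketch reads the file four lines at a time instead. It assumes every record occupies exactly four lines (no wrapped sequences), and `fastq_to_fasta` is a hypothetical helper name, not something from the answers above:

```python
import io

def fastq_to_fasta(fq_lines):
    """Yield FASTA text for each four-line FASTQ record (hypothetical helper)."""
    # Drop line endings and ignore blank lines such as a trailing empty line.
    lines = [ln.rstrip('\n') for ln in fq_lines if ln.strip()]
    # Step through the lines four at a time: header, sequence, '+', quality.
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        # header[1:] drops the leading '@' of the FASTQ header.
        yield '>' + header[1:] + '\n' + sequence + '\n'

# The first quality line begins with '@' and contains ','.
sample = io.StringIO("@SEQ_1\nGATTT\n+\n@,!!*\n@SEQ_2\nCCGGA\n+\n!''*(\n")
fasta_text = ''.join(fastq_to_fasta(sample))
print(fasta_text, end='')
```

Because records are located by position rather than by character, the '@' and ',' in the quality line do not affect the result.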