
The sample file looks like this:

 ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
  '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
  '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
  '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n',
  '$$$\n', '\n',
  '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
  '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
  '>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n',
  '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
  '>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n',
  '>B42\n', 'TT-GTGGGTATC\n']

The $$$ line separates the two sets. I need to use .strip() to remove the \n from each entry and drop all the "headers" (the lines starting with >).

I need to make a single flat list (as below) and replace "-" with "Z":

  [ 'TCCGGGGGTATC','TCCGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC',
    'TCCGTGGGTATC','TCCGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC',
    'ATCGGGGGTATT','TT-GTGGGAATC','TTCGTGGGAATC', 'TT-GTGGGTATC',
    'TTCGTGGGTATT','TTCGGGGGTATC','TT-GTGGGTATC', 'TTCGGGGGAATC',
    'TTCGGGGGTATC','TTCGGGGGTATC','TT-GTGGGTATC']

Here is a link to an answer to a similar question (https://stackoverflow.com/a/39965048/6820344). I tried to modify that code to get the output shown above, but I cannot get rid of the "$$$" entry. Also, I need a single flat list, not a list of lists.

seq_list = []
for x in lst:
    if x.startswith('>'):
        seq_list.append([])
        continue
    x = x.strip()
    if x:
        seq_list[-1].append(x.replace("-", "Z"))
print(seq_list)
Rspacer
  • You say you need a list of lists "(as below)" but how is the below example a list of lists? – dkasak Oct 10 '16 at 22:43
  • Your expected output is a single list,not a list of lists. Does ```$$$``` separate the lists? – wwii Oct 10 '16 at 22:43
  • Yes, $$$ separates the list. But I just want a single list with all the elements. – Rspacer Oct 10 '16 at 22:44
  • @dkasak : Thanks for pointing out the error. I have corrected the same – Rspacer Oct 10 '16 at 22:45
  • Iterate over the strings in the original list, if a string starts with ```$``` or starts with ```>``` or is empty after being stripped then [continue](https://docs.python.org/3/reference/simple_stmts.html#the-continue-statement) without doing anything, otherwise append the stripped string to your *final* list. – wwii Oct 10 '16 at 22:49
  • @Biotechgeek, this looks like a fasta file (or something similar). Is that correct? – wflynny Oct 10 '16 at 22:53
  • I would use BioPython to parse your fasta files http://stackoverflow.com/questions/31265282/how-to-randomly-extract-fasta-sequences-using-python/31265485#31265485 – Padraic Cunningham Oct 10 '16 at 23:00

1 Answer

input = ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
        '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
        '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
        '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n', '\n',
        '$$$\n', '\n',
        '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
        '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
        '>B5\n', 'TTCGTGGGTATT\n', '>B6\n', 'TTCGGGGGTATC\n',
        '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
        '>B9\n', 'TTCGGGGGTATC\n', '>B10\n', 'TTCGGGGGTATC\n',
        '>B42\n', 'TT-GTGGGTATC\n']

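from pprint import pprint

output = []

for elem in input:
    # Skip header lines ('>...'), the '$$$' separator and blank lines.
    if elem.startswith(('>', '$')) or elem.isspace():
        continue

    # Replace gap characters with 'Z' and drop the trailing newline.
    output.append(elem.replace('-', 'Z').strip())

pprint(output, compact=True)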

Running the code above produces the following output:

['TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC', 'TCCGTGGGTATC',
 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC', 'ATCGGGGGTATT', 'TTZGTGGGAATC',
 'TTCGTGGGAATC', 'TTZGTGGGTATC', 'TTCGTGGGTATT', 'TTCGGGGGTATC', 'TTZGTGGGTATC',
 'TTCGGGGGAATC', 'TTCGGGGGTATC', 'TTCGGGGGTATC', 'TTZGTGGGTATC']
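
If the lines come straight from a file rather than from a list already in memory, the same filtering can be applied while iterating over the file. This is only a minimal sketch, assuming a plain-text file in the format shown in the question; the function name `read_sequences` and the small demo file are hypothetical and not part of the original code:

```python
import os
import tempfile


def read_sequences(path):
    """Return all sequence lines from `path` as one flat list."""
    sequences = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, '>' headers and the '$$$' separator.
            if not line or line.startswith(('>', '$')):
                continue
            sequences.append(line.replace('-', 'Z'))
    return sequences


# Demo with a small temporary file (contents are made up for illustration).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('>1\nTCCGGGGGTATC\n\n$$$\n\n>B2\nTT-GTGGGAATC\n')

result = read_sequences(tmp.name)
os.remove(tmp.name)
print(result)
```

Stripping each line before testing it means an empty string and a whitespace-only line are handled the same way.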
dkasak
  • More succinctly, something like this works too: `filter(None, [x.strip().replace('-', 'Z') for x in input if not x[0] in '$>'])`. – wflynny Oct 10 '16 at 23:00
  • Indeed, that's a very nice succinct version. I've thought about using something similar but decided against it in the name of legibility and less magic. – dkasak Oct 10 '16 at 23:42
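
One caveat about the one-liner in the comment above: on Python 3, `filter` returns a lazy iterator rather than a list, and `x[0]` would raise `IndexError` on an empty string. A hedged variant as a single list comprehension might look like the following, assuming Python 3 and the same line format; the name `lines` and the shortened input are illustrative only:

```python
# Illustrative input: a shortened version of the list from the question.
lines = ['>1\n', 'TCCGGGGGTATC\n', '\n', '$$$\n', '\n',
         '>B2\n', 'TT-GTGGGAATC\n']

# Strip each line once, skip blanks, '>' headers and the '$$$' separator,
# then replace gap characters with 'Z'.
sequences = [
    s.replace('-', 'Z')
    for s in (line.strip() for line in lines)
    if s and not s.startswith(('>', '$'))
]

print(sequences)  # ['TCCGGGGGTATC', 'TTZGTGGGAATC']
```

Passing a tuple to `str.startswith` checks both prefixes in one call, which avoids indexing into the string at all.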