
I have a FASTA file that looks like this:

click for image

I want this:

sequence1: ATGCACCGT
sequence2: GACCTAGCA

as a result.

How can I do it?

Edit: I'll try to reformulate it.

I have a (FASTA) file with multiple rows. Some rows have a special character (>) as their first character. I don't need these rows, but they show where the first sequence ends and where the next one begins.

I'd like to read the file into two separate strings: the first string is the first sequence, and the second string is the second sequence.

But I don't know how to tell PyCharm that I want to read until the > sign, then put the rest into another string until the next > sign, and so on.

  • what do you mean "it looks like this"? are you trying to turn a jpg into ASCII data? And why do you want to use pycharm? Please explain a bit more about what you're trying to do. – Shep Oct 07 '15 at 22:38
  • no, first I wrote it down simply, but stackoverflow doesn't like enters :"( i just wanted to make it visible that it's multiple lines. it's a fasta file, and it starts with this sign: >. the format it is given in is: >information line, enter, sequence line, enter, >information line, enter, sequence line – AmlesLausiv Oct 07 '15 at 22:43
  • click the question mark when you're editing, you just have to format your data as code. You're probably getting downvoted because including code as images is one of the pet peeves of a lot of people on Stack Overflow. Also, there's no need to include "edit:" followed by more explanation: just clarify your original post. Unfortunately, when your post gets a lot of downvotes it also gets ranked lower on the questions page, which means you're unlikely to get useful answers. – Shep Oct 08 '15 at 06:22
  • ok, thanks, unfortunately I barely have time for anything, so I just didn't want to waste more time on learning this site's text formatting codes when it is understandable as a picture too :/ – AmlesLausiv Oct 08 '15 at 13:24
  • OK, well Stack Overflow can potentially save you a lot of time, so I don't think learning a bit of markdown to get better answers is a waste! – Shep Oct 08 '15 at 16:19

2 Answers

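with open('data', 'r') as f:
    # Read every line, dropping surrounding whitespace and the newline.
    s = [x.strip() for x in f]

# Assumes header and sequence lines strictly alternate, starting with a header.
# Even indices hold the '>' header lines; relabel them sequence1, sequence2, ...
for i in range(0, len(s), 2):
    s[i] = 'sequence' + str(i // 2 + 1)

print(s)

['sequence1', 'ATGCACCGT', 'sequence2', 'GACCTAGCA']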
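
The question asks for the sequences as two separate strings rather than one list with labels. A minimal sketch of that, assuming every sequence sits on a single line as in the question: skip the lines that start with ">" and keep the rest. The helper name `read_sequences` and the header text in the sample data are made up for illustration; `io.StringIO` stands in for the real file.

```python
import io


def read_sequences(handle):
    """Return the non-header, non-empty lines of a FASTA-like file."""
    return [
        line.strip()
        for line in handle
        if line.strip() and not line.startswith(">")
    ]


# Hypothetical sample data standing in for the file from the question.
sample = io.StringIO(">info 1\nATGCACCGT\n>info 2\nGACCTAGCA\n")

# Unpacking works here only because the sample holds exactly two sequences.
first, second = read_sequences(sample)
print("sequence1:", first)
print("sequence2:", second)
```

With a real file, `open('data')` could be passed in place of the `io.StringIO` object. If the file may hold more than two sequences, keeping the list is safer than unpacking into a fixed number of names.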
LetzerWille

I looked at the FASTA spec on Wikipedia. Looks like long sequences can span multiple lines. In that case, I assume you would want the lines concatenated. It also says that the informational lines start with a ">" but could also start with a ";". Assuming that the file is small enough to be read entirely into memory, I came up with the following using regular expressions:

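import re

# A header line starting with ">" or ";", then everything up to the next header.
regex = re.compile(r"[;>](?P<description>[^\n]*)\n(?P<sequence>[^;>]+)")

with open("datafile.txt", "r") as f:
    # findall returns one (description, sequence) tuple per record.
    sequences = regex.findall(f.read())

# Number the records from 1 and join multi-line sequences into one string.
for i, (description, sequence) in enumerate(sequences, start=1):
    print("sequence%d: %s" % (i, sequence.replace("\n", "")))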
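
If the file is too large to read into memory at once, the same idea can be sketched line by line. This is only an illustration under the same assumptions as above (header lines start with ">" or ";", and the lines of a multi-line sequence are concatenated); the generator name `iter_fasta` and the sample data are hypothetical.

```python
import io


def iter_fasta(handle):
    """Yield (description, sequence) pairs, reading one line at a time."""
    description, parts = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line[0] in ">;":
            # A new header closes the previous record, if it had any sequence.
            if parts:
                yield description, "".join(parts)
            description, parts = line[1:], []
        else:
            parts.append(line)
    # Emit the last record once the input is exhausted.
    if parts:
        yield description, "".join(parts)


# Hypothetical sample data; the first sequence spans two lines.
sample = io.StringIO(">seq one\nATGCA\nCCGT\n>seq two\nGACCTAGCA\n")
for i, (description, sequence) in enumerate(iter_fasta(sample), start=1):
    print("sequence%d: %s" % (i, sequence))
```

Because the generator holds only one record at a time, memory use depends on the longest single sequence rather than on the size of the whole file.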
RobertB