0

Thank you for your previous advices,

I have another regex problem:

now I have a list with this pattern:

*7  3   279 0
*33 2   254 0.0233918128654971
*39 2   276 0.027431421446384

and a file with DNA sequencing in Fasta format:

EDIT reformated lines

>OCTU1
GCTTGTCTCAAAGATTAAGCCATGCATGTATAAGCACAAGCCTAAAATGGTGAAGCCGCGAATAGCTCATTACAACAGTCGTAGTTTATTGGAAAGTTCACTATGGATAACTGTGGTAATTCTAGAGCTAATACATGTTCCAATCCTCGACTCACGGAGAGGTGCATTTATTAGAACAAAGCTGATCAGACTATGTCTGTCTCAGGTTGACTCTGAATAACTTTGCTAATCGCACAGTCTTTGTACTGGCGATGTATCTTTCATGCTATGTA
>OCTU2
GCTGCTTCCTTGGATGTGGTAGCCGTTTCTCAGGCTCCCTCTCCGGAATCGAACCCTATTCCCCGTTACCCGTTCAACCATGGTAGGCCCTACTACCATCAAAGTTGATAGGGCAGATATTTGAAAGACATCGCCGCACAAAGGCTATGCGATTAGCAAAGTTATTAGATCAACGACGCAGCGATCGGCTTTGACTAATAAATCACCCCTCCAGTTGGGGACTTTTACATGTATTAGCTCTAGAATTACCACAGTTATCCATTAGTGAAGTACCTTCCAATAAACTATACTGTTTAATGAGCCATTCGCGGTTTCACCGTAAAATTAGGTTGTCTTAGACATGCATGGCTTAATCTTTGTAGACAAGC

I'd need to find the numbers in the list with * (e.g., 7 or 33) in the Fasta file (e.g., >OCTU7 and >OCTU33) and copy in another file only the Fasta sequences that are present in the list, this is my script:

regex=re.compile(r'.+\d+\s+')
OCTU=b.readlines()
while OCTU:
    for line in a:
        if regex.match(OCTU)==line:
              c.write(OCTU)

The scripts seems to work but I think the pattern is not correct because the file created is empty.

Thank you in advance for your precious advices.

sotapme
  • 4,695
  • 2
  • 19
  • 20
  • Hey, here are a few tips: you mention two files... which are those in the code snippet you show? The regex you want to use is probably `r'\*(\d+)\s+'` (the more specific the better). – dsign Feb 18 '13 at 15:42
  • Is the Fasta file in order ? ``>OCTO1 ... >OCTnn``, so that one could get the numbers list and then get the ``n'th * 2`` line ?. I made an edit to your Q as it wasn't showing ``>`` as that's special in the markup. – sotapme Feb 18 '13 at 16:36
  • thank you for your advices, for dsign, the two files are "a" and "b" (b for fasta, a is the list), I know that the pattern should be as specific as possible, but in the fasta file I have not * symbol..... For sotapme yes the Fasta file is >OCTU1\nACGTTCCAT.....\n>OCTU2\nGCTACCT\n....I did not realize that in the text it was not correctly written....sorry – user2072622 Feb 19 '13 at 08:43

3 Answers3

1

You could first collect the id numbers from file a to a set for fast lookup later:

seta = set()
regexa = re.compile(r'\*(\d+)') #matches asterisk followed by digits, captures digits
for line in a:
    m = regexa.match(line)      #looks for match at start of line
    if m:
        seta.add(m.group(1))

Then loop over b. Use b.next() inside the loop to get the second line where the sequence is.

regexb = re.compile(r'>OCTU(\d+)')  #matches ">OCTU" followed by digits, captures digits
for line in b:
    m = regexb.match(line)
    if m:
        sequence = b.next() 
        if m.group(1) in seta:
            c.write(line)
            c.write(sequence)
Janne Karila
  • 24,266
  • 6
  • 53
  • 94
  • Thank you, this is a great solution! In this way I get a .txt file with all the sequences, but without the ID that is >OCTU(\d+). – user2072622 Feb 22 '13 at 14:55
  • I have tried several solutions to get in the same file all the information, tere is a way to write both m and sequence? – user2072622 Feb 22 '13 at 14:56
  • @user2072622 The >OCTU line is in the variable `line`, simply write that out. See edit. – Janne Karila Feb 22 '13 at 17:38
  • thank you for precious suggestions, the script works, but only if I process the two 'for' iteration separately, when I try to iterate the two steps sequentially the first 'for' cycle write out just the first OTU. I cannot find out what is the problem – user2072622 Mar 04 '13 at 10:37
  • @user2072622 The idea is exactly to first loop over all of `a`, then loop over `b`. That way you only need to read each file once. – Janne Karila Mar 04 '13 at 10:52
0

You may want to use Biopython to parse the fasta file.

Then you can slice out the number and look it up in your list and access the sequence and sequence name more reliably...If a fasta file has line wrapping the above method may run into problems...

import collections
from Bio import SeqIO

infile = "yourfastafile.fasta"
outfile = "desired_outfilename.fasta"

dct = collections.OrderedDict()
for record in SeqIO.parse(open(infile), "fasta"):
    dct[record.description()] = str(record.seq).upper()

for k,v in dct.items():
    if int(k[4:]) in seta: #from answer above
        with open(outfile, "a") as handle:
            handle.write(">" + k + "\n" + str(v) + "\n")
Colin Anthony
  • 1,141
  • 12
  • 21
0

coding=utf8

the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r">.+\n[acgtnACGTN\n]+"

test_str = (">AB000263 |acc=AB000263|descr=Homo sapiens mRNA for prepro cortistatin like peptide, complete cds.|len=368\n"
    "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC\n"
    "CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC\n"
    "CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG\n"
    "AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC\n"
    "CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG\n"
    "TTTAATTACAGACCTGAA")

matches = re.finditer(regex, test_str)

for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.