I receive too many DNA objects in my read DNA function

Question

class DnaSeq:
    
    def __init__(self, accession, seq):
        self.accession = accession
        self.seq = seq
        
    def __len__(self):
        if self.seq == None:
            raise ValueError
        elif self.seq =='':
            raise ValueError
        else:
            return len(self.seq)

    def __str__(self):
        if self.accession =='':
            raise ValueError
        elif self.accession == None:
            raise ValueError
        else:
            return f"<DnaSeq accession='{self.accession}'>"

def read_dna(filename):
    DnaSeq_objects = []
    new_dna_seq = DnaSeq("s1", "AAA")
    with open(filename, 'r') as seq:
        for line in seq.readlines():
            if line.startswith('>'):
                new_dna_seq.accession = line      
            else:
                new_dna_seq.seq = line.strip()
            DnaSeq_objects.append(new_dna_seq)
                
    return DnaSeq_objects

This is the .fa file I tried to read:


> s0
> ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGT GTTAATCTTACAACCAGAACTCAAT
> s1
> GTTAATCTTACAACCAGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCCAATGTTACTTGGTTCCATGCTATACATGTC
> s2
> ACTCAGGACTTGTTCTTACCTTTCTTTTCCAATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTAC
> s3
> TCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATAATAAGAGGCTGGATTTTTGGTACTACTTTAGATTCGAAGACCCAGTCCCT
> s4
> AGACCCAGTCCCTACTTATTGTTAATAACGCTACTAATGTTGTTATTAAAGTCTGTGAATTTCAATTTTGTAATGATCCATTT
> s5
> TTTGTAATGATCCATTTTTGGGTGTTTATTACCACAAAAACAACAAAAGTTGGATGGAAAGTGAGTTCAGAGTTTATTCTAGTGCGA

It's supposed to return 6 DNA objects but I received too many.

read_dna('ex1.fa')
[<__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>, 
 <__main__.DnaSeq object at 0x000001C67208F820>
]

How can I fix this so that it returns the right number of objects?

It's reading every line beginning with `>` as a sequence. In the FASTA format, only the header/description line begins with `>`. The sequence line(s) don't have any prefix, they're just single-letter bases or amino acids. — MattDMo, Feb 18 '23 at 16:31
But in the file the OP shows, what I suppose is the sequence lines seems to start with `>`, too? — fuenfundachtzig, Feb 18 '23 at 16:32
I see. In any case, the code to read the file would also not work for a correct FASTA file. — fuenfundachtzig, Feb 18 '23 at 16:34
I guess you just want to properly indent the line `DnaSeq_objects.append(new_dna_seq)` and it should work (iff you also remove the `>` from the sequence lines). — fuenfundachtzig, Feb 18 '23 at 16:34
If it's a `ValueError` for either attribute to be `None` or an empty sequence, you should not be allowed to initialize them as such in the first place. Get the necessary data from the file first, *then* pass it to `DnaSeq`. The object itself is not the place to accumulate incomplete data. — chepner, Feb 18 '23 at 17:10
You should also be creating a *new* instance of `DnaSeq` for each sequence, not continually adding to the existing one. — chepner, Feb 18 '23 at 17:13

MattDMo · Answer 1 · 2023-02-18T18:36:29.743

Your code is reading every line beginning with > as an accession, but it's not populating the .seq attribute because it's not finding any sequences. In the FASTA format, only the header/description/accession ID line begins with >. The sequence line(s) don't have any prefix, they're just single-letter bases or amino acids.

There's actually a lot more you need to do. You need to have a default value for self.seq, you need to parse the sequences for spaces and other irrelevant characters, and you need to be able to concatenate multiple sequence lines. Instead of rolling your own code, I highly recommend checking out Biopython.

I decided to give you some example code that will help you on your way, using a couple of neat Python constructs to condense things down a bit and clean up your original code. Please don't use this exact code as your assignment! It may contain concepts that you haven't learned about yet, or that you don't fully understand, and your professor will quickly be able to see it's not your original work. Play around with the code, make sure you understand what it does, try to think of any edge cases where it might not work as expected (such as having an accession without a sequence, or having a sequence spread over multiple lines). Then, come up with your own algorithm and submit that.

class DnaSeq:
    def __init__(self, accession, seq):
        self.accession = accession
        self.seq = seq

    def __len__(self):
        if self.seq:
            return len(self.seq)
        else:
            raise ValueError("Sequence missing")

    def __repr__(self):
        if self.accession and self.seq:
            return f"<DnaSeq accession='{self.accession}', seq='{self.seq[:13]}...'>"
        else:
            raise ValueError("Accession ID or sequence missing")


def read_dna(filename):
    DnaSeq_objects = []

    with open(filename, 'r') as f:
        # get rid of any whitespace on either end of the line
        contents = [line.strip() for line in f.readlines()]

    while len(contents): # while there are lines left to process
        if len(contents[0]) == 0: # there was just whitespace and now it's an empty string
            contents.pop(0) # pull the first item off the list
            continue # go to the next line in the list

        # no point in creating dummy values when we can use the real thing;
        # strip any leading "> " from both the accession and the sequence line
        new_dna_seq = DnaSeq(contents.pop(0).lstrip("> "), contents.pop(0).lstrip("> "))
        DnaSeq_objects.append(new_dna_seq)

    return DnaSeq_objects

results = [str(seq_obj) for seq_obj in read_dna("ex1.fa")]
print("\n".join(results))
# <DnaSeq accession='s0', seq='ATGTTTGTTTTTC...'>
# <DnaSeq accession='s1', seq='GTTAATCTTACAA...'>
# <DnaSeq accession='s2', seq='ACTCAGGACTTGT...'>
# <DnaSeq accession='s3', seq='TCTGGGACCAATG...'>
# <DnaSeq accession='s4', seq='AGACCCAGTCCCT...'>
# <DnaSeq accession='s5', seq='TTTGTAATGATCC...'>
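
The example above pairs each accession with exactly one following line. The answer also mentions skipping stray whitespace and joining sequences that span several lines. A minimal sketch of that, using no imports and assuming a well-formed FASTA input where only header lines start with `>` (the names `parse_fasta_lines` and `_finish_record` are hypothetical, not part of the original code):

```python
def _finish_record(accession, chunks):
    """Join the collected sequence chunks; reject a header with no sequence."""
    seq = "".join(chunks)
    if not seq:
        raise ValueError(f"no sequence found for accession {accession!r}")
    return accession, seq


def parse_fasta_lines(lines):
    """Return a list of (accession, sequence) tuples from FASTA-style lines."""
    records = []
    accession = None  # header of the record currently being collected
    chunks = []       # sequence pieces seen so far for that record

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank or whitespace-only lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if accession is not None:
                records.append(_finish_record(accession, chunks))
            accession = line[1:].strip()
            chunks = []
        elif accession is None:
            raise ValueError("sequence data found before the first header line")
        else:
            # drop any whitespace inside the sequence line
            chunks.append("".join(line.split()))

    if accession is not None:
        records.append(_finish_record(accession, chunks))
    return records


example = [">s0\n", "ATG TTT\n", "GTT\n", "\n", ">s1\n", "ACGT\n"]
print(parse_fasta_lines(example))
# [('s0', 'ATGTTTGTT'), ('s1', 'ACGT')]
```

Because the function takes any iterable of lines, it can be called as `parse_fasta_lines(f)` inside a `with open(filename) as f:` block, and each tuple can then be passed to `DnaSeq(accession, seq)`.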

I can't use any imports on this task, and Biopython seems to use a lot of them. — pythonhelpneeded, Feb 18 '23 at 16:54
Your input file is malformed. There shouldn't be a `>` character in front of the nucleotide sequences. If this is the file you're required to work with, at a minimum you should inform your teacher/professor that it's not [FASTA format](https://en.wikipedia.org/wiki/FASTA_format), and shouldn't be a `.fa` file. One of the main features of the FASTA format (and other sequence formats) is that you need to be able to determine where an arbitrary line belongs. In this case, you're just going to have to read the first non-blank line as the accession, and the very next line as the sequence. [...] — MattDMo, Feb 18 '23 at 17:01
[...] Make sure you strip the leading `> ` from each line, and ensure that the sequence line doesn't contain any characters other than G, C, A, and T, or at the very least it should only contain alphabetic characters. — MattDMo, Feb 18 '23 at 17:03
Stack Overflow changed it a little, but the `>` is not in front of the nucleotides. I removed the `>` with `.strip().replace('>','')`. — pythonhelpneeded, Feb 18 '23 at 17:24
How do I remove a blank line when the blank line's `len()` is not 0? — pythonhelpneeded, Feb 18 '23 at 17:25
@pythonhelpneeded so the first line of the file is indeed blank? — MattDMo, Feb 18 '23 at 17:31
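
On the two points raised in these comments: a line read from a file still carries its trailing newline, so a "blank" line has a `len()` of at least 1 until it is stripped. A small sketch of both checks, where `is_blank`, `clean_sequence`, and `VALID_BASES` are hypothetical names and the allowed alphabet is assumed to be just G, C, A, and T:

```python
VALID_BASES = set("ACGT")  # assumption: plain DNA, no ambiguity codes


def is_blank(line):
    """True if the line holds nothing but whitespace (including the newline)."""
    return line.strip() == ""


def clean_sequence(line):
    """Strip a leading '> ' and all whitespace, then check the characters."""
    seq = "".join(line.strip().lstrip("> ").split())
    unexpected = set(seq) - VALID_BASES
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {sorted(unexpected)}")
    return seq


print(len("\n"), is_blank("\n"))        # 1 True
print(clean_sequence("> ATGT GTTA\n"))  # ATGTGTTA
```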

Aditya Mahakali · Answer 2 · 2023-02-18T19:13:57.000

In your loop, you should change the condition:

for line in seq.readlines():
    if line.startswith('>'):
        new_dna_seq.accession = line
    else:
        new_dna_seq.seq = line.strip()
    DnaSeq_objects.append(new_dna_seq)

To:

for line in seq.readlines():
    if line.startswith('> s'):
        new_dna_seq.accession = line.strip().replace('> ', '')      
    else:
        new_dna_seq.seq = line.strip().replace('> ', '')
        DnaSeq_objects.append(new_dna_seq)

The `if` statement now checks whether the line starts with '> s', and the `append` call is indented into the `else` block. I have also removed '> ' from the accession and the sequence, as it seems unnecessary.
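
One thing worth checking in this loop: it keeps reusing the single `new_dna_seq` object created before the loop, so every list entry refers to the same instance (the question's output shows one memory address repeated). A minimal sketch that builds a new `DnaSeq` per record instead, assuming header and sequence lines strictly alternate as in the question's file (`read_dna_lines` is a hypothetical helper that takes lines rather than a filename):

```python
class DnaSeq:
    def __init__(self, accession, seq):
        self.accession = accession
        self.seq = seq


def read_dna_lines(lines):
    """Pair each accession line with the sequence line that follows it."""
    dna_seqs = []
    accession = None  # set once a header has been read, cleared after its sequence
    for raw in lines:
        line = raw.strip().lstrip("> ")
        if not line:
            continue  # skip blank lines
        if accession is None:
            accession = line
        else:
            # create a fresh object for every record
            dna_seqs.append(DnaSeq(accession, line))
            accession = None
    return dna_seqs


lines = ["> s0\n", "> ATGT\n", "> s1\n", "> GTTA\n"]
records = read_dna_lines(lines)
print(len(records))                             # 2
print([(r.accession, r.seq) for r in records])  # [('s0', 'ATGT'), ('s1', 'GTTA')]
print(records[0] is records[1])                 # False
```

This does not depend on the accession starting with `s`, only on the order of the lines.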

edited Feb 18 '23 at 19:13

answered Feb 18 '23 at 17:11


What if the accession doesn't start with `s`? – MattDMo Feb 18 '23 at 17:27
Then it would not work, my solution is based on the file structure given in the question. But with the above corrections, the given problem can be solved. – Aditya Mahakali Feb 18 '23 at 19:08

score -2 · Answer 3 · answered Feb 18 '23 at 16:31


Since `DnaSeq_objects` is a list, just return `DnaSeq_objects[:6]`. Even if the list contains fewer than 6 elements, this syntax will not raise an error and will just return all elements.


lollerskates


That's a terrible idea, and that's not what the problem is. – MattDMo Feb 18 '23 at 16:32
So your answer is "If you're getting more than you expect, just throw the rest away". Yes, this is terrible. – John Gordon Feb 18 '23 at 16:33
