creating a dictionary and count bases from a multifasta file in Python

Question

To solve this problem I have used the BioPython library. Nevertheless I would like to learn programming and therefore I don't want to use BioPython library.

I have one Fasta File that contains the following DNA sequences:

>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79
TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGC
AGGACAGGCCGCTAAAGTGTGGTTTCGTGGTT
>chr12_9180206_+:chr12_118582391_+:a2;2 total_counts: 135 Seed: 4 K: 20 length: 80
CTAACCCCCTACTTCCCAGACAGCTGCTCGTACAGTTTGGGCACATAGTCATCCCACTCG
GCCTGGTAACACGTGCCAGCGACAGCTGCTCGTA
>chr1_8969882_-:chr1_568670_-:a1;113 total_counts: 7600 Seed: 225 K: 20 length: 86
CACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAAC
CAAACCACTTTCACCGCCACACGACCCTTCAACTCCTACATACTTCCCCCA
TTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGA
>chr1_8969882_-:chr1_568670_-:a2;69 total_counts: 6987 Seed: 197 K: 20 length: 120
TGAACCTACGACTACACCGACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCA
TTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCG

Then I would like to create a dictionary with the first line started with > as the key of the dictionary and the sequence as the value.

And in the same time, for each one of the 4 sequences I would like to know how can I obtain the number count for each DNA base?

Thanks

score 1 · Answer 1 · answered Feb 16 '18 at 00:34

To run this code, in the terminal: python readfasta.py fastafile.fasta

import string, sys
##########I. To Load Fasta File##############
file = open(sys.argv[1]) 
rfile = file.readline()
seqs = {} 
##########II. To Make fasta dictionary####
tnv = ""#temporal name value
while rfile != "":
    if ">" in rfile:
        tnv = string.strip(rfile)
        seqs[tnv] = ""
    else:
        seqs[tnv] += string.strip(rfile)    
    rfile = file.readline()
##############III. To Make Counts########
count_what = ["A", "T", "C", "G", "ATG"]
for s in seqs:
    name = s
    seq = seqs[s]
    print s # to print seq name if you have a multifasta file
    for cw in count_what:
        print cw, seq.count(cw)# to print counts by seq

score 0 · Answer 2 · answered Dec 24 '17 at 18:00

Not a hard task with python, but what have you tried so far?

import pprint

with open('/path/to/subject.fasta') as f:
    ret = {}

    all_bases = ''
    bases = ''
    description_line = ''
    for l in f:
        l = l.strip()
        if l.startswith('>'):
            if bases:
                ret[description_line] = bases
                bases = ''
            description_line = l
        else:
            bases += l
            all_bases += l
    if bases:
        ret[description_line] = bases

pprint.pprint(ret)

you got:

{'>chr12_9180206_+:chr12_118582391_+:a1;2 total_counts: 115 Seed: 4 K: 20 length: 79':
 'TTGGTTTCGTGGTTTTGCAAAGTATTGGCCTCCACCGCTATGTCTGGCTGGTTTACGAGCAGGACAGGCCGCTAAAGTGTGGTTTCGTGGTT',
 ...}

to get count as all bases:

from collections import Counter
print(Counter(all_bases))

yields:

Counter({'C': 150, 'T': 114, 'A': 110, 'G': 91})

Thank you for your help georgexsh. Yes I've tried. Nevertheless, I would like to count the number of each base for each sequence of the FASTA file. Therefore, 4 counts because there are 4 different sequences in the FASTA file. — David, Dec 24 '17 at 22:26
@David move counter into the file reading loop, can you do it? (I feel that I shouldn't answer this at first) — georgexsh, Dec 25 '17 at 06:17

creating a dictionary and count bases from a multifasta file in Python

2 Answers2