Find length of a contig in one fasta, using the header of another fasta as query in python

Question

I'm trying to find a python solution to extract the length of a specific sequence within a fasta file using the full header of the sequence as the query. The full header is stored as a variable earlier in the pipeline (i.e. "CONTIG"). I would like to save the output of this script as a variable to then use later on in the same pipeline.

Below is an updated version of the script using code provided by Lucía Balestrazzi.

Additional information: The following with-statement is nested inside a larger for-loop that cycles through subsamples of an original genome. The first subsample fasta in my directory has a single sequence ">chr1:0-40129801" with a length of 40129801. I'm trying to write out a text file "OUTPUT" that has some basic information about each subsample fasta. This text file will be used as an input for another program downstream.

Header names in the original fasta file are chr1, chr2, etc... while the header names in the subsample fastas are something along the lines of:

batch1.fa >chr1:0-40k

batch2.fa >chr1:40k-80k

...etc...

import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse(ORIGINAL_GENOME, "fasta")) #not the subsample
with open(GENOME_SUBSAMPLE, 'r') as FIN: 
    for LINE in FIN:
        if LINE.startswith('>'):
                #Example of "LINE"... >chr1:0-40129801
                HEADER = re.sub('>','',LINE)
                #HEADER = chr1:0-40129801
                HEADER2 = re.sub('\n','',HEADER)
                #HEADER2 = chr1:0-40129801 (no return character on the end)
                CONTIG = HEADER2.split(":")[0]
                #CONTIG = chr1 
                PART2_HEADER = HEADER2.split(":")[1]
                #PART2_HEADER = 0-40129801 
                START = int(PART2_HEADER.split("-")[0])
                #START = 0
                END = int(PART2_HEADER.split("-")[1])
                #END = 40129801
                LENGTH = END-START
                #LENGTH = 40129801 minus 0 = 40129801
            
                #This is where I'm stuck...
                ORIGINAL_CONTIG_LENGTH = len(record_dict[CONTIG]) #This returns "KeyError: 1"
                #ORIGINAL_CONTIG_LENGTH = 223705999 (this is from the full genome, not the subsample). 

                OUTPUT.write(str(START) + '\t' + str(HEADER2) + '\t' + str(LENGTH) + '\t' + str(CONTIG) + '\t' + str(ORIGINAL_CONTIG_LENGTH) + '\n')
                #OUTPUT = 0    chr1:0-40129801    40129801    chr1    223705999
OUTPUT.close()

I'm relatively new to bioinformatics. I know I'm messing up on how I'm using the dictionary, but I'm not quite sure how to fix it.

Any advice would be greatly appreciated. Thanks!

lbal · Answer 1 · 2020-07-28T03:34:49.493

6

You can do it this way:

import Bio.SeqIO as IO
record_dict = IO.to_dict(IO.parse("genome.fa", "fasta"))
print(len(record_dict["chr1"]))

or

import Bio.SeqIO as IO
record_dict = IO.to_dict(IO.parse("genome.fa", "fasta"))
seq = record_dict["chr1"]
print(len(seq))

EDIT: Alternative code

import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse("genome.fa", "fasta")
names = record_dict.keys()
for HEADER in names:
    #HEADER = chr1:0-40129801
    ORIGINAL_CONTIG_LENGTH = len(record_dict[HEADER])
    CONTIG = HEADER.split(":")[0]
    #CONTIG = chr1 
    PART2_HEADER = HEADER.split(":")[1]
    #PART2_HEADER = 0-40129801 
    START = int(PART2_HEADER.split("-")[0])
    END = int(PART2_HEADER.split("-")[1])
    LENGTH = END-START

The idea is that you define the dict once, get the value of its keys (all the contigs headers) and store them as a variable, and then loop through the headers extracting the info you need. No need to loop through the file.

Cheers

edited Jul 28 '20 at 03:34

answered Jul 27 '20 at 03:53

lbal

291
2
8

This is awesome! Is there anyway I can use a variable for the header name, so I don't have to use "chr1"......CONTIG = chr1...... print(len(record_dict[CONTIG])) – Gunther Jul 27 '20 at 17:05
@Gunther I'm not sure if I understood correctly what you want. If it is to print the length of all your contigs without naming them one by one, after you defined the dict you can put all the names in one variable with `names = record_dict.keys()`, and then loop though the names extracting the length: `for n in names: print(len(record_dict[n]))`. You can also add the name of the contig in the print function: `for n in names: print(n, len(record_dict[n]))`. Hope it helps! – lbal Jul 27 '20 at 20:37
Hi @Lucía! You have been a tremendous help so far in leading me towards the correct answer. I'm still having problems with my script. I have added more information to my original question. If you have the time to provide any assistance I would really appreciate it. Thanks! – Gunther Jul 28 '20 at 02:34
@Gunther I've been looking at your code. There is a problem indeed with de dict, you only have to define it once, not in every loop. Let me post an alternative code and see if it does what you need. – lbal Jul 28 '20 at 03:28
I truly appreciate your willingness to help me, Lucía. I'm sorry my original question wasn't clear. I've added a bit more context above. I'm trying to pull out the length of a sequence in one genome (the original), based on the header of a smaller subsample of that original genome. – Gunther Jul 28 '20 at 06:07

score 0 · Answer 2 · answered Jul 28 '20 at 20:51

This works, just changed the "CONTIG" variable to a string. Thanks Lucía for all your help the last couple of days!

import Bio.SeqIO as IO

record_dict = IO.to_dict(IO.parse(ORIGINAL_GENOME, "fasta")) #not the subsample
with open(GENOME_SUBSAMPLE, 'r') as FIN: 
    for LINE in FIN:
        if LINE.startswith('>'):
                #Example of "LINE"... >chr1:0-40129801
                HEADER = re.sub('>','',LINE)
                #HEADER = chr1:0-40129801
                HEADER2 = re.sub('\n','',HEADER)
                #HEADER2 = chr1:0-40129801 (no return character on the end)
                CONTIG = HEADER2.split(":")[0]
                #CONTIG = chr1 
                PART2_HEADER = HEADER2.split(":")[1]
                #PART2_HEADER = 0-40129801 
                START = int(PART2_HEADER.split("-")[0])
                #START = 0
                END = int(PART2_HEADER.split("-")[1])
                #END = 40129801
                LENGTH = END-START
                #LENGTH = 40129801 minus 0 = 40129801
            
                #This is where I'm stuck...
                ORIGINAL_CONTIG_LENGTH = len(record_dict[str(CONTIG)]) 
                #ORIGINAL_CONTIG_LENGTH = 223705999 (this is from the full genome, not the subsample). 

                OUTPUT.write(str(START) + '\t' + str(HEADER2) + '\t' + str(LENGTH) + '\t' + str(CONTIG) + '\t' + str(ORIGINAL_CONTIG_LENGTH) + '\n')
                #OUTPUT = 0    chr1:0-40129801    40129801    chr1    223705999
OUTPUT.close()

Find length of a contig in one fasta, using the header of another fasta as query in python

2 Answers2