I'm trying to find a python solution to extract the length of a specific sequence within a fasta file using the full header of the sequence as the query. The full header is stored as a variable earlier in the pipeline (i.e. "CONTIG"). I would like to save the output of this script as a variable to then use later on in the same pipeline.
Below is an updated version of the script using code provided by Lucía Balestrazzi.
Additional information: The following with-statement is nested inside a larger for-loop that cycles through subsamples of an original genome. The first subsample fasta in my directory has a single sequence ">chr1:0-40129801" with a length of 40129801. I'm trying to write out a text file "OUTPUT" that has some basic information about each subsample fasta. This text file will be used as an input for another program downstream.
Header names in the original fasta file are chr1, chr2, etc... while the header names in the subsample fastas are something along the lines of:
batch1.fa >chr1:0-40k
batch2.fa >chr1:40k-80k
...etc...
import Bio.SeqIO as IO
record_dict = IO.to_dict(IO.parse(ORIGINAL_GENOME, "fasta")) #not the subsample
with open(GENOME_SUBSAMPLE, 'r') as FIN:
for LINE in FIN:
if LINE.startswith('>'):
#Example of "LINE"... >chr1:0-40129801
HEADER = re.sub('>','',LINE)
#HEADER = chr1:0-40129801
HEADER2 = re.sub('\n','',HEADER)
#HEADER2 = chr1:0-40129801 (no return character on the end)
CONTIG = HEADER2.split(":")[0]
#CONTIG = chr1
PART2_HEADER = HEADER2.split(":")[1]
#PART2_HEADER = 0-40129801
START = int(PART2_HEADER.split("-")[0])
#START = 0
END = int(PART2_HEADER.split("-")[1])
#END = 40129801
LENGTH = END-START
#LENGTH = 40129801 minus 0 = 40129801
#This is where I'm stuck...
ORIGINAL_CONTIG_LENGTH = len(record_dict[CONTIG]) #This returns "KeyError: 1"
#ORIGINAL_CONTIG_LENGTH = 223705999 (this is from the full genome, not the subsample).
OUTPUT.write(str(START) + '\t' + str(HEADER2) + '\t' + str(LENGTH) + '\t' + str(CONTIG) + '\t' + str(ORIGINAL_CONTIG_LENGTH) + '\n')
#OUTPUT = 0 chr1:0-40129801 40129801 chr1 223705999
OUTPUT.close()
I'm relatively new to bioinformatics. I know I'm messing up on how I'm using the dictionary, but I'm not quite sure how to fix it.
Any advice would be greatly appreciated. Thanks!