I need to edit the gene names of a gff file, as shown below. Original file:
chr1 aug gene 10708 108196 . - . ID=gene:g754;biotype=protein_coding
chr1 aug exon 10708 107528 . - . Parent=transcript:g754;Name=g754_T001.exon.1;exon_id=g754_T001.exon.1
chr1 aug gene 20588 20898 . - . ID=gene:g756;biotype=protein_coding
chr1 aug mRNA 20588 20898 . - . ID=transcript:g756;Parent=gene:g756;biotype=protein_coding;transcript_id=g756_T001
chr1 aug exon 20588 20690 . - . Parent=transcript:g756_T001;Name=g756_T001.exon.1;exon_id=g756_T001.exon.1
New file:
chr1 aug gene 10708 108196 . - . ID=gene:Gene00001;biotype=protein_coding
chr1 aug exon 10708 107528 . - . Parent=transcript:Gene00001;Name=Gene00001_T001.exon.1;exon_id=Gene00001_T001.exon.1
chr1 aug gene 20588 20898 . - . ID=gene:Gene00002;biotype=protein_coding
chr1 aug mRNA 20588 20898 . - . ID=transcript:Gene00002;Parent=gene:Gene00002;biotype=protein_coding;transcript_id=Gene00002_T001
chr1 aug exon 20588 20690 . - . Parent=transcript:Gene00002_T001;Name=Gene00002_T001.exon.1;exon_id=Gene00002_T001.exon.1
As input I have the gff file and a list with the current and new gene name keys.
g754 Gene00001
g756 Gene00002
I have written a script in python to replace the old gene names with the new gene name. The replace command works as expected, but a newline is inserted after every time the string is replaced. I don't know why this is happening and google is failing me. I did try to mimic the solution here: Renaming Name ID in gffile., but I have a separate gene name key file. I am using anaconda/python3.6
Current code:
import sys
import getopt
import operator
in_gff = open("current_gff_file.gff3", "r")
out_gff = open("new_file.gff", "w")
name_key = open("name_key_file.txt", "r")
current_name = []
new_name = []
#create 2 lists of current and new names
for name_row in name_key:
name_field = name_row.split("\t")
current_name.append(name_field[0])
new_name.append(name_field[1])
for row in in_gff:
line = row.rstrip()
if line.startswith("#"):
print(line, file = out_gff, end = "\n") #if it is a header line just print to new file
else: #loop through list of current gene names
for name in range(len(current_name)):
if current_name[name] in line:
new_line = line.replace(current_name[name], new_name[name])
print(new_line) #test loop by printing to screen, line breaks happen after every string replacement
#Output I want: ID=transcript:Gene00002;Parent=gene:Gene00002;biotype=protein_coding;transcript_id=Gene00002_T001
#Output I get: ID=transcript:Gene00002
#Parent=gene:Gene00002
#biotype=protein_coding;transcript_id=Gene00002
#_T001
else:
continue