0

I need to edit the gene names of a gff file, as shown below. Original file:

chr1    aug gene    10708   108196  .   -   .   ID=gene:g754;biotype=protein_coding
chr1    aug exon    10708   107528  .   -   .   Parent=transcript:g754;Name=g754_T001.exon.1;exon_id=g754_T001.exon.1
chr1    aug gene    20588   20898   .   -   .   ID=gene:g756;biotype=protein_coding
chr1    aug mRNA    20588   20898   .   -   .   ID=transcript:g756;Parent=gene:g756;biotype=protein_coding;transcript_id=g756_T001
chr1    aug exon    20588   20690   .   -   .   Parent=transcript:g756_T001;Name=g756_T001.exon.1;exon_id=g756_T001.exon.1

New file:

chr1    aug gene    10708   108196  .   -   .   ID=gene:Gene00001;biotype=protein_coding
chr1    aug exon    10708   107528  .   -   .   Parent=transcript:Gene00001;Name=Gene00001_T001.exon.1;exon_id=Gene00001_T001.exon.1
chr1    aug gene    20588   20898   .   -   .   ID=gene:Gene00002;biotype=protein_coding
chr1    aug mRNA    20588   20898   .   -   .   ID=transcript:Gene00002;Parent=gene:Gene00002;biotype=protein_coding;transcript_id=Gene00002_T001
chr1    aug exon    20588   20690   .   -   .   Parent=transcript:Gene00002_T001;Name=Gene00002_T001.exon.1;exon_id=Gene00002_T001.exon.1

As input I have the gff file and a list with the current and new gene name keys.

g754 Gene00001
g756 Gene00002

I have written a script in python to replace the old gene names with the new gene name. The replace command works as expected, but a newline is inserted after every time the string is replaced. I don't know why this is happening and google is failing me. I did try to mimic the solution here: Renaming Name ID in gffile., but I have a separate gene name key file. I am using anaconda/python3.6


Current code:

import sys
import getopt
import operator

in_gff = open("current_gff_file.gff3", "r")
out_gff = open("new_file.gff", "w")
name_key = open("name_key_file.txt", "r")

current_name = []
new_name = []
#create 2 lists of current and new names                                                                                                                           
for name_row in name_key:
    name_field = name_row.split("\t")
    current_name.append(name_field[0])
    new_name.append(name_field[1])

for row in in_gff:
    line = row.rstrip()
    if line.startswith("#"):
        print(line, file = out_gff, end = "\n") #if it is a header line just print to new file
    else: #loop through list of current gene names
        for name in range(len(current_name)):
            if current_name[name] in line:                                                   
                new_line = line.replace(current_name[name], new_name[name])                                                                                                                        
                print(new_line) #test loop by printing to screen, line breaks happen after every string replacement
                #Output I want: ID=transcript:Gene00002;Parent=gene:Gene00002;biotype=protein_coding;transcript_id=Gene00002_T001
                #Output I get: ID=transcript:Gene00002
                #Parent=gene:Gene00002
                #biotype=protein_coding;transcript_id=Gene00002
                #_T001                                                                                                     
            else:
                continue

c_o
  • 3
  • 2

2 Answers2

0

I think this is much easier to solve with regex. I put your data in a file called original.txt and new_gene_names.txt and output the result to output.txt.

import re

# Note there is a lookahead and a look behind
# This expression matches any string that starts with
# `gene:` and ends with `;` and the pattern pulls out
# what is between those two things.
pattern = re.compile(r'(?<=gene:)(.*?)(?=;)')

with open('new_gene_names.txt') as gene_names,\
     open('original.txt') as f,\
     open('output.txt', 'w') as out_f:
    # A dictionary mapping gene names to what they will be changed to
    rename_dict = dict(line.split() for line in gene_names)
    for line in f:
        # Search for the gene name
        result = pattern.search(line)
        # If we found a gene name and we can rename it then we substitute it
        if result and result.group(0) in rename_dict:
            line = pattern.sub(rename_dict[result.group(0)], line)
        out_f.write(line)

Update - to match the part in Name= as well just change the regex t:

# This expression matches any string that starts with
# `gene:` and ends with `;`  OR starts with `Name=` and ends with `_`.
# and the pattern pulls out what is between the start and the end.
pattern = re.compile(r'(?<=gene:)(.*?)(?=;)|(?<=Name=)(.*?)(?=_)')

See the regex in action.

  • Thanks for this solution. However, the names that need to be replaced don't only come after the word 'gene' but also after Name=, transcipt: etc. Would I have to loop through multiple regular expressions? – c_o Jan 17 '20 at 03:33
  • No, you just need to update the one expression to include that pattern as well. See my update above. I also updated the regex link. Basically you add a `|` to say `or` and then do another expression to match what else you want. Note that this assumes the contents of `gene:` and the contents of `Name=` before the `_` are the same. If they are not, bugs may occur. – Error - Syntactical Remorse Jan 17 '20 at 13:25
  • ^ if `gene:` and `Name=` occur on the same line, which in your example they do not so this should be fine. – Error - Syntactical Remorse Jan 17 '20 at 13:32
0

When iterating through a file, each line still includes the trailing newline. Strip it away when building your translation table:

for name_row in name_key:
    name_field = name_row.split("\t")
    current_name.append(name_field[0])
    new_name.append(name_field[1].strip('\n'))  # store stripped names only
MisterMiyagi
  • 44,374
  • 10
  • 104
  • 119