I have a GFF file that looks like this:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_g4_1G94;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci    exon    452050  452543  .   -   .   ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci    exon    452153  452543  .   -   .   ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
### 

I wish to rename the ID names, starting from 0001, such that for the above gene the entry is:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_0001;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_0001.1;Parent=dd_0001
contig1 loci    exon    452050  452543  .   -   .   ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_0001.2;Parent=dd_g4_1G94
contig1 loci    exon    452153  452543  .   -   .   ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_0001.2.exon3;Parent=dd_0001.2 

The above example is simply for one gene entry, but I wish to rename all genes, and their corresponding mRNA/exon, consecutively starting from ID = dd_0001. Any hints on how to do this would be much appreciated.

Alex Trevylan
  • Please read [How do I ask a good question?](http://stackoverflow.com/help/how-to-ask) before attempting to ask more questions. –  Mar 24 '17 at 18:30

1 Answer

Open the file, replace the ID line by line, and then write the lines back.
See the Python documentation on file I/O and str.replace() for details.

gff_filename = 'filename.gff'
replace_string = 'dd_g4_1G94'
replace_with = 'dd_0001'

lines = []
with open(gff_filename, 'r') as gff_file:
    for line in gff_file:
        line = line.replace(replace_string, replace_with)
        lines.append(line)

with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)

Tested in Windows 10, Python 3.5.1, this works.
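
As a quick check of why a plain str.replace() is enough for a single known ID: it replaces every occurrence on the line, so the ID= and Parent= attributes are updated together. The sample line below is taken from the question (tab-separated here, which is an assumption about the real file):

```python
# One mRNA line from the question; tab separators are assumed.
line = "contig1\tloci\tmRNA\t452050\t453069\t14\t-\t.\tID=dd_g4_1G94.1;Parent=dd_g4_1G94\n"

# str.replace() swaps all occurrences, not just the first one.
renamed = line.replace("dd_g4_1G94", "dd_0001")
print(renamed, end="")
```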

To find every ID in the file automatically and number them consecutively, use a regex.

import re

gff_filename = 'filename.gff'
replace_with = 'dd_{}'
re_pattern = r'ID=(.*?)[;.]'

ids = []
lines = []
with open(gff_filename, 'r') as gff_file:
    file_lines = gff_file.readlines()

for line in file_lines:
    matches = re.findall(re_pattern, line)
    for found_id in matches:
        if found_id not in ids:
            ids.append(found_id)

for line in file_lines:
    for ID in ids:
        if ID in line:
            # Number from 1 so that the first gene becomes dd_0001, as asked
            id_suffix = str(ids.index(ID) + 1).zfill(4)
            line = line.replace(ID, replace_with.format(id_suffix))
    lines.append(line)

with open(gff_filename, 'w') as gff_file:
    gff_file.writelines(lines)

There are other ways of doing this, but this is quite robust.
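
One of those other ways, as a rough sketch rather than a drop-in replacement: with thousands of IDs, looping over every known ID for every line gets slow, and replacing by substring would also change a longer ID that happens to contain a shorter one. The version below makes a single pass and keeps a dict from old gene ID to new gene ID. The names GENE_PART, rename_gene_ids and rename_gff are made up for this example. It assumes gene IDs contain no '.', that numbering should follow the order of first appearance, and it only renames the first value of a comma-separated Parent= list.

```python
import re

# Hypothetical helper names; not part of the answer above.
# Captures the key (ID or Parent) and the gene-level part of its value,
# i.e. everything up to the first '.', ';', ',' or whitespace.
GENE_PART = re.compile(r'\b(ID|Parent)=([^.;,\s]+)')


def rename_gene_ids(lines, prefix='dd_', width=4, start=1):
    """Return the lines with gene IDs renumbered in order of first appearance."""
    new_ids = {}  # old gene ID -> new gene ID

    def substitute(match):
        key, old_id = match.group(1), match.group(2)
        if old_id not in new_ids:
            number = start + len(new_ids)
            new_ids[old_id] = prefix + str(number).zfill(width)
        return '{}={}'.format(key, new_ids[old_id])

    renamed = []
    for line in lines:
        if line.startswith('#'):
            # Leave directives and comments such as '###' untouched.
            renamed.append(line)
        else:
            renamed.append(GENE_PART.sub(substitute, line))
    return renamed


def rename_gff(in_path, out_path):
    """Read one GFF file and write the renamed copy to a second file."""
    with open(in_path) as src:
        renamed = rename_gene_ids(src)
    with open(out_path, 'w') as dst:
        dst.writelines(renamed)
```

Calling rename_gff('filename.gff', 'renamed.gff') writes to a separate file, so the original is still there if the result needs checking.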

Lupilum
  • This is very useful, thank you. The only issue is that I have thousands of IDs to replace. Is there any way in python, to say whenever a new un-replaced ID is found, replace with a consecutive ID starting from 0000? – Alex Trevylan Mar 23 '17 at 18:23
  • Thank you very much for helping - I just need some hints as to how to automate this process for thousands of IDs, and replace them with new ones starting from dd_0001... – Alex Trevylan Mar 23 '17 at 18:28
  • 1
    Yes, sorry. I couldn't edit my comment after 5 minutes. I will edit my answer to show how that can be done. – Lupilum Mar 23 '17 at 18:31
  • Now it should do what you need it to do. – Lupilum Mar 23 '17 at 19:41
  • Sorry, I edited it still, now it should work. It didn't work when there were two different ids on the same line. Btw, is there supposed to be a 'dd_g4_1G94' on the fifth line in your output example? – Lupilum Mar 23 '17 at 19:49
  • The backslash in the negated character class looks misplaced, unless you specifically intended to include literal backslashes in the negation, too. Inside a character class, a dot is a simple literal which does not need any escaping. – tripleee Dec 29 '22 at 10:56
  • Good point tripleee, I edited the pattern. – Lupilum Dec 31 '22 at 00:54
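
For reference, the point about character classes is easy to check: inside [...] a dot is already a literal, so the escaped and unescaped forms find the same thing. The sample attribute string is taken from the question, and the variable names are only for this illustration.

```python
import re

attributes = "ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1"

# Inside a character class '.' is literal, so the backslash is redundant.
unescaped = re.findall(r'ID=(.*?)[;.]', attributes)
escaped = re.findall(r'ID=(.*?)[;\.]', attributes)
print(unescaped, escaped)
```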