I have a GFF3 file (essentially a TSV file with 9 columns) and I'm trying to modify only its first column, writing the change back to the file itself.

The GFF3 file looks like this:

## GFF3 file
## replicon1
## replicon2
replicon_1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon_1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon_2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon_2  prokka  gene    70  98  .   @   .   ID=some_gene_2;

I wrote a few lines of code in which I choose the symbol to replace (e.g. "_") and the symbol to insert in its place (e.g. "@"):

import os
import re
import argparse
import pandas as pd

def myfunc() -> argparse.Namespace:
    # Create the parser before adding arguments to it
    ap = argparse.ArgumentParser()
    ap.add_argument("-f", "--file", help="path to file")
    ap.add_argument("-i", "--input_word", help="Symbol to delete")
    ap.add_argument("-o", "--output_word", help="Symbol to insert")
    return ap.parse_args()
args = myfunc()
my_file = args.file
in_char = args.input_word
out_char = args.output_word

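with open(my_file, 'r+') as f:
    rawfl = f.read()
    # Replaces every match in the whole file, not just in the first column
    rawfl = re.sub(in_char, out_char, rawfl)
    f.seek(0)
    f.write(rawfl)
    f.truncate()  # drop leftover bytes if the new content is shorter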

The output is something like this:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some@gene@1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some@gene@1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some@gene@2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some@gene@2;

As you can see, every "_" has been changed to "@". I tried to modify the script using pandas in order to apply the change only to the first column (seqid, see below):

with open(my_file, 'r+') as f:
    genomic_dataframe = pd.read_csv(f, sep="\t", names=['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
    genid = genomic_dataframe.seqid
    genid = str(genid)  # re.sub expects a string, not a pandas Series
    genid = re.sub(in_char, out_char, genid)
    f.seek(0)
    f.write(genid)

I do not get the expected result. Instead, the (correctly modified) seqid column is written into the file alongside the original content rather than replacing the original column.

What I'd like to obtain is something like this:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;

Where the "@" symbol is present only in the first column while the "_" is maintained in the 9th column.

Do you know how to fix this? Thank you all.

2 Answers

You can use re.sub with a pattern anchored at ^ (start of the string) and pass a lambda function as the replacement. For example:

import re

# change only first column:
r = re.compile(r"^(.*?)(?=\s)")

in_char = "_"
out_char = "@"

with open("input_file.txt", "r") as f_in, open("output_file.txt", "w") as f_out:
    for line in map(str.strip, f_in):
        # skip empty lines and lines starting with ##
        if not line or line.startswith("##"):
            print(line, file=f_out)
            continue

        line = r.sub(lambda g: g.group(1).replace(in_char, out_char), line)
        print(line, file=f_out)

Creates output_file.txt:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;
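
The question asks for the original file to be overwritten rather than a second file created. One possible way to do that with the standard library is fileinput with inplace=True, which redirects print into the file being read. The following is a minimal sketch, not part of the original answer; the helper name replace_first_column is hypothetical and the pattern is the same one used above.

```python
import fileinput
import re

# Same pattern as above: capture everything before the first whitespace.
FIRST_COLUMN = re.compile(r"^(.*?)(?=\s)")


def replace_first_column(path, in_char, out_char):
    """Rewrite `path` in place, replacing in_char only in the first column.

    Hypothetical helper, shown only for illustration.
    """
    # With inplace=True, anything printed inside the loop is written to the file.
    with fileinput.input(files=path, inplace=True) as f_in:
        for raw in f_in:
            line = raw.rstrip("\n")
            # Leave empty lines and ## header lines untouched.
            if line and not line.startswith("##"):
                line = FIRST_COLUMN.sub(
                    lambda m: m.group(1).replace(in_char, out_char), line
                )
            print(line)
```

It could be called as replace_first_column(my_file, in_char, out_char) with the variables from the question. fileinput also accepts a backup argument (for example backup=".bak") if a copy of the original file should be kept.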
Andrej Kesely

If you only want to replace the first occurrence of _ with @ on each line, you can do it this way, without loading the file as a dataframe and without any third-party library such as pandas.

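with open('f') as f:
    lines = [line.rstrip() for line in f]

for i, line in enumerate(lines):
    # Ignore comments and empty lines
    if not line or line[0] == '#':
        continue
    # Replace only the first "_" and store the result back in the list
    lines[i] = line.replace('_', '@', 1)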

After the loop, lines contains:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;
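
Note that replace('_', '@', 1) touches only the first underscore on a line, so it matches the first-column requirement only when the seqid contains exactly one "_". A possible variant, sketched here under the assumption that columns are tab-separated as in GFF3, splits off the first column, replaces every occurrence inside it, and writes the lines back to the same path. The function names replace_in_seqid and rewrite_file are hypothetical.

```python
from pathlib import Path


def replace_in_seqid(lines, old, new):
    """Return new lines with `old` replaced by `new` in the first column only."""
    result = []
    for line in lines:
        # Keep comment/header lines and empty lines as they are.
        if not line or line.startswith("#"):
            result.append(line)
            continue
        # partition splits at the first tab: (seqid, "\t", rest of the line).
        seqid, tab, rest = line.partition("\t")
        result.append(seqid.replace(old, new) + tab + rest)
    return result


def rewrite_file(path, old, new):
    """Read the whole file first, then overwrite it with the modified lines."""
    lines = Path(path).read_text().splitlines()
    Path(path).write_text("\n".join(replace_in_seqid(lines, old, new)) + "\n")
```

Reading everything before opening the file for writing avoids the partial-overwrite effect of seek(0) followed by write on an 'r+' handle, which leaves old bytes in place when the new content is shorter.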
Alaaaaa
It works but does not overwrite the original column; it adds a new one with the corrected character. I'd like to completely replace the column of the original file with the new symbol. – Iacopo Passeri Aug 22 '21 at 17:35