Modifying a GenBank file

Question

Hi, I am trying to search through a file for a specific list of words. If one of those words is found, I want to add a new line underneath it containing the phrase /colour = 1 (I don't want to remove the original word I am searching for).

An extract of the file for context and format:
LOCUS       contig_2_pilon_pilon 5558986 bp    DNA     linear   BCT 16-JUN-2020
DEFINITION  Escherichia coli O157:H7 strain (270078)
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Escherichia coli 270078
  ORGANISM  Escherichia coli 270078
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
COMMENT     Annotated using prokka 1.14.6 from
            https://github.com/tseemann/prokka.
FEATURES             Location/Qualifiers
     source          1..5558986
                     /organism="Escherichia coli 270078"
                     /mol_type="genomic DNA"
                     /strain="strain"
                     /db_xref="taxon:562"
     CDS             61523..61744
                     /gene="pspD"
                     /locus_tag="JCCJNNLA_00057"
                     /inference="ab initio prediction:Prodigal:002006"
                     /inference="similar to AA sequence:RefSeq:EG10779-MONOMER"
                     /codon_start=1
                     /transl_table=11
                     /product="peripheral inner membrane heat-shock protein"
                     /translation="MNTRWQQAGQKVKPGFKLAGKLVLLTALRYGPAGVAGWAIKSVA
                     RRPLKMLLAVALEPLLSRAANKLAQRYKR"

Here is one of the lists of words I am looking for throughout the file:

regulation_list=["anti-repressor","anti-termination","antirepressor","antitermination","antiterminator","anti-terminator","cold-shock","cold shock","heat-shock","heat shock","regulation","regulator","regulatory","helicase","antibiotic resistance","repressor","zinc","sensor","dipeptidase","deacetylase","5-dehydrogenase","glucosamine kinase","glucosamine-kinase","dna-binding","dna binding","methylase","sulfurtransferase","acetyltransferase","control","ATP-binding","ATP binding","Cro","Ren protein","CII","inhibitor","activator","derepression","protein Sxy","sensing","sensor","Tir chaperone","Tir-cytoskeleton","Tir cytoskeleton","Tir protein","EspD"]

As you can see, that extract contains one of the phrases I am looking for, and I want to add a new line underneath it with the phrase /colour = 1.

Any help would be great!

You can simply read all lines of your file using `data = open('your_file').readlines()`, then loop over each line and check whether it contains any of your words. If it does, store its position somewhere. Once done, add a line containing `/colour = 1` after every position you stored. Does that make sense? - Big Bro, Aug 21 '20 at 12:42
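The two-pass approach described in this comment could be sketched roughly as below. This is only an illustration under a few assumptions: the function name `add_colour_lines` and the sample lines are hypothetical, matching is plain case-sensitive substring search, and every input line is assumed to end with a newline.

```python
def add_colour_lines(lines, words, marker="/colour = 1"):
    """Insert `marker` as a new line after every line containing any word."""
    # First pass: record the index of every line that contains a search word.
    hits = [i for i, line in enumerate(lines) if any(w in line for w in words)]
    # Second pass: insert from the end so that earlier indices stay valid.
    for i in reversed(hits):
        lines.insert(i + 1, marker + "\n")
    return lines


# Hypothetical sample data standing in for open("your_file").readlines()
sample = ["foo\n", "foo anti-termination\n", "baz\n"]
result = add_colour_lines(sample, ["anti-repressor", "anti-termination"])
print("".join(result), end="")
```

Inserting in reverse order avoids having to shift the stored positions after each insertion.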

score 0 · Answer 1 · answered Aug 21 '20 at 15:59

# Create a simple input file for testing:
cat > foo.txt <<EOF
foo
foo anti-termination
bar anti-repressor anti-termination
baz
EOF

python -c '
import re

# Using a shortened version of your list ("etc" is just a placeholder):
regulation_list = ["anti-repressor", "anti-termination", "etc"]

# For speed and simplicity, compile the regular expression once, then reuse it later.
# re.escape makes sure each phrase is matched literally:
regulation_re = re.compile("|".join(re.escape(word) for word in regulation_list))

with open("foo.txt", "r") as in_file:
    for line in in_file:
        # Remove only the trailing newline so that leading indentation is kept:
        line = line.rstrip("\n")
        print(line)
        if regulation_re.search(line):
            print("/colour = 1")
' > bar.txt

cat bar.txt

Prints:

foo
foo anti-termination
/colour = 1
bar anti-repressor anti-termination
/colour = 1
baz

You may want to add an extra newline and extra blanks to your /colour = 1 string for alignment (it was not clear from your question), like so:

print("\n                     /colour = 1")
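Instead of hard-coding the blanks, one option is to copy the leading whitespace of the matching line, so the new line lines up with the qualifier above it. The sketch below is only illustrative: the function name `annotate` and the sample lines are hypothetical, and matching is case-sensitive and literal.

```python
import re


def annotate(lines, words, marker="/colour = 1"):
    """Return the lines (without newlines) with an aligned marker after each match."""
    # Escape each phrase so it is matched literally, then compile once.
    pattern = re.compile("|".join(re.escape(w) for w in words))
    out = []
    for line in lines:
        text = line.rstrip("\n")
        out.append(text)
        if pattern.search(text):
            # Reuse the leading whitespace of the matching line for alignment.
            indent = text[: len(text) - len(text.lstrip())]
            out.append(indent + marker)
    return out


# Hypothetical sample mimicking the feature-table layout from the question
sample = [
    "     CDS             61523..61744\n",
    " " * 21 + '/product="peripheral inner membrane heat-shock protein"\n',
]
print("\n".join(annotate(sample, ["heat-shock", "heat shock"])))
```

Note that this sketch does not handle a qualifier value that wraps onto the next line: if the match falls on a line whose quoted value continues below, the new line would land inside that value.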
