To manually modify a GFF file, I need to find the start position of my gene in my animal's FASTA-formatted genome (i.e. at which base number in the sequence does the gene start?). I have the sequence of this gene.

How do I do this as easily as possible (this is not an animal whose genome is readily available on the internet)?

What I have: the genome, in FASTA format; a GFF file containing an annotation for this organism's genome (which sorely needs updating); the sequence of this gene.

Thank you!

kdickson

1 Answer

If you know that the gene sequence is identical to the corresponding stretch of the reference genome, you can search for it directly in Python:

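import re

# your_gene_seq and your_genome_seq are plain strings in the same case;
# only the forward strand is searched here
match = re.search(re.escape(your_gene_seq), your_genome_seq)
if match:
    # match.start() is 0-based; GFF coordinates are 1-based and inclusive
    gene_start = match.start() + 1
    gene_end = match.end()
else:
    print("no match")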
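
The snippet above assumes the genome is already loaded as one string and only looks at the forward strand. Here is a minimal sketch, standard library only, of reading a multi-record FASTA file and searching both strands. The names read_fasta, reverse_complement and find_gene are my own illustrative helpers (not from any library), and it assumes plain A/C/G/T/N sequence.

```python
def read_fasta(path):
    """Return {record_id: sequence} for a FASTA file, sequences upper-cased."""
    records, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records[name] = "".join(chunks).upper()
                # Use the first word of the header as the record ID
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks).upper()
    return records


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T/N sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_gene(records, gene_seq):
    """Yield (record_id, start, end, strand) for every exact hit.

    Coordinates are 1-based and inclusive, as in GFF, and always refer
    to the forward strand of the record.
    """
    gene_seq = gene_seq.upper()
    for strand, query in (("+", gene_seq), ("-", reverse_complement(gene_seq))):
        for name, seq in records.items():
            pos = seq.find(query)
            while pos != -1:
                yield name, pos + 1, pos + len(query), strand
                pos = seq.find(query, pos + 1)
```

Usage would look something like list(find_gene(read_fasta("genome.fasta"), your_gene_seq)), where genome.fasta is a placeholder path. More than one hit means the sequence is not unique in the assembly, so check which one matches your annotation.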

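Otherwise, you will need to do a pairwise alignment of your gene to the reference. This is only practical against a short reference, such as a single scaffold or a candidate region, because the alignment matrix grows with the product of the two sequence lengths.

Using Biopython: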

python -m pip install biopython

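from Bio import pairwise2

# alignment scores: match = 5, mismatch = -4, gap open = -2, gap extend = -0.5
# local alignment is used because a global alignment always reports begin = 0
alignment = pairwise2.align.localms(your_gene_seq, your_genome_seq, 5, -4, -2, -0.5, one_alignment_only=True)[0]
# alignment[3] is the start of the aligned region within the gapped alignment;
# count the non-gap genome characters before it, then add 1 for 1-based GFF coordinates
gene_start = len(alignment[1][:alignment[3]].replace("-", "")) + 1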
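
To show how a start coordinate comes out of a local alignment, here is a toy Smith-Waterman sketch in plain Python. It is not what Biopython does internally: it uses a single linear gap penalty instead of separate open/extend penalties, and the function name local_align_start is just an illustrative choice. It only makes sense for a gene against a small candidate region, not a whole genome.

```python
def local_align_start(gene, region, match=5, mismatch=-4, gap=-2):
    """Best local alignment of `gene` within `region` (Smith-Waterman).

    Returns (score, start, end), where start and end are the 1-based
    inclusive coordinates of the aligned stretch of `region`, or None
    if nothing aligns with a positive score.
    """
    rows, cols = len(gene) + 1, len(region) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if gene[i - 1] == region[j - 1] else mismatch
            cell = max(0,
                       score[i - 1][j - 1] + sub,   # align the two bases
                       score[i - 1][j] + gap,       # gap in the region
                       score[i][j - 1] + gap)       # gap in the gene
            score[i][j] = cell
            if cell > best:
                best, best_cell = cell, (i, j)
    if best == 0:
        return None
    # Trace back from the best cell until the score drops to zero
    i, j = best_cell
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if gene[i - 1] == region[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j + 1, end
```

If the region was cut out of a larger scaffold, remember to add the region's offset back to get scaffold coordinates before writing them to the GFF.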

To update the GFF file, use the GFF parser described on the Biopython wiki:

https://biopython.org/wiki/GFF_Parsing
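
If you only need to change the coordinates of one feature, a few lines of plain Python may be enough. This is a rough sketch that assumes a tab-delimited, nine-column GFF3 file with ID=... attributes; update_gff_line and update_feature are hypothetical helper names, and gene0001 in the usage note is a made-up feature ID.

```python
def update_gff_line(line, start, end, strand=None):
    """Return a GFF feature line with new start/end (columns 4 and 5).

    Comment lines, directives and lines without nine tab-separated
    columns are returned unchanged.
    """
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(fields) != 9:
        return line
    if start > end:
        raise ValueError("GFF requires start <= end")
    fields[3], fields[4] = str(start), str(end)
    if strand is not None:
        fields[6] = strand  # column 7 holds the strand
    return "\t".join(fields) + newline


def update_feature(lines, feature_id, start, end, strand=None):
    """Yield all lines, rewriting the one whose attributes contain ID=feature_id."""
    tag = f"ID={feature_id}"
    for line in lines:
        attributes = line.rstrip("\n").split("\t")[-1].split(";")
        if tag in attributes:
            yield update_gff_line(line, start, end, strand)
        else:
            yield line
```

For example, open the old file for reading and a new one for writing, then call out.writelines(update_feature(old, "gene0001", 4, 8)). Child features such as mRNA and exon lines have their own coordinates and would need the same treatment.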

Colin Anthony