0

I'm trying to write a script that retrieves data from GenBank files. I only need the info until the COMMENT part of the annotation.

This is my input:

LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the
            Ensembl or EnsemblGenomes web site, http://www.ensembl.org/ or
            http://www.ensemblgenomes.org/ for more information.

The current script:

$genbank = <STDIN>;
chomp ($genbank);
open (READ, "<$genbank") or die;
@data = <READ>;
close READ;

$end= $#data;
for ($line= 0; $line<= $end; $line++){
    if ($data[$line] =~ /LOCUS/){
        @annotation = (@annotation, $data[$line]);
        until ($data[$line] =~ /COMMENT/){
            $line++;
            @annotation = (@annotation, $data[$line]);
}}}
print @annotation;

And its OUTPUT:


LOCUS       mitochondrion_genome 19524 bp DNA HTG 17-DEC-2022
DEFINITION  Drosophila melanogaster primary_assembly mitochondrion_genome BDGP6.32 full
            sequence 1..19524 reannotated via EnsEMBL
ACCESSION   primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1
VERSION     mitochondrion_genomeBDGP6.32
KEYWORDS    .
SOURCE      fruit fly
  ORGANISM  Drosophila melanogaster
            Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria;
            Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata;
            Pancrustacea; Hexapoda; Insecta; Dicondylia; Pterygota; Neoptera;
            Endopterygota; Diptera; Brachycera; Muscomorpha; Eremoneura;
            Cyclorrhapha; Schizophora; Acalyptratae; Ephydroidea;
            Drosophilidae; Drosophilinae.
COMMENT     This sequence was annotated by FlyBase (https://www.flybase.org). Please visit the

As you can see, there's an issue with this method.

How can I modify the code so it retrieves the data but stops at COMMENT and doesn't retrieve the entire line?

The first line of all GenBank files starts with LOCUS and I suppose this can be used to write a better code (so it can be done without a regex match to the word). I'm clueless on how it can be done though. I will really appreciate your input!!

  • 1
    This could work? `last if($line =~ /COMMENT/);` – vkk05 Jun 05 '23 at 11:18
  • @vkk05 Just so I understand properly. You're suggesting that I replace the "until ($data[$line] =~ /COMMENT/)" with the line you provided, right? –  Jun 05 '23 at 19:51
  • cross posted: https://www.biostars.org/p/9565435/ – Pierre Jun 05 '23 at 21:27
  • @Pierre Is cross-posting not allowed? –  Jun 05 '23 at 22:05
  • If you do a lot of work with Perl and bioinformatics file formats (Genbank, Fasta _etc_.), [learn BioPerl](https://bioperl.org/howtos/index.html). It was developed so as people don't have to keep reinventing the wheel for these kinds of tasks. – neilfws Jun 05 '23 at 22:50
  • 1
    @neilfws oh, yes! I kept getting pointed to BioPerl while researching GenBank parsing and I know how useful it is, but my bioinfo teacher is very keen on having us reinvent the wheel with this kind of exercises. –  Jun 05 '23 at 23:00
  • @SteerClearOfFungi yes, right. – vkk05 Jun 06 '23 at 05:19
  • @vkk05 I couldn't apply it with the messy code I had, but somebody showed me how to use the method you suggested so rewrote it and it works flawlessly. Thank you very much! –  Jun 06 '23 at 06:51
  • @SteerClearOfFungi no but it's better to tell about it so people don't spoil their time on finding a solution while a solution was found elsewhere. – Pierre Jun 06 '23 at 07:02

1 Answers1

1

That looks a lot more complicated than it needs to be. I'd go with something like:

use strict;
use warnings;

my $print = 0;
open(IN, '<', 'genbank.txt');
open(OUT, ">genbank-out-without-comment.txt") or die "opsala $!";
while(<IN>){
  if(/^LOCUS\s/){
    $print = 1;
  }
  if(/^COMMENT\s/i){
    print "\n"; # preserve new line between entries
    $print = 0;
  }
  print OUT if $print;
}
close(OUT);
vwhh
  • 3
  • 1
Tom Melly
  • 353
  • 1
  • 8
  • This works great and is a way better method!! Now I'll try to figure out how to extract the retrieved data in a file with your method. –  Jun 05 '23 at 22:32