I need to extract gene length from a gene sequence based on CCNA number (extract values for given queries)

Question

My data is in the format shown below. For a set of specific CCNA numbers (the locus_tag values), I want to extract the matching location values in the format (202-1107, 1700-2557, ..). How can I do this in Perl?

>lcl|CP001340.1_cds_ACL93468.1_1 [locus_tag=CCNA_00001] [protein=pyruvate, phosphate dikinase regulatory protein] [protein_id=ACL93468.1] [location=202..1107] [gbkey=CDS]
GTGGTTAAGCAACCGTTAACGGATGATCCACAGGAGAGTCTGGCGCAGGGCGAGAGCGAAAGGCTGCCGC
CACGCTTCGCCACCTACTTCCATATCCACTTGGTTTCAGACTCCACAGGCGAGACGCTGAACGCGATGGC
GCGGGCGGT
>lcl|CP001340.1_cds_ACL93470.1_3 [locus_tag=CCNA_00003] [protein=shikimate 5-dehydrogenase] [protein_id=ACL93470.1] [location=1700..2557] [gbkey=CDS]
ATGACCAACGCCATCACGGGCGCGGCCATTGTCGGCGGTGTCTGCGGTCAACCGATCAAGCATTCGATGA
GCCCGGTGATCCACAACGCCTGGATCGCAGCGGCCGGCCTTGACGCGGCTTATGTGCCATTCGCCCCGGC

Look for a module that can read those kind of files. For example on metacpan. — TLP, Sep 04 '21 at 09:37
Bioperl probably has a reader for the format (looks like fasta?) — Shawn, Sep 04 '21 at 09:50
Have you tried anything? Are you having some specific issues? If so, please show us your code and explain what specific problems you have. — Dada, Sep 04 '21 at 10:31
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Community, Sep 08 '21 at 07:35

score 1 · Answer 1 · answered Sep 04 '21 at 09:50

The following sample code demonstrates how you can

define the search criteria
read the data file
form an array of records
fill a data structure with the genome data chunks
output the data matching the search criteria

use strict;
use warnings;
use feature 'say';

my($protein_id,$location) = ('ACL93470.1','1700..2557');  # search criteria
my(@chain,$genome);
my($proteins,$id);

# read genome data into array @chain 

while( <DATA> ) {
    push @chain, $genome if defined $genome and />lcl/;
    $genome = undef if />lcl/;
    $genome .= $_;
}

push @chain, $genome if defined $genome;

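# build data structure $proteins, keyed by the record id that follows ">lcl|"

for( @chain ) {
    ($id) = /^>lcl\|(\S+)/;
    my @elements = /\[(.*?)\]/g;    # "key=value" pairs from the header line
    # limit the split to 2 fields so a value containing '=' stays intact
    $proteins->{$id} = { map { split(/=/, $_, 2) } @elements };
    # note: this replaces the parsed protein name with the sequence text
    # that follows the last ']' of the header
    ($proteins->{$id}{protein}) = /([^]]*)\z/;
}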

# output data of search criteria

for( keys %$proteins ) {
    if( $proteins->{$_}{protein_id} eq $protein_id and $proteins->{$_}{location} eq $location ) {
        say "Protein ID: $protein_id\n"
          . "Location:   $location\n"
          . "Protein:    $proteins->{$_}{protein}";
    }
} 

__DATA__
>lcl|CP001340.1_cds_ACL93468.1_1 [locus_tag=CCNA_00001] [protein=pyruvate, phosphate dikinase regulatory protein] [protein_id=ACL93468.1] [location=202..1107] [gbkey=CDS]
GTGGTTAAGCAACCGTTAACGGATGATCCACAGGAGAGTCTGGCGCAGGGCGAGAGCGAAAGGCTGCCGC
CACGCTTCGCCACCTACTTCCATATCCACTTGGTTTCAGACTCCACAGGCGAGACGCTGAACGCGATGGC
GCGGGCGGT
>lcl|CP001340.1_cds_ACL93470.1_3 [locus_tag=CCNA_00003] [protein=shikimate 5-dehydrogenase] [protein_id=ACL93470.1] [location=1700..2557] [gbkey=CDS]
ATGACCAACGCCATCACGGGCGCGGCCATTGTCGGCGGTGTCTGCGGTCAACCGATCAAGCATTCGATGA
GCCCGGTGATCCACAACGCCTGGATCGCAGCGGCCGGCCTTGACGCGGCTTATGTGCCATTCGCCCCGGC

Output

Protein ID: ACL93470.1
Location:   1700..2557
Protein:
ATGACCAACGCCATCACGGGCGCGGCCATTGTCGGCGGTGTCTGCGGTCAACCGATCAAGCATTCGATGA
GCCCGGTGATCCACAACGCCTGGATCGCAGCGGCCGGCCTTGACGCGGCTTATGTGCCATTCGCCCCGGC
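
The code above searches by protein ID and location, whereas the question asks for a lookup by CCNA number. A minimal sketch of that variant follows. It assumes the CCNA number is the locus_tag field of each header and that the gene length is the inclusive span of the location (end - start + 1). The helper names locations_by_locus_tag and format_location are made up for illustration, and the sequence lines in the sample are shortened.

```perl
use strict;
use warnings;

# Hypothetical helper: map each locus_tag (CCNA number) to its location string.
# Only header lines (starting with '>') are inspected; sequence lines are skipped.
sub locations_by_locus_tag {
    my ($fh) = @_;
    my %location_of;
    while ( my $line = <$fh> ) {
        next unless $line =~ /^>/;
        my ($tag) = $line =~ /\[locus_tag=([^\]]+)\]/ or next;
        my ($loc) = $line =~ /\[location=([^\]]+)\]/  or next;
        $location_of{$tag} = $loc;
    }
    return \%location_of;
}

# Hypothetical helper: turn "202..1107" into ("202-1107", 906).
# Every start..end pair is collected, so a multi-segment location is
# summed segment by segment (an assumption about how length should be counted).
sub format_location {
    my ($loc) = @_;
    my @ranges;
    my $length = 0;
    while ( $loc =~ /<?(\d+)\.\.>?(\d+)/g ) {
        push @ranges, "$1-$2";
        $length += $2 - $1 + 1;
    }
    return ( join( ',', @ranges ), $length );
}

# Sample headers from the question; sequence lines shortened for brevity.
my $fasta = <<'END';
>lcl|CP001340.1_cds_ACL93468.1_1 [locus_tag=CCNA_00001] [protein=pyruvate, phosphate dikinase regulatory protein] [protein_id=ACL93468.1] [location=202..1107] [gbkey=CDS]
GTGGTTAAGC
>lcl|CP001340.1_cds_ACL93470.1_3 [locus_tag=CCNA_00003] [protein=shikimate 5-dehydrogenase] [protein_id=ACL93470.1] [location=1700..2557] [gbkey=CDS]
ATGACCAACG
END

# Read from an in-memory filehandle; a real script would open the FASTA file.
open my $fh, '<', \$fasta or die "Cannot open in-memory data: $!";
my $location_of = locations_by_locus_tag($fh);
close $fh;

my @queries = qw(CCNA_00001 CCNA_00003);    # CCNA numbers to look up
for my $tag (@queries) {
    my $loc = $location_of->{$tag};
    unless ( defined $loc ) {
        warn "$tag not found\n";
        next;
    }
    my ( $range, $length ) = format_location($loc);
    print "$tag\t$range\t$length\n";
}
```

For the two sample records this prints one tab-separated line per query: the CCNA number, the location as start-end, and the length (906 and 858). Reading only the header lines means the sequences never need to be held in memory.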
