Please guide me in converting a GenBank sequence to its equivalent FASTA format using BioSmalltalk (Pharo edition). I have already figured out how to read a GenBank file from disk:

| file x y m |
x := Time millisecondClockValue.
file := BioFile on: (FileStream readOnlyFileNamed: BioObject testFilesDirectoryName asFileReference / 'AF165912.gbk').
m := BioParser tokenizeMultiFasta: file contents.
y := Time millisecondClockValue.
Transcript open.
Transcript clear.
Transcript show: m; cr.

Now I want to get its FASTA equivalent.

Dip Moitra
  • Welcome to Stack Overflow! For some reason you chose not to take the [Tour](http://stackoverflow.com/tour) when registering as a new user; please do so as soon as possible. As it is, your question is not on topic for Stack Overflow. Please read [What topics can I ask about here?](http://stackoverflow.com/help/on-topic) and [How do I ask a good question?](http://stackoverflow.com/help/how-to-ask). – Jongware Sep 26 '14 at 09:56

1 Answer

The GenBank format is (supposed to be) human-readable, but it is not easy to parse. A lot of effort has been spent, and is still being spent today, on programming libraries that parse the flat GenBank format, dating back to when the XML format wasn't available or wasn't used at all. One of the goals behind BioSmalltalk is to focus on complexity reduction, which implies using the right tools. For that reason, a GenBank flat-file parser is not included, in order to favor use of the GenBank XML format.

To give it a try, first install the latest BioSmalltalk in a clean Pharo 3.0 image by evaluating the following command:

$ pharo Pharo.image "config" "http://smalltalkhub.com/mc/hernan/BioSmalltalk" "ConfigurationOfBioSmalltalk" --printVersion --install=development

or its equivalent from inside the image:

Gofer it
  smalltalkhubUser: 'hernan' project: 'BioSmalltalk';
  configuration;
  loadDevelopment.

To parse a GenBank XML formatted file, I highly recommend re-downloading your files in XML format in a reproducible way. If you downloaded your files from NCBI, you can use the Entrez e-Utils BioSmalltalk client (NCBI has currently removed the XML download option from its web page).

The following script downloads two GenBank records in XML, filters the nodes by sequence definition and sequence string, and exports them in FASTA format. The sequence is in the GBSeq_sequence node.

| gbReader fastaCollection seqsWithDefs |
fastaCollection := BioFastaMultiRecord new.
gbReader := (BioEntrezClient new nuccore
    uids: #(57240072 57240071);
    setModeXML;
    fetch) reader.
seqsWithDefs := gbReader
    selectNodes: #('GBSeq_definition' 'GBSeq_sequence')
    in: gbReader contents.
(seqsWithDefs at: 'GBSeq_definition') with: (seqsWithDefs at: 'GBSeq_sequence') do: [ : defs : seqs |
    fastaCollection addFastaRecord: (BioFastaRecord named: defs value sequence: seqs value) ].
BioFASTAFormatter new exportFrom: fastaCollection sequences.
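For readers who want to see what the script above amounts to outside BioSmalltalk, here is a minimal Python sketch using only the standard library. It is not how BioSmalltalk implements anything. It assumes that the NCBI EFetch endpoint called with `db=nuccore`, `rettype=gb` and `retmode=xml` returns the same GBSeq XML that the BioSmalltalk client reads. The function names `efetch_url` and `genbank_xml_to_fasta`, the 70-column line width, and the two sample records are made up for illustration and are not real GenBank data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# NCBI E-utilities EFetch endpoint (assumed to serve GenBank XML for nuccore).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(uids):
    """Build an EFetch URL asking for the given nuccore UIDs as GenBank XML."""
    params = {
        "db": "nuccore",
        "id": ",".join(str(uid) for uid in uids),
        "rettype": "gb",
        "retmode": "xml",
    }
    # Keep the comma between UIDs unescaped so the URL stays readable.
    return EFETCH + "?" + urlencode(params, safe=",")


def genbank_xml_to_fasta(xml_text, width=70):
    """Turn GenBank XML (GBSet/GBSeq) into FASTA text.

    Each GBSeq contributes one record: a '>' header taken from
    GBSeq_definition, then GBSeq_sequence wrapped at `width` columns.
    """
    root = ET.fromstring(xml_text)
    records = []
    for seq in root.iter("GBSeq"):
        definition = seq.findtext("GBSeq_definition", default="").strip()
        sequence = seq.findtext("GBSeq_sequence", default="").strip()
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append("\n".join([">" + definition] + lines))
    return "\n".join(records) + "\n"


# Hand-written sample in the GBSeq layout; the content is invented.
SAMPLE = """<GBSet>
  <GBSeq>
    <GBSeq_definition>Hypothetical record one</GBSeq_definition>
    <GBSeq_sequence>acgtacgtac</GBSeq_sequence>
  </GBSeq>
  <GBSeq>
    <GBSeq_definition>Hypothetical record two</GBSeq_definition>
    <GBSeq_sequence>ttga</GBSeq_sequence>
  </GBSeq>
</GBSet>"""

if __name__ == "__main__":
    print(efetch_url([57240072, 57240071]))
    print(genbank_xml_to_fasta(SAMPLE), end="")
```

The sketch only builds the request URL and never contacts NCBI, so it can be tried offline. Fetching the URL (for example with `urllib.request.urlopen`) and passing the response text to `genbank_xml_to_fasta` would be the next step, under the same assumptions.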

If you are starting with Smalltalk, remember you have pure objects and almost everything can be discovered through the inspector/explorer tools.

Hernán
  • Dear Hernán, thanks for your valuable advice, but I beg to differ a bit. We can convert GenBank files on disk using several toolkits such as BioPerl or BioJava, and I wish I could do the same using BioSmalltalk. By the way, I am presently using BioSmalltalk 0.5. The BioPerl code I am using for this is as follows: `my $infilename = 'AF165912.gbk'; my $outfilename = 'out_PerlAF165912.fa'; #reads an array of sequences @seq_object_array = read_all_sequences($infilename,'genbank'); write_sequence(">$outfilename", 'fasta', @seq_object_array);` Thanks for your help. – Dip Moitra Sep 29 '14 at 17:54