0

Is it possible to output the gene location for a CDS feature or do I need to parse the 'location' or 'complement' field myself?

For example,

seq = Sequence.read(genbank_fp, format='genbank')
for feature in seq.metadata['FEATURES']:
    if feature['type_'] == 'CDS':
        if 'location' in feature:
            print 'location = ', feature['location']
        elif 'complement' in feature:
            print 'location = ', feature['complement']
        else:
            raise ValueError('positions for gene %s not found' % feature['protein_id'])

would output:

location = <1..206

location = 687..3158

for this sample GenBank file.

This functionality is possible in BioPython (see this thread) where I can output the positions already parsed (ex. start = 687, end = 3158).

Thanks!

1 Answers1

1

For the example, you can get the Sequence object for the feature only, using the following code:

# column index in positional metadata
col = feature['index_']
loc = seq.positional_metadata[col]
feature_seq = seq[loc]
# if the feature is on reverse strand
if feature['rc_']:
    feature_seq = feature_seq.reverse_complement()

Note: the GenBank parser is newly added in the development branches.

RNA
  • 146,987
  • 15
  • 52
  • 70
  • Thanks for the reply! However I'm looking for the actual start and end positions of the gene (not the gene sequence itself). To extend on your answer, I need to find the index of the first and last True element in the loc Series: `start_pos = loc[loc == True].index[0]; end_pos = loc[loc == True].index[-1]` – Jenya Kopylova Sep 24 '15 at 13:03