
I can't seem to find a decent example that shows how to consume an AWS Kinesis stream with Python. Can someone please point me to some examples I could look into?

Best

aliirz

2 Answers


You should use boto.kinesis:

from boto import kinesis

After you have created a stream:

Step 1: connect to AWS Kinesis:

auth = {"aws_access_key_id":"id", "aws_secret_access_key":"key"}
connection = kinesis.connect_to_region('us-east-1',**auth)

Step 2: get the stream info (for example, how many shards it has and whether it is active):

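import logging
import time

logger = logging.getLogger(__name__)

response = None
tries = 0
while tries < 10:
    tries += 1
    time.sleep(1)
    try:
        response = connection.describe_stream('stream_name')
        if response['StreamDescription']['StreamStatus'] == 'ACTIVE':
            break
    except Exception as e:
        logger.error('error while trying to describe kinesis stream : %s', e)
else:
    # while/else: this branch runs only if the loop ended without a break.
    # RuntimeError is used because TimeoutError is not a builtin on Python 2.7.
    raise RuntimeError('Stream is still not active, aborting...')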
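
The retry loop above can also be wrapped in a small helper. This is only a sketch, not part of the original answer: wait_until_active, max_tries and delay are hypothetical names, and connection is assumed to be any object that exposes describe_stream(stream_name) and returns a dict shaped like the response used above.

```python
import time


def wait_until_active(connection, stream_name, max_tries=10, delay=1.0):
    """Poll describe_stream until the stream is ACTIVE; return that response.

    Assumption: `connection` exposes describe_stream(stream_name) and returns
    a dict with a StreamDescription/StreamStatus entry, as in the answer.
    """
    last_error = None
    for _ in range(max_tries):
        try:
            response = connection.describe_stream(stream_name)
            if response['StreamDescription']['StreamStatus'] == 'ACTIVE':
                return response
        except Exception as exc:  # sketch only: narrow this in real code
            last_error = exc
        time.sleep(delay)
    raise RuntimeError('stream %r not ACTIVE after %d tries (last error: %r)'
                       % (stream_name, max_tries, last_error))
```

Returning the response directly means the caller does not have to worry about the variable being undefined when every attempt failed.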

Step 3: get all shard ids, and for each shard id get the shard iterator:

# Assumption: TRIM_HORIZON starts at the oldest record still in the shard;
# use 'LATEST' to read only records added after the iterator is created.
shard_iterator_type = 'TRIM_HORIZON'

shard_ids = []
stream_name = None
if response and 'StreamDescription' in response:
    stream_name = response['StreamDescription']['StreamName']
    for shard in response['StreamDescription']['Shards']:
        shard_id = shard['ShardId']
        shard_iterator = connection.get_shard_iterator(stream_name, shard_id, shard_iterator_type)
        shard_ids.append({'shard_id': shard_id, 'shard_iterator': shard_iterator['ShardIterator']})
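
The same step can be written as a function that returns one iterator per shard. This is a sketch under assumptions: get_shard_iterators and its iterator_type parameter are hypothetical names, and connection is assumed to expose get_shard_iterator(stream_name, shard_id, shard_iterator_type) as used above.

```python
def get_shard_iterators(connection, stream_description, iterator_type='TRIM_HORIZON'):
    """Return {shard_id: shard_iterator} for every shard in the description.

    Assumption: `stream_description` is the dict returned by describe_stream.
    """
    description = stream_description['StreamDescription']
    stream_name = description['StreamName']
    iterators = {}
    for shard in description['Shards']:
        shard_id = shard['ShardId']
        result = connection.get_shard_iterator(stream_name, shard_id, iterator_type)
        iterators[shard_id] = result['ShardIterator']
    return iterators
```

A dict keyed by shard id makes it easy to replace a shard's iterator after each read.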

Step 4: read the data for each shard

limit is the maximum number of records that you want to receive (a single call can return up to 10 MB). shard_iterator is the shard iterator from the previous step.

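# read_records is an illustrative wrapper name; the original snippet used
# `return` without showing the enclosing function.
def read_records(connection, shard_iterator, limit=None):
    tries = 0
    result = []
    while tries < 100:
        tries += 1
        response = connection.get_records(shard_iterator=shard_iterator, limit=limit)
        # Always carry the iterator forward, even when the batch is empty.
        shard_iterator = response['NextShardIterator']
        if len(response['Records']) > 0:
            for res in response['Records']:
                result.append(res['Data'])
            break
    return result, shard_iterator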

In your next call to get_records, you should use the shard_iterator that you received with the result of the previous get_records call.
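
For continuous polling, the iterator hand-off can be sketched as a generator. This is not from the original answer: poll_shard, idle_sleep and max_polls are hypothetical names, and connection is assumed to expose get_records(shard_iterator=..., limit=...) as above. The loop stops when NextShardIterator is missing or None, which is what Kinesis returns once a closed shard has been read to the end.

```python
import time


def poll_shard(connection, shard_iterator, limit=100, idle_sleep=1.0, max_polls=None):
    """Yield record payloads from one shard, carrying the iterator forward.

    Assumption: responses look like {'Records': [...], 'NextShardIterator': ...}.
    """
    polls = 0
    while shard_iterator is not None and (max_polls is None or polls < max_polls):
        polls += 1
        response = connection.get_records(shard_iterator=shard_iterator, limit=limit)
        # The next call must use the iterator returned by this one.
        shard_iterator = response.get('NextShardIterator')
        records = response['Records']
        for record in records:
            yield record['Data']
        if not records:
            # Empty batches are normal; back off briefly before polling again.
            time.sleep(idle_sleep)
```

One such generator would run per shard, for example each in its own thread or process.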

Note: a single call to get_records (even with limit = None) can return an empty list of records. Each call reads from one shard only, so you get the records whose partition keys map to that shard (when you put data into the stream, you have to supply a partition key):

connection.put_record(stream_name, data, partition_key)
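
To illustrate how a partition key selects a shard: Kinesis hashes the key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The sketch below reproduces that lookup locally. hash_key and shard_for_key are hypothetical helper names, and the shard dicts are assumed to carry the HashKeyRange fields that describe_stream returns; closed parent shards are not filtered out here.

```python
import hashlib


def hash_key(partition_key):
    """MD5 of the UTF-8 partition key, read as a 128-bit unsigned integer."""
    digest = hashlib.md5(partition_key.encode('utf-8')).hexdigest()
    return int(digest, 16)


def shard_for_key(partition_key, shards):
    """Return the ShardId whose HashKeyRange contains the key's hash, or None."""
    value = hash_key(partition_key)
    for shard in shards:
        key_range = shard['HashKeyRange']
        # The range bounds arrive as decimal strings.
        if int(key_range['StartingHashKey']) <= value <= int(key_range['EndingHashKey']):
            return shard['ShardId']
    return None
```

Because the mapping is deterministic, records that share a partition key always end up in the same shard.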
Mr. L
Eyal Ch
  • Sure, hope it helps. :) – Eyal Ch Mar 20 '14 at 07:36
  • Some of it doesn't work (shared_id?), but thanks for the clues. – Alon Mar 23 '14 at 10:28
  • Which versions of boto and Python does the above example use? With Python 2.7 and boto 2.30.0 I am able to put data into Kinesis, but when I read it I get an XML parse error. However, I am able to read and write the same data using the Java SDK. – Naveen Vijay Jul 25 '14 at 10:13
  • I am using Python 2.7 and boto 2.29.1. – Eyal Ch Jul 27 '14 at 06:44
  • @EyalCh - given that this is currently the most complete example of using boto to consume Kinesis on the internet (that I can find), would it be possible to give a little more detail as to how you would implement a continuous polling function, including how one deals with new shards on a stream, running a consumer for each shard etc? – Erve1879 Oct 11 '14 at 11:37
  • Needless to say, I've now found a pretty complete example here: https://github.com/awslabs/kinesis-poster-worker! Nevertheless, thanks for the useful answer! – Erve1879 Oct 11 '14 at 12:22
  • @EyalCh can you let me know if I can get a specific record from 'Data'? – n1tk Apr 11 '16 at 23:39
  • @sb0709 There is no such option today (nor in the boto API: http://boto.cloudhackers.com/en/latest/ref/kinesis.html). Kinesis represents a stream, and you can get records only by a position in the stream (not by record id). – Eyal Ch Apr 12 '16 at 06:13
  • @EyalCh Thank you! – n1tk Apr 12 '16 at 16:39
  • @EyalCh - Is there any scope where we can put a record under a different USER, ROLE and ARN? – asur Nov 23 '18 at 10:26

While this question has already been answered, it might be a good idea for future readers to consider using the Kinesis Client Library (KCL) for Python instead of using boto directly. It simplifies consuming from the stream when you have multiple consumer instances and/or a changing shard configuration.

https://aws.amazon.com/blogs/aws/speak-to-kinesis-in-python/

A more complete enumeration of what the KCL provides:

  • Connects to the stream
  • Enumerates the shards
  • Coordinates shard associations with other workers (if any)
  • Instantiates a record processor for every shard it manages
  • Pulls data records from the stream
  • Pushes the records to the corresponding record processor
  • Checkpoints processed records (it uses DynamoDB so your code doesn't have to manually persist the checkpoint value)
  • Balances shard-worker associations when the worker instance count changes
  • Balances shard-worker associations when shards are split or merged

The items in bold are the ones where I think the KCL really provides non-trivial value over boto. But depending on your use case, boto may be much, much simpler.
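
To make the checkpointing item concrete: when using boto directly you have to persist the last processed sequence number per shard yourself. The sketch below is a local stand-in that writes to a JSON file instead of DynamoDB; it is not how the KCL is implemented, and FileCheckpointer and its method names are hypothetical.

```python
import json
import os


class FileCheckpointer(object):
    """Persist the last processed sequence number per shard in a JSON file.

    Assumption: a single consumer process owns the file; writes are not atomic.
    """

    def __init__(self, path):
        self.path = path
        self._state = {}
        if os.path.exists(path):
            with open(path) as handle:
                self._state = json.load(handle)

    def checkpoint(self, shard_id, sequence_number):
        # Record progress after a batch has been fully processed.
        self._state[shard_id] = sequence_number
        with open(self.path, 'w') as handle:
            json.dump(self._state, handle)

    def last_sequence_number(self, shard_id):
        # None means the shard has never been checkpointed.
        return self._state.get(shard_id)
```

On restart, a stored value could be passed to get_shard_iterator with the AFTER_SEQUENCE_NUMBER iterator type so that reading resumes just after the last checkpointed record; each record returned by get_records carries its own SequenceNumber.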

jumand
  • Where was this when I was suffering :( – aliirz Sep 21 '15 at 05:02
  • Also, if you haven't used kinesis before, you may run into this. It's not directly related to the KCL, but the KCL helps make this scenario a little more mysterious. http://stackoverflow.com/questions/32863095/expected-behavior-for-aws-kinesis-sharditeratortype-trim-horizon – jumand Oct 21 '15 at 13:21
  • I really dug into the KCL and found that it calls a daemon running in Java. This makes debugging and customizing my Python code really difficult. Boto gives me full control, and I have to know how it works internally. – Robin Loxley Feb 01 '16 at 07:25
  • Using boto is definitely more straightforward, and it's clearer what's happening. But using the KCL "automatically" takes care of some non-trivial tasks if you have a more complex deployment. I'll update my original answer to point out some of those tasks. I'm not pushing KCL over the use of boto, just explaining when/how the KCL might outweigh boto's simplicity. – jumand Feb 02 '16 at 04:24