
I have a Flink job running in AWS Kinesis Data Analytics that does the following:

1 - I have a source table on a Kinesis stream, called MainEvents.

2 - I have a sink table pointing to another Kinesis stream, called perMinute.

The perMinute table is populated using MainEvents as input and contains a sliding-window (HOP) aggregation.

So far so good.

My final consumer is a Python script that reads records from the perMinute Kinesis stream.

This is my consumer script:

    import time

    import boto3

    stream_name = 'perMinute'
    ses = boto3.session.Session()
    kinesis_client = ses.client('kinesis')

    # Read from the first (and only) shard of the stream.
    response = kinesis_client.describe_stream(StreamName=stream_name)
    shard_id = response['StreamDescription']['Shards'][0]['ShardId']

    response = kinesis_client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType='LATEST'
    )
    shard_iterator = response['ShardIterator']

    while shard_iterator is not None:
        result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=1)
        records = result["Records"]
        shard_iterator = result["NextShardIterator"]
        for record in records:
            # NOTE: str() on bytes produces the b'...' repr shown below.
            data = str(record["Data"])
            print(data)
            time.sleep(1)

The issue I have is that I get encoded data that looks like this:

    b'{"window_start":"2022-09-28 04:01:46","window_end":"2022-09-28 04:02:46","counts":300}'
b'{"window_start":"2022-09-28 04:02:06","window_end":"2022-09-28 04:03:06","counts":478}'
b'\xf3\x89\x9a\xc2\n$4a540599-485d-47c5-9a7e-ca46173b30de\n$2349a5a3-7949-4bde-95a8-4019a077586b\x1aX\x08\x00\x1aT{"window_start":"2022-09-28 04:02:16","window_end":"2022-09-28 04:03:16","counts":504}\x1aX\x08\x01\x1aT{"window_start":"2022-09-28 04:02:18","window_end":"2022-09-28 04:03:18","counts":503}\xc3\xa1\xfe\xfa9j\xeb\x1aP\x917F\xf3\xd2\xb7\x02'
b'\xf3\x89\x9a\xc2\n$23a0d76c-6939-4eda-b5ee-8cd2b3dc1c1e\n$7ddf1c0c-16fe-47a0-bd99-ef9470cade28\x1aX\x08\x00\x1aT{"window_start":"2022-09-28 04:02:30","window_end":"2022-09-28 04:03:30","counts":531}\x1aX\x08\x01\x1aT{"window_start":"2022-09-28 04:02:36","window_end":"2022-09-28 04:03:36","counts":560}\x0c>.\xbd\x0b\xac.\x9a\xe8z\x04\x850\xd5\xa6\xb3'
b'\xf3\x89\x9a\xc2\n$2cacfdf8-a09b-4fa3-b032-6f1707c966c3\n$27458e17-8a3a-434e-9afd-4995c8e6a1a4\n$11774332-d906-4486-a959-28ceec0d134a\x1aY\x08\x00\x1aU{"window_start":"2022-09-28 04:02:42","window_end":"2022-09-28 04:03:42","counts":1625}\x1aY\x08\x01\x1aU{"window_start":"2022-09-28 04:02:50","window_end":"2022-09-28 04:03:50","counts":2713}\x1aY\x08\x02\x1aU{"window_start":"2022-09-28 04:03:00","window_end":"2022-09-28 04:04:00","counts":3009}\xe1G\x18\xe7_a\x07\xd3\x81O\x03\xf9Q\xaa\x0b_'

Some records are valid (the first two), but the other records seem to contain multiple entries packed into the same record.

How can I get rid of the extra characters that are not part of the JSON payload and get one record per line?
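
While digging, I noticed the malformed records all start with the bytes \xf3\x89\x9a\xc2, which looks like the KPL aggregation magic number, so my guess is that the Flink sink is packing several rows into one Kinesis record. If that is right, would a deaggregation step along these lines work? This is just a sketch I have not verified; it assumes the aws-kinesis-agg package, and that its iter_deaggregate_records helper takes and returns Lambda-style record dicts, which is how I read its docs.

    # Sketch only, not verified: pip install aws-kinesis-agg
    import base64
    import json

    from aws_kinesis_agg.deaggregator import iter_deaggregate_records

    def iter_payloads(records):
        """Yield one JSON payload per user record, deaggregating as needed."""
        # Re-shape boto3 get_records output into the Lambda event format
        # that, as far as I can tell, the library expects.
        lambda_style = [
            {'kinesis': {'data': base64.b64encode(r['Data']).decode('ascii')}}
            for r in records
        ]
        # Non-aggregated records should pass through unchanged; aggregated
        # ones should be split into their individual user records.
        for user_record in iter_deaggregate_records(lambda_style):
            payload = base64.b64decode(user_record['kinesis']['data'])
            yield json.loads(payload)

    # In the consumer loop above, the per-record print would then become:
    #     for payload in iter_payloads(records):
    #         print(payload)

I would rather avoid the extra dependency, though, if there is a way to make Flink emit one JSON document per Kinesis record (see the sink question further down).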

If I use decode('utf-8'), I get a few records out OK, but at some point it fails with:

    while shard_iterator is not None:
        result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=1)
        records = result["Records"]
        shard_iterator = result["NextShardIterator"]
        for record in records:
            data = record["Data"].decode('utf-8')
            # data = record["Data"].decode('latin-1')
            print(data)
            time.sleep(1)

    {"window_start":"2022-09-28 03:59:24","window_end":"2022-09-28 04:00:24","counts":319}
    {"window_start":"2022-09-28 03:59:28","window_end":"2022-09-28 04:00:28","counts":366}
    ---------------------------------------------------------------------------
    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-108-0e632a57c871> in <module>
         39     shard_iterator = result["NextShardIterator"]
         40     for record in records:
    ---> 41         data = record["Data"].decode('utf-8')
         43         print(data)

    UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte

If I use decode('latin-1'), it does not fail, but I get a lot of garbage text out:

{"window_start":"2022-09-28 04:02:06","window_end":"2022-09-28 04:03:06","counts":478}
óÂ
$4a540599-485d-47c5-9a7e-ca46173b30de
$2349a5a3-7949-4bde-95a8-4019a077586bXT{"window_start":"2022-09-28 04:02:16","window_end":"2022-09-28 04:03:16","counts":504}XT{"window_start":"2022-09-28 04:02:18","window_end":"2022-09-28 04:03:18","counts":503}áþú9jëP7FóÒ·
óÂ

Here is the Flink code that produces to the stream:

    -- create sink
    CREATE TABLE perMinute (
        window_start TIMESTAMP(3) NOT NULL,
        window_end TIMESTAMP(3) NOT NULL,
        counts BIGINT NOT NULL
    )
    WITH (
        'connector' = 'kinesis',
        'stream' = 'perMinute',
        'aws.region' = 'ap-southeast-2',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json',
        'json.timestamp-format.standard' = 'ISO-8601'
    );

    %flink.ssql(type=update)
    insert into perMinute
    SELECT window_start, window_end, COUNT(DISTINCT event) as counts
    FROM TABLE(
        HOP(TABLE MainEvents, DESCRIPTOR(eventtime), INTERVAL '5' SECOND, INTERVAL '60' SECOND))
    GROUP BY window_start, window_end;
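
Alternatively, is there a way to stop the sink from aggregating records in the first place? My understanding, which I have not confirmed, is that the connector forwards sink.producer.* options to the Kinesis Producer Library, whose AggregationEnabled setting controls this batching. If so, something like this on the sink definition might be enough:

    -- Sketch, not verified: the same sink as above, with producer-side
    -- aggregation switched off via a KPL passthrough option.
    WITH (
        'connector' = 'kinesis',
        'stream' = 'perMinute',
        'aws.region' = 'ap-southeast-2',
        'format' = 'json',
        'json.timestamp-format.standard' = 'ISO-8601',
        'sink.producer.aggregation-enabled' = 'false'
    );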

Thanks

  • I think it would be better for you to share the Flink code instead. At least if you trust boto3 docs: "The data in the blob is both opaque and immutable to Kinesis Data Streams, which does not inspect, interpret, or change the data in the blob in any way." – bzu Oct 01 '22 at 09:49
  • added the flink code that populates the stream @bzu – Up_One Oct 01 '22 at 19:56
  • 1
    Looks quite weird. I would: 1. put some test JSON to the same Kinesis stream with boto3 to verify your reading code works OK and 2. output perMinute table to S3 to see what the records look like there. – bzu Oct 02 '22 at 09:09
