I have a Flink job running in AWS Kinesis Analytics that does the following:
1 - I have a table on a Kinesis stream, called MainEvents.
2 - I have a sink table pointing to a Kinesis stream, called perMinute.
The perMinute table is populated using the MainEvents table as input and generates a sliding window (HOP) aggregation.
So far so good.
My final consumer is a Python script that reads the output from the perMinute Kinesis stream.
This is my consumer script:
import time

import boto3

stream_name = 'perMinute'
ses = boto3.session.Session()
kinesis_client = ses.client('kinesis')

# Read from the first shard, starting at the tip of the stream
response = kinesis_client.describe_stream(StreamName=stream_name)
shard_id = response['StreamDescription']['Shards'][0]['ShardId']

response = kinesis_client.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType='LATEST'
)
shard_iterator = response['ShardIterator']

while shard_iterator is not None:
    result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=1)
    records = result["Records"]
    shard_iterator = result["NextShardIterator"]
    for record in records:
        data = str(record["Data"])
        print(data)
    time.sleep(1)
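What I ultimately want is one parsed JSON document per record, something along these lines inside the consumer loop (json is the standard library module; the keys match my sink schema):

import json

# Decode the raw bytes of one record, then parse the JSON payload
payload = json.loads(record["Data"].decode('utf-8'))
print(payload["window_start"], payload["window_end"], payload["counts"])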
The issue I have is that I get encoded data that looks like this:
b'{"window_start":"2022-09-28 04:01:46","window_end":"2022-09-28 04:02:46","counts":300}'
b'{"window_start":"2022-09-28 04:02:06","window_end":"2022-09-28 04:03:06","counts":478}'
b'\xf3\x89\x9a\xc2\n$4a540599-485d-47c5-9a7e-ca46173b30de\n$2349a5a3-7949-4bde-95a8-4019a077586b\x1aX\x08\x00\x1aT{"window_start":"2022-09-28 04:02:16","window_end":"2022-09-28 04:03:16","counts":504}\x1aX\x08\x01\x1aT{"window_start":"2022-09-28 04:02:18","window_end":"2022-09-28 04:03:18","counts":503}\xc3\xa1\xfe\xfa9j\xeb\x1aP\x917F\xf3\xd2\xb7\x02'
b'\xf3\x89\x9a\xc2\n$23a0d76c-6939-4eda-b5ee-8cd2b3dc1c1e\n$7ddf1c0c-16fe-47a0-bd99-ef9470cade28\x1aX\x08\x00\x1aT{"window_start":"2022-09-28 04:02:30","window_end":"2022-09-28 04:03:30","counts":531}\x1aX\x08\x01\x1aT{"window_start":"2022-09-28 04:02:36","window_end":"2022-09-28 04:03:36","counts":560}\x0c>.\xbd\x0b\xac.\x9a\xe8z\x04\x850\xd5\xa6\xb3'
b'\xf3\x89\x9a\xc2\n$2cacfdf8-a09b-4fa3-b032-6f1707c966c3\n$27458e17-8a3a-434e-9afd-4995c8e6a1a4\n$11774332-d906-4486-a959-28ceec0d134a\x1aY\x08\x00\x1aU{"window_start":"2022-09-28 04:02:42","window_end":"2022-09-28 04:03:42","counts":1625}\x1aY\x08\x01\x1aU{"window_start":"2022-09-28 04:02:50","window_end":"2022-09-28 04:03:50","counts":2713}\x1aY\x08\x02\x1aU{"window_start":"2022-09-28 04:03:00","window_end":"2022-09-28 04:04:00","counts":3009}\xe1G\x18\xe7_a\x07\xd3\x81O\x03\xf9Q\xaa\x0b_'
Some records are valid, like the first two, but the other records seem to have multiple entries packed into the same row.
How can I get rid of the extra characters that are not part of the JSON payload and get one JSON payload per line?
If I use decode('utf-8'), I get a few records out OK, but at some point it fails with:
while shard_iterator is not None:
    result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=1)
    records = result["Records"]
    shard_iterator = result["NextShardIterator"]
    for record in records:
        data = record["Data"].decode('utf-8')
        # data = record["Data"].decode('latin-1')
        print(data)
    time.sleep(1)
{"window_start":"2022-09-28 03:59:24","window_end":"2022-09-28 04:00:24","counts":319}
{"window_start":"2022-09-28 03:59:28","window_end":"2022-09-28 04:00:28","counts":366}
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-108-0e632a57c871> in <module>
39 shard_iterator = result["NextShardIterator"]
40 for record in records:
---> 41 data = record["Data"].decode('utf-8')
43 print(data)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte
If I use decode('latin-1') it does not fail (presumably because latin-1 maps every byte value to some character), but I get a lot of garbage text out:
{"window_start":"2022-09-28 04:02:06","window_end":"2022-09-28 04:03:06","counts":478}
óÂ
$4a540599-485d-47c5-9a7e-ca46173b30de
$2349a5a3-7949-4bde-95a8-4019a077586bXT{"window_start":"2022-09-28 04:02:16","window_end":"2022-09-28 04:03:16","counts":504}XT{"window_start":"2022-09-28 04:02:18","window_end":"2022-09-28 04:03:18","counts":503}áþú9jëP7FóÒ·
óÂ
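Looking more closely, the failing records all start with b'\xf3\x89\x9a\xc2', which I believe is the magic number the Kinesis Producer Library puts in front of aggregated records, so I suspect the Flink sink is packing several rows into one Kinesis record. A sketch of the consumer-side fix I'm considering, assuming that is the cause, using the aws-kinesis-agg package (pip install aws-kinesis-agg); its documented input is the Lambda event shape, so I wrap the boto3 records into that shape first:

import base64

from aws_kinesis_agg.deaggregator import deaggregate_records

# kinesis_client and shard_iterator set up exactly as in the script above
while shard_iterator is not None:
    result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=1)
    shard_iterator = result["NextShardIterator"]

    # Wrap each boto3 record in the Lambda event shape the deaggregator
    # expects, with the payload base64-encoded under ['kinesis']['data']
    wrapped = [{
        'kinesis': {
            'data': base64.b64encode(rec['Data']).decode('utf-8'),
            'partitionKey': rec['PartitionKey'],
            'sequenceNumber': rec['SequenceNumber'],
        }
    } for rec in result["Records"]]

    # Aggregated records are expanded into one entry per user record;
    # non-aggregated records should pass through unchanged
    for record in deaggregate_records(wrapped):
        data = base64.b64decode(record['kinesis']['data']).decode('utf-8')
        print(data)
    time.sleep(1)

I am not certain this is the intended way to feed boto3 records into the library, so corrections are welcome.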
Here is the stream producer Flink code:
-- create sink
CREATE TABLE perMinute (
    window_start TIMESTAMP(3) NOT NULL,
    window_end TIMESTAMP(3) NOT NULL,
    counts BIGINT NOT NULL
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'perMinute',
    'aws.region' = 'ap-southeast-2',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
);
%flink.ssql(type=update)
INSERT INTO perMinute
SELECT window_start, window_end, COUNT(DISTINCT event) AS counts
FROM TABLE(
    HOP(TABLE MainEvents, DESCRIPTOR(eventtime), INTERVAL '5' SECOND, INTERVAL '60' SECOND))
GROUP BY window_start, window_end;
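Alternatively, would disabling aggregation on the Flink sink itself fix this? My understanding (an assumption on my part) is that the connector forwards 'sink.producer.*' options to the Kinesis Producer Library, so 'sink.producer.aggregation-enabled' = 'false' should map to the KPL's AggregationEnabled setting and write one Kinesis record per row:

CREATE TABLE perMinute (
    window_start TIMESTAMP(3) NOT NULL,
    window_end TIMESTAMP(3) NOT NULL,
    counts BIGINT NOT NULL
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'perMinute',
    'aws.region' = 'ap-southeast-2',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601',
    -- assumption: forwarded to the KPL as AggregationEnabled=false,
    -- so each row becomes its own Kinesis record
    'sink.producer.aggregation-enabled' = 'false'
);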
Thanks