I have a (legacy) Kinesis Analytics SQL application that computes the 10 most frequent items in a 1-minute window using the TOP_K_ITEMS_TUMBLING function:
CREATE OR REPLACE STREAM "TOP_N_STREAM"
    ("myItem" VARCHAR(256), "frequency" BIGINT);

CREATE OR REPLACE PUMP "TOP_N_PUMP" AS
    INSERT INTO "TOP_N_STREAM"
    SELECT STREAM *
    FROM TABLE (TOP_K_ITEMS_TUMBLING(
        CURSOR(SELECT STREAM * FROM "SOURCE_STREAM"),
        'myItem', -- column to count
        10,       -- top N
        60)       -- 1 minute window
    );
I have configured a Lambda function as the destination for this stream so that I can do some processing on these top 10 items. The problem is that not all of the data seems to be delivered to the Lambda. In particular, if one item is far more frequent than the others, that item is never delivered.
For example, given this output from TOP_N_STREAM:
item1 217342
item2 1411
item3 1284
item4 1092
item5 975
item6 661
item7 645
item8 381
item9 335
item10 319
item1 is never delivered to the Lambda, or at least it never shows up in the Lambda logs. Does anyone have a clue why this happens? Is it related to the number of shards, computing power, or concurrency?
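
To rule out the function itself dropping the record, a handler along these lines could log every record it receives and acknowledge each one by `recordId`. This is only a sketch, not the handler used here: it assumes the destination is configured for JSON output, and the names `lambda_handler` and `decode_record` are illustrative.

```python
import base64
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def decode_record(record):
    """Decode one output record. Assumes the destination format is JSON."""
    payload = base64.b64decode(record["data"]).decode("utf-8")
    return json.loads(payload)


def lambda_handler(event, context):
    """Log every delivered row and report a per-record delivery result."""
    results = []
    for record in event["records"]:
        try:
            row = decode_record(record)
            # Log the whole row so nothing is filtered out before it is seen.
            logger.info("recordId=%s row=%s", record["recordId"], row)
            result = "Ok"
        except (ValueError, KeyError) as exc:
            # Undecodable payloads are reported as failed rather than hidden.
            logger.error("could not decode record %s: %s",
                         record.get("recordId"), exc)
            result = "DeliveryFailed"
        results.append({"recordId": record["recordId"], "result": result})
    # Each incoming recordId gets an entry in the response.
    return {"records": results}
```

If item1 shows up in these logs, the loss is in my own processing code; if it does not, the record is not reaching the function at all.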