OK, I'll start with a detailed use case and then explain my question:

  1. I use a 3rd party web analytics platform which utilizes AWS Kinesis streams in order to pass data from the client into the final destination - a Kinesis stream;
  2. The web analytics platform uses 2 streams:
    1. A data collector stream (single shard stream);
    2. A second stream to enrich the raw data from the collector stream (single shard stream); Most importantly, this stream consumes the raw data from the first stream using TRIM_HORIZON iterator type;
  3. I consume the data from the stream using the AWS Java SDK, specifically using the GetShardIteratorRequest class;
  4. I'm currently developing the extraction class, so this is done synchronously, meaning I consume data only when I compile my class;
  5. The class surprisingly works, although there are some things that I fail to understand, specifically with respect to how the data is consumed from the stream and the meaning of each one of iterator types;

My problem is that the data I retrieve is inconsistent and follows no chronological order.

  • When I use AT_SEQUENCE_NUMBER and provide the first sequence number from the shard with

    .getSequenceNumberRange().getStartingSequenceNumber();

    ... as the starting sequence number, I'm not getting all records. The same happens with AFTER_SEQUENCE_NUMBER;

  • When I use LATEST, I'm getting zero results;
  • When I use TRIM_HORIZON, which should be the sensible choice, it doesn't seem to work properly. It used to return the data, but after I added new "events" (records to the final stream) I received zero records. Mystery.

My questions are:

  1. How can I safely consume data from the stream, without having to worry about missed records?
  2. Is there an alternative to the ShardIteratorRequest?
  3. If there is, how can I just "browse" the stream and see what's inside it for debugging purposes?
  4. What am I missing with the TRIM_HORIZON method?

Thanks in advance, I'd really love to learn a bit more about data consumption from a Kinesis stream.

Yuval Herziger
  • I too am having similar issues - though for me, I get duplicate records on each iteration (using both AT_SEQUENCE_NUMBER and FROM_SEQUENCE_NUMBER), despite using the NextShardIterator value from each response. The docs are somewhat cryptic on this issue.... I'd also love to know what "untrimmed" means (w.r.t TRIM_HORIZON). – Erve1879 Oct 13 '14 at 12:44
  • For the record, I did something different in the meantime - I took an existing Scala consumer that listens to the stream continuously and just ported it back to pure Java for my purposes. Here's the Scala app, originally developed by SnowPlow https://github.com/snowplow/kinesis-example-scala-consumer – Yuval Herziger Oct 13 '14 at 14:09
  • Sadly, I'm not java-friendly.....! I just wish there were language-agnostic, clear guidelines on how to ensure idempotency and 100% "coverage" of records whilst permitting consumer restarts, crashes etc. It seems to negate the purpose of Kinesis if we have to save and check against the SequenceNumber of all previously-fetched records to ensure no duplication. I'm sure I'm missing something though....... – Erve1879 Oct 13 '14 at 14:49
  • Did you try Amazon's own libraries? https://github.com/awslabs/amazon-kinesis-connectors https://github.com/awslabs/amazon-kinesis-client These libraries (especially the connectors one) handle all the cumbersome stuff like pinpointing the checkpoint, continuing to process the shard, etc. – az3 Oct 14 '14 at 13:49
  • I actually did, last night. I just needed the time to investigate the ins and outs of Kinesis; KCL is an amazing library. I'll soon answer my own question here. It turns out it was all about the checkpoints. – Yuval Herziger Oct 15 '14 at 08:24
  • I get similar issues using the JSON API without KCL. I want to get the last record as a checkpoint. LATEST gives me an empty array. TRIM_HORIZON gives me 8 records at present. I could iterate through all the records (could be thousands) to get the last one, but that seems ridiculous. How is LATEST supposed to work? Whatever KCL is doing, it should be using the very same API, so saying "use KCL" isn't answering the question, and its checkpointing should only be based on this API and stored results. – Buzzware Mar 21 '15 at 00:15

2 Answers

I understand the confusion above, and I had the same issues, but I think I've figured it out now. Note that I am using the JSON API directly without KCL.

It seems that the API gives clients 2 basic choices of iterator when they begin consuming a stream:

A) TRIM_HORIZON: for reading PAST records, anywhere from many minutes (even hours) old up to 24 hours old. It doesn't return recently put records. Using AFTER_SEQUENCE_NUMBER on the last record seen by this iterator returns an empty array even when records have been recently PUT.

B) LATEST: for reading FUTURE records in real time (immediately after they are PUT). I was tricked by the only sentence of documentation I could find on this: "Start reading just after the most recent record in the shard, so that you always read the most recent data in the shard." You were getting an empty array because no records had been PUT since getting the iterator. If you get this type of iterator, and then PUT a record, that record will be immediately available.

Lastly, if you know the sequence number of a recently put record, you can get it immediately using AT_SEQUENCE_NUMBER, and you can get later records using AFTER_SEQUENCE_NUMBER even though they won't appear to a TRIM_HORIZON iterator.

The above does mean that if you want to read all known past records and future records in real time, you have to use a combination of A and B, with logic to cope with the records in between (the recent past). The KCL may well smooth over this.
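One way to picture the "logic to cope with the records in between" is a checkpoint: remember the last sequence number you processed and resume just after it. The sketch below is only a plain-Java model of that idea and does not call the AWS SDK. `CheckpointSketch`, `readSince` and `advance` are hypothetical names, and the shard is assumed to be an ordered list of `long` values for brevity (real Kinesis sequence numbers are large numeric strings).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of checkpointing; it does not call the AWS SDK.
final class CheckpointSketch {

    // Returns every sequence number newer than the checkpoint,
    // or everything in the shard when no checkpoint exists yet (null).
    static List<Long> readSince(List<Long> shard, Long checkpoint) {
        List<Long> batch = new ArrayList<>();
        for (Long seq : shard) {
            if (checkpoint == null || seq > checkpoint) {
                batch.add(seq);
            }
        }
        return batch;
    }

    // Moves the checkpoint to the last record of the batch;
    // an empty batch leaves the checkpoint unchanged.
    static Long advance(Long checkpoint, List<Long> batch) {
        return batch.isEmpty() ? checkpoint : batch.get(batch.size() - 1);
    }
}
```

In this model the first read with a `null` checkpoint plays the role of TRIM_HORIZON, and every later read plays the role of AFTER_SEQUENCE_NUMBER on the stored checkpoint, so a restarted consumer neither skips nor repeats records as long as the checkpoint itself is persisted.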

Buzzware

A lot of time has passed, and any Kinesis bugs that may once have existed may since have been resolved.

Offering a little visualisation:

oldest-records          <-- time -->             newest-records
|<-- TRIM_HORIZON             |<-- AT_SEQUENCE_NUMBER(n+15)   |<-- LATEST
n                             n+15                          n+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ?
                                n+15+1                        eos
                                |<--AFTER_SEQUENCE_NUMBER(n+15)

Where n is the sequence-number of the oldest record for the corresponding shard.

  • TRIM_HORIZON and LATEST ought to be self-explanatory
    • Perhaps EARLIEST would have been more intuitive than TRIM_HORIZON
    • LATEST could be considered synonymous to
      • AFTER_SEQUENCE_NUMBER for n+30
      • AT_SEQUENCE_NUMBER for eos.
  • The choice between AFTER_SEQUENCE_NUMBER and AT_SEQUENCE_NUMBER would, I imagine, be determined by whether you have or have not, respectively, already processed the record at that sequence number
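The picture above can be restated as a small model. This is only a sketch in plain Java, not the AWS SDK: `ShardModel`, `startIndex` and `readFrom` are hypothetical names, and the shard is assumed to be an in-memory list of `long` sequence numbers in arrival order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of one shard; it does not call the AWS SDK.
final class ShardModel {
    enum IteratorType { TRIM_HORIZON, AT_SEQUENCE_NUMBER, AFTER_SEQUENCE_NUMBER, LATEST }

    private final List<Long> sequenceNumbers = new ArrayList<>();

    // Appends a record; callers are assumed to put in increasing order.
    void put(long sequenceNumber) {
        sequenceNumbers.add(sequenceNumber);
    }

    // Index of the first record an iterator of the given type would return.
    // The sequenceNumber argument is ignored for TRIM_HORIZON and LATEST.
    int startIndex(IteratorType type, long sequenceNumber) {
        switch (type) {
            case TRIM_HORIZON:
                return 0;                             // oldest record still held
            case LATEST:
                return sequenceNumbers.size();        // just past the newest record
            case AT_SEQUENCE_NUMBER:
                return indexOf(sequenceNumber);       // that record itself
            case AFTER_SEQUENCE_NUMBER:
                return indexOf(sequenceNumber) + 1;   // the record right after it
            default:
                throw new IllegalArgumentException("Unknown type: " + type);
        }
    }

    // Everything from the given index to the current end of the shard.
    List<Long> readFrom(int index) {
        return new ArrayList<>(sequenceNumbers.subList(index, sequenceNumbers.size()));
    }

    private int indexOf(long sequenceNumber) {
        int index = sequenceNumbers.indexOf(sequenceNumber);
        if (index < 0) {
            throw new IllegalArgumentException("No such sequence number: " + sequenceNumber);
        }
        return index;
    }
}
```

In this model a LATEST position taken before a `put` returns nothing until a record arrives after it, and AFTER_SEQUENCE_NUMBER on the newest record lands on the same position as LATEST.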

With sound use of the respective APIs (i.e. no-PEBKAC), I'd expect TRIM_HORIZON to return everything currently available.
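One detail that may explain some of the empty results reported above: a single GetRecords call can return an empty list even though more records lie further along the shard, so a reader has to keep following the next iterator instead of stopping at the first empty response. The sketch below models that loop in plain Java without the AWS SDK. `DrainSketch`, `Page`, `Source` and `overSlots` are hypothetical names, and the `caughtUp` flag is an assumed stand-in for the real response reporting that the reader is zero milliseconds behind the latest record.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java model of a read loop; it does not call the AWS SDK.
final class DrainSketch {

    // Hypothetical stand-in for one GetRecords response.
    static final class Page {
        final List<String> records;
        final int nextIterator;
        final boolean caughtUp;

        Page(List<String> records, int nextIterator, boolean caughtUp) {
            this.records = records;
            this.nextIterator = nextIterator;
            this.caughtUp = caughtUp;
        }
    }

    // Hypothetical stand-in for the service call.
    interface Source {
        Page getRecords(int iterator);
    }

    // Follows the next iterator until the source reports it is caught up.
    // An empty page in the middle does not end the loop.
    static List<String> drain(Source source, int iterator) {
        List<String> all = new ArrayList<>();
        while (true) {
            Page page = source.getRecords(iterator);
            all.addAll(page.records);
            if (page.caughtUp) {
                return all;
            }
            iterator = page.nextIterator;
        }
    }

    // Model source: each call serves one slot, and some slots may be empty.
    static Source overSlots(List<List<String>> slots) {
        return iterator -> {
            if (iterator >= slots.size()) {
                return new Page(Collections.<String>emptyList(), iterator, true);
            }
            boolean last = iterator == slots.size() - 1;
            return new Page(slots.get(iterator), iterator + 1, last);
        };
    }
}
```

A loop that stopped at the first empty page would see only the records in front of the first gap, which from the outside looks like TRIM_HORIZON "missing" records.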

Darren Bishop