I am using Kafka as a pipeline to buffer analytics data before it is flushed to S3 and, ultimately, to Redshift. I am trying to work out the best architecture for storing data in Kafka so that it can easily be flushed to a data warehouse.
The issue is that I get data from three separate page events:
- When the page is requested.
- When the page is loaded.
- When the page is unloaded.
These events fire at different times: usually all within a few seconds of each other, but sometimes minutes or even hours apart.
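To make the setup concrete, here is a minimal sketch of how the events could be produced (the topic name, payload format, and configuration are illustrative, not my actual code). Keying every event by pageid at least guarantees that all three events for one view hash to the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String pageId = "abcd-123456-abcde"; // same key for all three events

            // Because the key is identical, the default partitioner hashes all
            // three events to the same partition, preserving their relative order.
            producer.send(new ProducerRecord<>("page-events", pageId,
                    "event=REQUESTED site=yahoo.com ts=2015-03-09T15:15:15"));
            producer.send(new ProducerRecord<>("page-events", pageId,
                    "event=LOADED ts=2015-03-09T15:15:17"));
            producer.send(new ProducerRecord<>("page-events", pageId,
                    "event=UNLOADED ts=2015-03-09T15:23:09"));
        }
    }
}
```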
Eventually, I want to store a single record per page view in my data warehouse. For example, a single log entry as follows:

    pageid=abcd-123456-abcde, site='yahoo.com', created='2015-03-09 15:15:15', loaded='2015-03-09 15:15:17', unloaded='2015-03-09 15:23:09'
How should I partition Kafka so that this can happen? I am struggling to find a partitioning scheme that does not require a separate data store, such as Redis, to temporarily hold partial records while merging the CREATE (initial page view) and UPDATE (subsequent load/unload) events.
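For reference, this is the kind of stateful merging consumer I am trying to avoid: it buffers partial page views per pageid (in memory here, but it could just as well be Redis) until the unload event arrives. The topic name, event format, and flush logic are placeholders:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PageViewMerger {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "pageview-merger");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Partial page views keyed by pageid -- this is exactly the
        // intermediate state I would rather not have to maintain.
        Map<String, StringBuilder> partials = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-events"));
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    partials.computeIfAbsent(rec.key(), k -> new StringBuilder())
                            .append(rec.value()).append(' ');
                    // UNLOADED is the final event for a view, so flush on it.
                    if (rec.value().contains("UNLOADED")) {
                        String merged = partials.remove(rec.key()).toString();
                        System.out.println("flush to S3: " + merged);
                    }
                }
            }
        }
    }
}
```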