I am using Kafka as a pipeline to buffer analytics data before it is flushed to S3 and, ultimately, to Redshift. I am trying to work out the best architecture for storing data in Kafka so that it can easily be flushed to a data warehouse.
The issue is that I get data from three separate page events:
- When the page is requested.
- When the page is loaded.
- When the page is unloaded.
These events fire at different times: usually all within a few seconds of each other, but sometimes minutes or even hours apart.
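To make the setup concrete, here is a minimal sketch of how the events could be produced (the topic name, payload format, and configuration are illustrative, not my actual code). Keying every event by pageid at least guarantees that all three events for one view hash to the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String pageId = "abcd-123456-abcde"; // same key for all three events

            // Because the key is identical, the default partitioner hashes all
            // three events to the same partition, preserving their relative order.
            producer.send(new ProducerRecord<>("page-events", pageId,
                    "event=REQUESTED site=yahoo.com ts=2015-03-09T15:15:15"));
            producer.send(new ProducerRecord<>("page-events", pageId,
                    "event=LOADED ts=2015-03-09T15:15:17"));
            producer.send(new ProducerRecord<>("page-events", pageId,
                    "event=UNLOADED ts=2015-03-09T15:23:09"));
        }
    }
}
```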
Eventually, I want to store a single record per page view in my data warehouse. For example, a single log entry as follows:

    pageid=abcd-123456-abcde, site='yahoo.com', created='2015-03-09 15:15:15', loaded='2015-03-09 15:15:17', unloaded='2015-03-09 15:23:09'
How should I partition Kafka so that this can happen? I am struggling to find a partitioning scheme that does not require a separate data store, such as Redis, to temporarily hold partial records while merging the CREATE (initial page view) and UPDATE (subsequent load/unload) events.
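For reference, this is the kind of stateful merging consumer I am trying to avoid: it buffers partial page views per pageid (in memory here, but it could just as well be Redis) until the unload event arrives. The topic name, event format, and flush logic are placeholders:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PageViewMerger {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "pageview-merger");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Partial page views keyed by pageid -- this is exactly the
        // intermediate state I would rather not have to maintain.
        Map<String, StringBuilder> partials = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-events"));
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    partials.computeIfAbsent(rec.key(), k -> new StringBuilder())
                            .append(rec.value()).append(' ');
                    // UNLOADED is the final event for a view, so flush on it.
                    if (rec.value().contains("UNLOADED")) {
                        String merged = partials.remove(rec.key()).toString();
                        System.out.println("flush to S3: " + merged);
                    }
                }
            }
        }
    }
}
```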