This is a question about building a data-analytics pipeline in a kappa architecture. The question is conceptual.

Assume you have a system that emits events. For simplicity, let's assume there are just two events, CREATED and DELETED, which indicate that an item gets created or deleted at a given point in time. Those events contain an id and a timestamp. An item gets created and is deleted again after a certain time. Assume the application guarantees the correct order of events, prevents duplicate events, and never emits two events with the exact same timestamp.
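
For concreteness, emitting such events could look roughly like the following sketch, assuming JSON-encoded messages and the kafka-python client (the topic names match the diagram below; the broker address and the serialization format are assumptions):

import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit(event_type, item_id):
    # Publish a CREATED or DELETED event carrying an id and a timestamp.
    topic = "item_created" if event_type == "CREATED" else "item_deleted"
    event = {
        "id": item_id,
        "type": event_type,
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
    }
    producer.send(topic, value=event)

emit("CREATED", "item-42")
producer.flush()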

The metrics that should be available in data analytics are:

  • Current amount of items
  • Amount of items as graph over the last week
  • Amount of items per day as historical data

A proposed architecture for such a scenario would look like this:

  • Emit events to Kafka
  • Use Kafka as short-term storage
  • Use Superset to display live data directly from Kafka via Presto
  • Use Spark to consume the Kafka events and write aggregations to an analytics Postgres DB

Schematically it would look like this:

Application
   |
   | (publish events)
   ↓
Kafka     [topics: item_created, item_deleted]
|   ↑
|   | (query short-time)
|   |
|   Presto  ←-----------┐
|                       |
| (read event stream)   |
↓                       | 
Spark                   |
  |                     |
  | (update metrics)    |
  ↓                     |
 Postgres               |
     ↑                  |
     | (query)          | (query)
     |                  |
     └-----Superset-----┘

This data-analytics setup should be used to visualise both historical and live data. It is important to note that the application may already have a database with historical data. To make this work, when the data analytics pipeline starts up, that database is first read and events are emitted to Kafka to transfer the historical data. Live data can arrive at any time and will also be processed.
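
As a hedged sketch of that startup backfill, assuming the application's history sits in a Postgres table called items with created_at/deleted_at columns (table, column, and connection names are made up for illustration): a server-side cursor streams the history in chunks, and a small pause per chunk keeps the replay rate under control.

import json
import time

import psycopg2
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def to_millis(dt):
    return int(dt.timestamp() * 1000)

conn = psycopg2.connect("dbname=app user=app host=localhost")

# A named (server-side) cursor streams rows in chunks instead of loading the
# whole history into memory. Global ordering across items is simplified here;
# per-item ordering (CREATED before DELETED) is preserved.
with conn, conn.cursor(name="backfill") as cur:
    cur.itersize = 1000
    cur.execute("SELECT id, created_at, deleted_at FROM items ORDER BY created_at")
    for i, (item_id, created_at, deleted_at) in enumerate(cur, start=1):
        producer.send("item_created",
                      {"id": item_id, "type": "CREATED", "timestamp": to_millis(created_at)})
        if deleted_at is not None:
            producer.send("item_deleted",
                          {"id": item_id, "type": "DELETED", "timestamp": to_millis(deleted_at)})
        if i % 1000 == 0:
            time.sleep(0.1)  # crude throttle so the brokers are not flooded during startup

producer.flush()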

An idea to make the live metric work is the following: with the help of Presto, the events can easily be aggregated straight from Kafka's short-term retention.
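
For illustration, the live count could be computed with a query like the one below against Presto's Kafka connector, here issued through the presto-python-client; the catalog and table names are assumptions that depend entirely on the connector configuration.

import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator",
    port=8080,
    user="analytics",
    catalog="kafka",    # Presto Kafka connector catalog (assumed name)
    schema="default",
)
cur = conn.cursor()

# Current amount of items = created events minus deleted events.
# This only reflects events still inside Kafka's retention window.
cur.execute("""
    SELECT
      (SELECT count(*) FROM item_created) -
      (SELECT count(*) FROM item_deleted) AS current_items
""")
print(cur.fetchone()[0])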

For historical data the idea could be to create a table Items with the following schema:

--------------------------------------------
| Items                                    |
--------------------------------------------
| timestamp               | numberOfItems  |
--------------------------------------------
| 2021-11-16 09:00:00.000 | 0              |
| 2021-11-17 09:00:00.000 | 20             |
| 2021-11-18 09:00:00.000 | 5              |
| 2021-11-19 09:00:00.000 | 7              |
| 2021-11-20 09:00:00.000 | 14             |
--------------------------------------------

The idea would be that the Spark program (which would of course need to parse the schema of the topic messages) inspects the event's timestamp, determines which time window the event falls into (in this case, which day), and updates the number by +1 for a CREATED event or -1 for a DELETED event.
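
A minimal PySpark Structured Streaming sketch of that idea could look as follows. The topic names come from the diagram above; the JSON layout (id, type, epoch-millisecond timestamp) is an assumption, and the console sink merely stands in for the Postgres writer sketched after the next paragraph.

# Requires the spark-sql-kafka-0-10 package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("item-metrics").getOrCreate()

# Assumed message layout: {"id": "...", "type": "CREATED" | "DELETED", "timestamp": <epoch millis>}
event_schema = StructType([
    StructField("id", StringType()),
    StructField("type", StringType()),
    StructField("timestamp", LongType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "item_created,item_deleted")
    .option("startingOffsets", "earliest")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# +1 for a CREATED event, -1 for a DELETED event, bucketed into one-day windows.
deltas = (
    events
    .withColumn("event_time", (F.col("timestamp") / 1000).cast("timestamp"))
    .withColumn("delta", F.when(F.col("type") == "CREATED", 1).otherwise(-1))
    .groupBy(F.window("event_time", "1 day").alias("day"))
    .agg(F.sum("delta").alias("delta"))
    .select(F.col("day.start").alias("day"), "delta")
)

# Console sink for illustration only; in practice this would be a foreachBatch
# writer into Postgres (see the sketch after the next paragraph).
query = (
    deltas.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/item-metrics-checkpoint")
    .start()
)
query.awaitTermination()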

The question I have is whether this is a reasonable interpretation of the problem in a kappa architecture. On startup it would mean a lot of reads and writes to the analytics database. There will be multiple Spark workers updating the analytics database in parallel, so the queries must be written as atomic operations rather than read-then-write-back, because the value might have been altered in the meantime by another Spark node. What could be done to make this process efficient? How could Kafka be prevented from being flooded during the startup process?
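
One way to keep the updates atomic is to push the increment into the database itself, e.g. with an INSERT ... ON CONFLICT upsert in Postgres. A sketch of a foreachBatch writer for the streaming query above (column names are lower-cased here, and the connection string is a placeholder):

import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS items (
    day             timestamp PRIMARY KEY,
    number_of_items bigint NOT NULL DEFAULT 0
)
"""

# The database applies the delta in place, so concurrent writers never read a
# stale value and write it back.
UPSERT = """
INSERT INTO items (day, number_of_items)
VALUES (%s, %s)
ON CONFLICT (day)
DO UPDATE SET number_of_items = items.number_of_items + EXCLUDED.number_of_items
"""

def upsert_batch(batch_df, epoch_id):
    # foreachBatch sink: merge one micro-batch of (day, delta) rows into Postgres.
    # Note this is at-least-once; a replayed micro-batch would apply its delta
    # twice unless epoch_id is recorded and checked.
    rows = [(row["day"], row["delta"]) for row in batch_df.collect()]
    if not rows:
        return
    with psycopg2.connect("dbname=analytics user=analytics host=localhost") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
            cur.executemany(UPSERT, rows)

Hooked into the streaming sketch by replacing the console sink with .foreachBatch(upsert_batch), every increment becomes a single atomic statement, so parallel Spark tasks and the startup backfill do not overwrite each other's updates.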

Is this an intended use case for Spark? What would be a good alternative for this problem?

In terms of data throughput, assume around 1,000-10,000 of these events per day (roughly 0.01-0.1 events per second on average).

Update:

  • Apparently Spark is not intended to be used like this, as can be seen from this issue.
Sjoerd222888
1 Answer

Apparently spark is not intended to be used like this

You don't need Spark, or at least, not completely.

Kafka Streams can be used to move data between various Kafka topics.

Kafka Connect can be used to insert/upsert into Postgres via JDBC Connector.
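
For illustration, registering such a sink through the Kafka Connect REST API could look roughly like this. The connector name, topic, and connection details are placeholders, and note that the JDBC sink upserts whole rows keyed by the record key, so the per-day counts would need to be pre-aggregated (e.g. by the Kafka Streams job) before they land on the sink topic.

import json

import requests

# Hypothetical connector definition; adjust topic, table, and credentials.
connector = {
    "name": "items-postgres-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "connection.url": "jdbc:postgresql://postgres:5432/analytics",
        "connection.user": "analytics",
        "connection.password": "secret",
        "topics": "items_per_day",      # pre-aggregated topic, not the raw event topics
        "insert.mode": "upsert",
        "pk.mode": "record_key",
        "pk.fields": "day",
        "auto.create": "true",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()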

Also, you could use Apache Pinot for indexed real-time and batch/historical analytics on Kafka data, rather than having Presto consume and parse the raw topics (or needing a separate Postgres database just for analytical purposes).

assume around 1,000-10,000 of these events per day

Should be fine. I've worked with systems that handled millions of events, but those were mostly written to Hadoop or S3 rather than directly into a database, and you could also have Presto query that storage.

OneCricketeer