
I am working on a streaming application using Spark Streaming, and I want to index my data into Elasticsearch.

My analysis: I can push data directly from Spark to Elasticsearch, but I feel that in this case the two components will be tightly coupled.

If this were a Spark Core (batch) job, we could write the output to HDFS and use Logstash to read the data from HDFS and push it to Elasticsearch.

My proposed solution: push the data from Spark Streaming to Kafka, read it from Kafka with Logstash, and push it to ES.
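Roughly, the Spark Streaming -> Kafka leg would look something like the sketch below (assuming the job already produces a DStream[String] of JSON events; the broker address and topic name are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// events is assumed to be the DStream[String] of JSON documents built by the job.
def publishToKafka(events: DStream[String], brokers: String, topic: String): Unit = {
  events.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // A producer is created per partition here for simplicity; in practice
      // you would reuse a pooled or lazily-initialised producer per executor.
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      partition.foreach(event => producer.send(new ProducerRecord[String, String](topic, event)))
      producer.close()
    }
  }
}

// e.g. publishToKafka(events, "kafka:9092", "es-ingest")
```

Logstash would then consume from the same topic with its Kafka input and write to ES with its Elasticsearch output.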

Please suggest.

kushagra mittal
    What do you mean by tight coupling here? By doing the same task in two steps you may be hit with much slower performance. I would say it is use-case dependent; it would help if you added more description of your problem. – Amit Kumar Sep 29 '16 at 17:08
    Even if the ES cluster is down for some time, Kafka can hold the data, and when the ES cluster is stable again the data can be fetched from Kafka. – kushagra mittal Sep 30 '16 at 18:32
  • It is a good idea to buffer data in Kafka before doing the actual processing. You should tune the log retention value so Kafka holds the data for the duration of processing. I would also say this architecture, i.e. Spark -> Kafka first and then doing all the processing, is much better and more stable. You can actually cut Spark checkpointing out of the picture and still have a fault-tolerant system if you maintain the Kafka read offsets and processed offsets. – Amit Kumar Oct 01 '16 at 06:48
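As a rough illustration of the "maintain Kafka read offsets" idea from the last comment, if Spark is the Kafka consumer, a minimal sketch with the spark-streaming-kafka-0-10 integration could look like the following (it assumes an existing StreamingContext ssc; topic, group and broker names are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-es-indexer",
  "enable.auto.commit" -> (false: java.lang.Boolean) // commit only after processing succeeds
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  // Remember which offsets this batch covers before processing it.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process / index the batch here ...
  // Commit the offsets back to Kafka only once the batch has been handled,
  // so a failed batch is re-read on restart without relying on Spark checkpointing.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```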

1 Answer


First of all, it is great that you have thought through the different approaches.

There are a few questions you should ask before arriving at a good design:

  1. Timelines? Spark -> ES is a breeze and is recommended if you are starting on a PoC.
  2. Operational bandwidth? Introducing more components will increase operational concerns. From my personal experience, making sure your Spark Streaming job is stable is itself a time-consuming job. If you add Kafka as well, you will need to spend more time getting the monitoring and other ops concerns right.
  3. Scale? If you expect the volume to grow, having a persistent message bus can help absorb back-pressure and still scale pretty well.

If I had the time and were dealing with large scale, Spark Streaming -> Kafka -> ES looks like the best bet. That way, when your ES cluster is unstable, you still have the option of replaying from Kafka.

I am a little hazy on Kafka -> HDFS -> ES, as there could be performance implications of adding a batch layer between the source and the sink. Honestly, I am also not aware of how well Logstash works with HDFS, so I can't really comment.

Tight coupling is an oft-discussed subject. Some people argue against it, citing reusability concerns, but others argue for it, since it can sometimes yield a simpler design and make the whole system easier to reason about. There is also the matter of premature optimisation :) We have had success with Spark -> ES directly at a moderate scale of data inflow, so don't discount the power of a simpler design just like that :)
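For reference, a minimal direct Spark Streaming -> ES sketch using the elasticsearch-hadoop (elasticsearch-spark) connector; the ES node address, the stand-in socket source, and the events/doc index/type name are all placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark.rdd.EsSpark // from the elasticsearch-hadoop connector

val conf = new SparkConf()
  .setAppName("stream-to-es")
  .set("es.nodes", "es-host:9200")      // placeholder ES address
  .set("es.index.auto.create", "true")
val ssc = new StreamingContext(conf, Seconds(10))

// Stand-in source; replace with the job's real input stream.
val events = ssc.socketTextStream("localhost", 9999)

events.foreachRDD { rdd =>
  // Each record is assumed to already be a JSON document string.
  EsSpark.saveJsonToEs(rdd, "events/doc") // "index/type" resource name is a placeholder
}

ssc.start()
ssc.awaitTermination()
```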

Bakudan