First of all, it is great that you have thought through the different approaches.
There are a few questions which you should ask before coming to a good design:
- Timelines? Spark -> ES is a breeze and is recommended if you are starting on a PoC.
- Operational bandwidth? Introducing more components will increase operational concerns. From my personal experience, making sure your Spark Streaming job is stable is itself a time-consuming job. If you add Kafka as well, you need to spend more time getting monitoring and other ops concerns right.
- Scale? If you expect the data volume to grow, a persistent message bus can help absorb back-pressure and still scale pretty well.
If I had the time and were dealing with large scale, Spark Streaming -> Kafka -> ES looks to be the best bet. This way, when your ES cluster is unstable, you still have the option of replaying from Kafka.
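To make the replay point concrete, here is a toy sketch in plain Python. It does not use the Kafka or Elasticsearch client APIs: the list `log` stands in for a persistent topic, and `FlakySink` and `drain` are hypothetical names I made up for the illustration. The idea it shows is that the consumer only advances its offset after a successful write, so a sink outage delays data instead of losing it.

```python
class FlakySink:
    """Hypothetical stand-in for an ES cluster that rejects some write attempts."""

    def __init__(self, fail_on_attempts):
        self.fail_on_attempts = set(fail_on_attempts)
        self.attempts = 0
        self.docs = {}

    def index(self, doc_id, doc):
        self.attempts += 1
        if self.attempts in self.fail_on_attempts:
            raise ConnectionError("sink unavailable")
        # Keyed by id, so writing the same record twice is harmless.
        self.docs[doc_id] = doc


def drain(log, sink, committed_offset=0, max_passes=10):
    """Write log[committed_offset:] to the sink.

    The offset advances only after a successful write, so a failed
    pass resumes (replays) from the first unwritten record.
    """
    for _ in range(max_passes):
        try:
            while committed_offset < len(log):
                sink.index(committed_offset, log[committed_offset])
                committed_offset += 1
            return committed_offset
        except ConnectionError:
            # Offset was not advanced; the next pass retries the same record.
            continue
    return committed_offset


# The "topic": records stay here no matter what happens to the sink.
log = [{"msg": f"event-{i}"} for i in range(5)]

# Assumption for the demo: the 3rd and 4th write attempts fail.
sink = FlakySink(fail_on_attempts={3, 4})
final_offset = drain(log, sink)
```

With Spark -> ES directly there is no such retained log to fall back on, so records that fail during an outage have to be recovered from the original source, if that is possible at all.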
I am a little hazy on Kafka -> HDFS -> ES, as there could be performance implications of adding a batch layer between the source and the sink. Also, honestly, I am not aware of how well Logstash works with HDFS, so I can't really comment.
Tight coupling is an oft-discussed subject. Some people argue against it, citing reusability concerns, but others argue for it, as it can sometimes create a simpler design and make the whole system easier to reason about. There is also the risk of premature optimisation :) We have had success with Spark -> ES directly at a moderate scale of data inflow. So don't discount the power of a simpler design just like that :)