4

For research purposes I'm studying an architecture for real-time (and also offline) data analytics and semantic annotation. I've attached a basic schema: I have some sensors linked to a Raspberry Pi 3. I suppose I can handle this link with an MQTT broker like Mosquitto. However, I want to collect data on the Raspberry Pi, do some processing, and forward it to a cluster of commodity hardware to perform real-time reasoning with Spark or Storm (any hint about which?). These data then have to be stored in a NoSQL db (probably Cassandra or HBase) accessible to a Hadoop cluster, which executes batch reasoning and semantic data enrichment on them and re-stores them in the same db. Clients can then query the system to extract useful information.
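
For concreteness, here is a rough sketch of the Pi-side publisher I have in mind, using the Eclipse Paho Java client against a local Mosquitto broker (the broker address, topic name, and `readSensor()` helper are placeholders, not my actual sensor code):

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class SensorPublisher {
    public static void main(String[] args) throws MqttException, InterruptedException {
        // Local Mosquitto broker on the Pi (placeholder address and client id).
        MqttClient client = new MqttClient("tcp://localhost:1883", "rpi-publisher");
        client.connect();
        while (true) {
            // readSensor() is a placeholder for the real sensor driver call.
            MqttMessage message = new MqttMessage(readSensor());
            message.setQos(1); // at-least-once delivery
            client.publish("sensors/wearable/1", message); // placeholder topic
            Thread.sleep(50); // ~20 Hz, matching the sensors' sampling rate
        }
    }

    private static byte[] readSensor() {
        return "42.0".getBytes(); // placeholder sample
    }
}
```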

Which technology should I use in the red block? My idea is MQTT, but maybe Kafka could fit my purposes better?

[Image: basic schema]

Akinn
  • It depends on the volume of your data and the type of use case. Spark Streaming has flawless integration with sources like Flume and Kafka; you can read more [here](http://spark.apache.org/docs/latest/streaming-programmingguide.html#advanced-sources). To start you can try Raspberry Pi -> Kafka -> Spark Streaming. – Rahul Sharma Apr 27 '17 at 17:56
  • Your URL doesn't work. The real potential of Kafka is unclear to me; some say it is useful because it can handle a massive amount of data. So does Kafka also provide storage? Could I avoid a NoSQL db that way? The scenario involves some wearable sensors (6 or 7) transmitting continuously at 20 Hz. – Akinn Apr 27 '17 at 19:32
  • Try this: http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources – Rahul Sharma Apr 27 '17 at 23:11
  • 1
    I have used Kafka as the messaging system for my streaming application. Kafka was the intermediate layer between GoldenGate and the Spark Streaming application. Kafka can handle massive data loads (millions of records per second). I would suggest using a system like Spark Streaming to consume messages from the Kafka topic and then store them in a NoSQL DB (HBase, Cassandra). – Rahul Sharma Apr 28 '17 at 00:03
  • Thank you for your help. – Akinn Apr 28 '17 at 09:56

3 Answers

5

Spark vs Storm

Spark is the clear winner right now between Spark and Storm. At least one reason is that Spark is much more capable of handling large data volumes in a performant way; Storm struggles with processing large volumes of data at high velocity. For the most part the big data community has embraced Spark, at least for now. Other technologies like Apex and Kafka Streams are also making waves in the stream-processing space.

Producing to Kafka from the Raspberry Pi

If you choose the Kafka path, keep in mind that the Java client for Kafka is, in my experience, by far the most reliable implementation. However, I would do a proof of concept to ensure that there won't be any memory issues, since the Raspberry Pi doesn't have a lot of RAM.
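
As a starting point for that proof of concept, a minimal producer could look like the sketch below (broker address and topic name are placeholders; `buffer.memory` is deliberately lowered from the 32 MB default to be gentler on the Pi's limited RAM):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PiToKafka {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "cluster-node:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Shrink the send buffer from the 32 MB default to spare the Pi's RAM.
        props.put("buffer.memory", 8L * 1024 * 1024);
        props.put("batch.size", 16384);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Placeholder topic; keying by sensor id preserves per-sensor ordering.
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-1", "42.0"));
        }
    }
}
```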

Kafka At the Heart

Keeping Kafka in your RED box will give you a very flexible architecture going forward, because any process (Storm, Spark, Apex, Kafka Streams, a plain Kafka consumer) can connect to Kafka and quickly read the data. Having Kafka at the heart of your architecture provides a "distribution" point for all your data, since it's very fast but also lets data be retained there long-term. Keep in mind that you can't query Kafka, so using it requires you to simply read the messages as fast as you can, to populate other datastores or to perform streaming calculations.
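
For illustration, a Spark Streaming job reading that data back out of Kafka could look roughly like this sketch against the spark-streaming-kafka-0-10 integration (broker address, group id, and topic name are placeholders):

```java
import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;

public class SensorStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("sensor-stream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "cluster-node:9092"); // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "sensor-analytics"); // placeholder group id

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("sensor-readings"), kafkaParams));

        // Placeholder processing: count records per micro-batch.
        stream.count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```

From there you would replace the placeholder `count().print()` with your real streaming computation and a write to Cassandra or HBase.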

user2122031
  • Thank you. So in your opinion putting Kafka in the red box and using Spark is the best solution. I don't understand the memory issue on the Raspberry Pi: I use it only to collect data from sensors and (I hope) do real-time anomaly detection, then publish to Kafka, so I don't think I'll saturate its RAM. Another question: to collect data from sensors on the Raspberry Pi, should I use MQTT? – Akinn Apr 28 '17 at 09:56
  • 1
    If you choose to use the Kafka Java client to produce messages from the Raspberry Pi over to the Kafka broker, you should make sure that the Raspberry Pi has sufficient memory to handle a Java client and your specific load. I can't answer the question about MQTT, because I don't have any experience with it. – user2122031 Apr 30 '17 at 06:44
1

You can evaluate Apache Apex for your use case, as most of your requirements could be satisfied with it. Apache Apex also comes with the Apache Malhar project, which serves as an operator library for Apache Apex. Since you are deciding to use the MQTT protocol, Apache Malhar has prebuilt AbstractMQTTInputOperator/AbstractMQTTOutputOperator classes which you can extend, and which can serve as the input broker. Malhar also comes with various operators which can connect to different NoSQL DBs as well as dump to HDFS. Apache Apex may not require Kafka in your proposed architecture. As you want to push data to Hadoop, being Hadoop-native, Apex can actually reduce your deployment effort significantly.

Another interesting project I have come across is Apache Edgent, which can help you perform some real-time analytics on edge devices.

PS: I am a contributor to the Apache Apex/Malhar project.

Vikram Patil
0

What about using Apache NiFi?

There is an article describing a use case very similar to yours. To output your data to HDFS you can use the PutHDFS or PutHiveQL processors, then use Hive LLAP to provide access to the data for your clients.

Using Apache NiFi you can deliver a working prototype very quickly with zero (or almost zero) development. You will probably spend more time on performance tuning, deployment, and customization during the productization step of your system, but that part is mandatory for any open-source tool.