I have to design a setup that reads incoming streaming data from Twitter. I have decided to use Apache Kafka with Spark Streaming for real-time processing, and the analytics must be shown in a dashboard. Being a newbie in this domain, I assume a maximum data rate of 10 Mb/sec. I plan to use one machine for Kafka, with 12 cores and 16 GB of memory (ZooKeeper will run on the same machine). Now I am confused about Spark: it only has to perform the streaming analysis, after which the analyzed output is pushed to a DB and the dashboard. My questions (see the sketch after this list for what I have in mind):
- Should I run Spark on a Hadoop cluster (HDFS) or on the local file system?
- Can Spark's standalone mode fulfill my requirements?
- Is my approach appropriate, or what would be best in this case?
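To make it concrete, here is a minimal sketch of the Spark Streaming side I have in mind, using the spark-streaming-kafka-0-10 direct stream. The topic name `tweets`, the broker address `kafka-host:9092`, the checkpoint path, and the `local[*]` master are just placeholders for my setup, and the hashtag count stands in for the real analysis:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object TweetAnalytics {
  def main(args: Array[String]): Unit = {
    // local[*] for a single-box prototype; would become the standalone master URL later
    val conf = new SparkConf().setMaster("local[*]").setAppName("TweetAnalytics")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/tweet-checkpoint") // placeholder checkpoint directory

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-host:9092", // placeholder broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "tweet-analytics",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream from the Kafka topic that the Twitter ingester writes to
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("tweets"), kafkaParams))

    // Toy analysis: hashtag counts per batch; the real job would
    // write results to the DB that backs the dashboard
    stream.map(_.value)
      .flatMap(_.split("\\s+"))
      .filter(_.startsWith("#"))
      .map((_, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

My thinking is that nothing in this job reads or writes files except the checkpoint directory, which is why I am unsure whether HDFS is needed at all.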