Storing tweets using Spark in a multicore cluster

Asked Sep 09 '16 at 09:40

Active Sep 09 '16 at 09:40

Viewed 26 times

I want to store real-time tweets that match some filtering criteria in a MySQL database, and I want to understand which approach is better given that I have a single machine with 16 CPUs. Since the streaming API suits my case best, I could easily build a Java application using the Twitter4J library; in that case, filtering and storing can be done with multithreaded programming. On the other hand, I just discovered Spark, which can do the same in a few lines of code, but the bottleneck of having only one memory remains.
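A minimal sketch of the multithreaded option, to make the comparison concrete. It deliberately avoids the Twitter client library and JDBC so that it runs on its own: the `incoming` list stands in for the streaming listener, the `stored` list stands in for the MySQL table, and the names `TweetPipeline`, `matches`, `run`, and `POISON` are hypothetical, as is the keyword-based filter criterion.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TweetPipeline {
    // Hypothetical sentinel value that tells a worker thread to stop.
    private static final String POISON = "\u0000STOP";

    // Hypothetical filter criterion: keep tweets that mention a keyword.
    static boolean matches(String tweet, String keyword) {
        return tweet.toLowerCase().contains(keyword.toLowerCase());
    }

    // Starts `workers` threads that take tweets from a shared bounded queue,
    // filter them, and "store" the matches. The returned list stands in for
    // the database; a real version would issue batched inserts instead.
    static List<String> run(List<String> incoming, String keyword, int workers)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
        List<String> stored = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String tweet = queue.take();
                        if (tweet.equals(POISON)) {
                            break; // one sentinel per worker ends its loop
                        }
                        if (matches(tweet, keyword)) {
                            stored.add(tweet);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer side: in the real application the streaming callback
        // would put each incoming tweet on the queue.
        for (String tweet : incoming) {
            queue.put(tweet);
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON);
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return stored;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> sample =
                List.of("Learning Spark today", "Lunch time", "spark vs threads");
        List<String> kept = run(sample, "spark", 16);
        System.out.println(kept.size() + " tweets kept");
    }
}
```

The bounded queue applies back-pressure if the workers fall behind the stream, and the order of the stored tweets is not guaranteed because several workers write concurrently.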

I want to understand whether Spark would be a real improvement, given that it's pretty difficult to reach the Twitter rate limit and I can't take advantage of a distributed cluster.

Thanks for helping.

asked Sep 09 '16 at 09:40

LuigiDB


I don't get what you mean by Spark leaving "the bottleneck of having only one memory". Spark can run on a cluster as well. – Ramkumar Venkataraman Sep 09 '16 at 10:18
I know, but I only have access to one machine with 16 cores; I can't run it on a cluster. – LuigiDB Sep 09 '16 at 14:25

Got it, the wording of your question confused me a bit! If it is a single box, I am not sure you would want the overhead of running Spark, as Spark really shines when you give it a cluster and it can help prune massive data. On a single machine, you could simply try a multi-threaded model (give Akka a try if you like Scala and the Erlang-style actor model). – Ramkumar Venkataraman Sep 09 '16 at 16:30
Thanks for the information. Very helpful. – LuigiDB Sep 10 '16 at 07:03


0 Answers