I want to store real-time tweets that match some filtering criteria in a MySQL database, and I want to understand which approach is better given that I have a single 16-CPU machine. Since the streaming API is the better fit for my case, I could easily build a Java application using the Twitter4J library; filtering and storing could then be done with multithreading. On the other hand, I just discovered Spark, which can do the same in a few lines of code, but the bottleneck of having only one machine's memory remains.

I want to understand whether Spark would be a real improvement, given that it is pretty difficult to reach the Twitter rate limit and I can't take advantage of a distributed cluster.
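For reference, the multithreaded variant could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `TweetPipeline`, the batch size of 100, and the in-memory `stored` list are all hypothetical. In a real application the feeding loop would be replaced by the streaming listener callback (for example Twitter4J's `StatusListener.onStatus`), and `flush` would issue a JDBC batch insert into MySQL instead of appending to a list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

public class TweetPipeline {
    // Sentinel telling a worker to stop; compared by identity, never by value.
    private static final String POISON = new String("__STOP__");
    // Hypothetical batch size; tune it against the real database.
    private static final int BATCH_SIZE = 100;

    /** Filters tweets on `workers` threads and returns the ones that were "stored". */
    public static List<String> run(List<String> incoming, Predicate<String> filter, int workers)
            throws InterruptedException {
        // Bounded queue: the producer blocks when the workers fall behind.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
        List<String> stored = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                List<String> batch = new ArrayList<>();
                try {
                    while (true) {
                        String tweet = queue.take();
                        if (tweet == POISON) break;
                        if (filter.test(tweet)) batch.add(tweet);
                        if (batch.size() >= BATCH_SIZE) flush(batch, stored);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    flush(batch, stored); // write whatever is left in the last batch
                }
            });
        }

        // Stand-in for the streaming callback that would enqueue each tweet.
        for (String tweet : incoming) queue.put(tweet);
        // One stop signal per worker.
        for (int i = 0; i < workers; i++) queue.put(POISON);

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return stored;
    }

    // Placeholder for a JDBC batch insert (PreparedStatement.addBatch / executeBatch).
    private static void flush(List<String> batch, List<String> stored) {
        stored.addAll(batch);
        batch.clear();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> sample = List.of("spark is fast", "hello world", "Spark vs threads");
        List<String> kept = run(sample, t -> t.toLowerCase().contains("spark"), 4);
        System.out.println(kept.size() + " tweets kept"); // prints "2 tweets kept"
    }
}
```

The bounded queue is the main design choice here: if MySQL becomes the slow part, the producer blocks instead of letting tweets pile up in memory. On a 16-core machine the number of workers would probably be chosen by measuring insert throughput rather than set equal to the core count.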

Thanks for helping.

LuigiDB
  • I don't get what you mean by "Spark that with few line permit to do the same but remain the bottleneck of having only one memory". Spark can run on a cluster as well. – Ramkumar Venkataraman Sep 09 '16 at 10:18
  • I know, but I only have access to one machine with 16 cores; I can't run it on a cluster. – LuigiDB Sep 09 '16 at 14:25
  • Got it, the wording of your question confused me a bit! If it is a single box, I am not sure you would want the overhead of running Spark, as Spark really shines when you give it a cluster and it can help process massive data. On a single machine, you could simply try a multi-threaded model (give Akka a try if you like Scala and the Erlang-style actor model). – Ramkumar Venkataraman Sep 09 '16 at 16:30
  • Thanks for the information. Very helpful. – LuigiDB Sep 10 '16 at 07:03

0 Answers