
Could you please tell me the difference between Apache Spark and Akka? I know that both frameworks are meant to program distributed and parallel computations, yet I don't see the link or the difference between them.

Moreover, I would like to know the use cases suited to each of them.

user4157124

3 Answers


Apache Spark is actually built on Akka.

Akka is a general-purpose framework for creating reactive, distributed, parallel and resilient concurrent applications in Scala or Java. Akka uses the Actor model to hide all the thread-related code and gives you really simple and helpful interfaces to easily implement a scalable and fault-tolerant system. A good example for Akka is a real-time application that consumes and processes data coming from mobile phones and sends it to some kind of storage.
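
For a flavor of what that looks like, here is a minimal sketch using the classic Akka actor API; the DeviceReading message, the actor name, and the logging step are all made up for illustration:

    import akka.actor.{Actor, ActorSystem, Props}

    // Hypothetical message: a reading reported by a mobile device.
    case class DeviceReading(deviceId: String, value: Double)

    // The actor reacts to each message; Akka handles all threading for us.
    class ReadingHandler extends Actor {
      def receive: Receive = {
        case DeviceReading(id, value) =>
          // A real system would forward this to storage; we just log it.
          println(s"device $id reported $value")
      }
    }

    object Main extends App {
      val system = ActorSystem("readings")
      val handler = system.actorOf(Props(new ReadingHandler), "handler")

      // "!" (tell) is fire-and-forget: the call returns immediately.
      handler ! DeviceReading("phone-42", 36.6)

      system.terminate()
    }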

Apache Spark (not Spark Streaming) is a framework to process batch data using a generalized version of the map-reduce algorithm. A good example for Apache Spark is calculating some metrics over stored data to get better insight into it. The data gets loaded and processed on demand.
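
As a sketch of that kind of batch metric, assuming a placeholder file events.txt containing one numeric value per line:

    import org.apache.spark.{SparkConf, SparkContext}

    object MeanMetric extends App {
      val conf = new SparkConf().setAppName("metrics").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // "events.txt" is a placeholder path to some stored data.
      val values = sc.textFile("events.txt").map(_.toDouble)

      // A generalized map-reduce: map each line to a number, reduce to a sum.
      val mean = values.reduce(_ + _) / values.count()
      println(s"mean = $mean")

      sc.stop()
    }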

Apache Spark Streaming performs similar actions and functions on small batches of data in near real time, the same way you would if the data were already stored.
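
A minimal Spark Streaming sketch of the same idea, assuming a hypothetical socket source on localhost:9999 that emits one number per line:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingCount extends App {
      val conf = new SparkConf().setAppName("stream").setMaster("local[2]")

      // Incoming data is grouped into small batches, here every 10 seconds.
      val ssc = new StreamingContext(conf, Seconds(10))

      // Hypothetical source: lines of text arriving on a local socket.
      val lines = ssc.socketTextStream("localhost", 9999)

      // The same operations you would run on stored data, applied per batch.
      lines.map(_.toDouble).count().print()

      ssc.start()
      ssc.awaitTermination()
    }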

UPDATE APRIL 2016

As of Apache Spark 1.6.0, Spark no longer relies on Akka for communication between nodes. Thanks to @EugeneMi for the comment.

hveiga
  • By reactive I meant your application will be event-driven and it will *react* to events. In the case of Akka, these events are sent through messages across the actors. By resilient I meant your application will tolerate failures and will be able to recover from them. Akka follows the 'let it crash' philosophy. You can read more here: http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance.html – hveiga Apr 13 '15 at 15:01
  • What about Akka Streams? Is it a competitor to Spark Streaming? – Jas Jun 23 '15 at 10:54
  • I believe that as of Spark 1.6, Spark no longer uses Akka - Akka was replaced by Netty. Regardless, Spark used Akka only for communicating between nodes, not processing. – EugeneMi Apr 05 '16 at 22:18
  • Hi @EugeneMi, you are right. I will update the answer accordingly. – hveiga Apr 06 '16 at 14:30
  • @hveiga, just a comment about akka, which i copied from akka documentation: "Akka is a toolkit, not a framework: you integrate it into your build like any other library without having to follow a particular source code layout" – soMuchToLearnAndShare Oct 31 '16 at 03:44
  • I think this is a good answer, but it could be expanded a bit: All this is not as much about choosing Akka *vs* Spark, actually, once you know the above (answer). Rather, the two are really good at complementing each other. With Akka, you get a *globally state-free, dynamic* cluster of operators. With Spark, you get a *globally state-full, static* operator graph. So you build your reactive infra around Akka and then use Spark to add specialized processing components (aggregators, extractors, machine learning, ...) to it. – fnl Oct 26 '17 at 10:31

Spark is to data processing what Akka is to managing data and instruction flow in an application.

TL;DR

Spark and Akka are two different frameworks with different uses and use cases.

When building applications, distributed or otherwise, one may need to schedule and manage tasks in parallel, for example by using threads. Imagine a huge application with lots of threads: how complicated would that be?

Typesafe's (now Lightbend) Akka toolkit lets you use actor systems (originally derived from Erlang), which give you an abstraction layer over threads. These actors are able to communicate with each other by passing anything and everything as messages, and they can do things in parallel without blocking other code.

Akka puts a cherry on top by providing ways to run the actors in a distributed environment.
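
To make the message-passing idea concrete, here is a minimal sketch of two classic Akka actors talking to each other; the Ping/Pong messages and actor names are invented for illustration:

    import akka.actor.{Actor, ActorRef, ActorSystem, Props}

    // Any immutable value can be passed as a message.
    case object Ping
    case object Pong

    class Ponger extends Actor {
      def receive: Receive = {
        case Ping => sender() ! Pong // reply to whoever sent Ping
      }
    }

    class Pinger(ponger: ActorRef) extends Actor {
      ponger ! Ping // kicks things off; the call never blocks

      def receive: Receive = {
        case Pong => println("received a reply, still no threads in sight")
      }
    }

    object Demo extends App {
      val system = ActorSystem("demo")
      val ponger = system.actorOf(Props(new Ponger), "ponger")
      system.actorOf(Props(new Pinger(ponger)), "pinger")
    }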

Apache Spark, on the other hand, is a data processing framework for massive datasets that cannot be handled manually. Spark makes use of what we call an RDD (Resilient Distributed Dataset), which is a distributed, list-like abstraction layer over your traditional data structures, so that operations can be performed on different nodes in parallel.
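
A small sketch of how those list-like RDD operations distribute, using made-up numbers and a made-up partition count:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch extends App {
      val sc = new SparkContext(
        new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

      // An RDD looks like a list, but its 8 partitions can live on
      // different nodes, so each transformation runs on them in parallel.
      val numbers = sc.parallelize(1 to 1000000, numSlices = 8)
      val evenSquareSum = numbers
        .filter(_ % 2 == 0)     // runs per partition
        .map(n => n.toLong * n) // still per partition
        .reduce(_ + _)          // results combined across nodes

      println(s"sum of even squares = $evenSquareSum")
      sc.stop()
    }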

Spark made use of the Akka toolkit for scheduling jobs between different nodes (until version 1.6, when Akka was replaced by Netty).

Chetan Bhasin
  • The Actor model doesn't come from Erlang; it is the mathematical model behind it. Erlang was developed at Ericsson using the Actor model. Akka wanted to do the same, but on the JVM. – Ismail H Sep 10 '18 at 12:00

The choice between Apache Spark, Akka, or Kafka depends heavily on the use case in which they are being deployed, in particular the context and background of the services to be designed. Some of the factors include latency, volume, third-party integrations, and the nature of the processing required (batch or streaming, etc.). I found this resource to be of particular help: https://conferences.oreilly.com/strata/strata-ca-2016/public/schedule/detail/47251