I have really big read-only data that I want all the executors on the same node to use. Is that possible in Spark? I know you can broadcast variables, but can you broadcast really big arrays? Does it, under the hood, share the data between executors on the same node? How could it share data between the JVMs of the executors running on the same node?
-
How is the data pinned to the executor? Could you describe the problem you're trying to solve? – maasg Oct 22 '16 at 09:59
-
Basically, I have read-only data which is around 6 GB. This data must be read by each executor from time to time, as it's a sort of lookup table. Each executor must have access to the whole lookup table. I don't want to give that much memory to each executor; I want that memory to be shared between the executors running on the same node, so that I can get away with giving little memory to each executor. – pythonic Oct 22 '16 at 10:06
-
Sounds like you could use some local service to do that, e.g. load that data into a local Redis (or similar in-memory db/cache) and use a singleton JVM object from the Spark job to address the local instance. You will also need a managing service that does the refresh. I don't think there's an out-of-the-box Spark solution to achieve what you want. – maasg Oct 22 '16 at 10:42
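A minimal sketch of that pattern in Scala, assuming a Redis instance running locally on each worker (the Jedis client, port, and `LocalLookup` name are illustrative, not part of Spark):

```scala
import redis.clients.jedis.JedisPool

// Initialized lazily, once per executor JVM: every task on this
// executor reuses the same pool instead of holding its own copy
// of the lookup data.
object LocalLookup {
  lazy val pool = new JedisPool("localhost", 6379)

  def get(key: String): Option[String] = {
    val jedis = pool.getResource
    try Option(jedis.get(key)) finally jedis.close()
  }
}

// Usage inside a job: rdd.map(key => key -> LocalLookup.get(key))
```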
-
What about broadcast variables? How do they work? Aren't they also shared between executors on the same node? – pythonic Oct 22 '16 at 10:56
-
Broadcast variables allow sharing of data among tasks running on the same executor VM, so the data needs to be loaded only once per executor. – maasg Oct 22 '16 at 11:08
2 Answers
Yes, you can use broadcast variables, considering your data is read-only (immutable). A broadcast variable must satisfy the following properties:
- Fit in memory
- Immutable
- Distributed to the cluster
So the only condition here is that your data has to be able to fit in memory on one node. That means the data should NOT be anything super large or beyond memory limits, like a massive table.
Each executor receives a copy of the broadcast variable, and all the tasks in that particular executor read/use that data. It's like sending large, read-only data to all the worker nodes in the cluster, i.e., shipping it to each worker only once instead of with each task, and letting the executors (their tasks) read the data.
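For illustration, a minimal Scala sketch of that pattern (the paths, file format, and key/value types are hypothetical; in practice the lookup table would be the ~6 GB data the OP describes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-lookup"))

// Build the read-only lookup table once on the driver.
val lookup: Map[String, String] =
  sc.textFile("hdfs://namenode:8020/data/lookup.tsv")
    .map { line => val Array(k, v) = line.split("\t"); (k, v) }
    .collect()
    .toMap

// Ship one read-only copy to each executor JVM, not one per task.
val lookupBc = sc.broadcast(lookup)

// Every task on an executor reads the same local copy via .value.
val enriched = sc.textFile("hdfs://namenode:8020/data/keys.txt")
  .map(key => key -> lookupBc.value.getOrElse(key, "unknown"))
```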
-
-
@LostInOverflow I believe that the question creates some confusion. The OP isn't using the correct wording. 2 executors != 2 application JVMs – eliasah Oct 22 '16 at 16:27
-
OK, yes. The answer for JVM sharing is: Apache Spark is a distributed data-processing framework, so you can't share jobs/applications/tasks or RDDs directly. The only way data sharing is possible is via persistent storage like HDFS. Apache Ignite is a framework which provides an abstraction on top of RDDs called IgniteRDD, an implementation of the native Spark RDD and DataFrame APIs which shares the state of the RDD across other jobs, applications and workers. – Kris Oct 22 '16 at 16:28
-
But each executor runs in a separate JVM, or no? And I was talking about executors running on the same node only. – pythonic Oct 22 '16 at 17:48
-
Yes, each executor runs in a separate JVM. When running a Spark application on a cluster or a single node, each application gets an independent set of executor JVMs that only run tasks for that application. That means tasks from different applications run in different JVMs. It also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system like HDFS. – Kris Oct 22 '16 at 19:06
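A minimal sketch of that external-storage handoff, assuming HDFS (the paths and app names are illustrative, and the two contexts stand for two separately launched applications):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application A (its own SparkContext and executor JVMs) writes:
val scA = new SparkContext(new SparkConf().setAppName("writer"))
scA.parallelize(Seq("a", "b", "c"))
   .saveAsTextFile("hdfs://namenode:8020/shared/handoff")
scA.stop()

// Application B, started separately, reads the same path back:
val scB = new SparkContext(new SparkConf().setAppName("reader"))
val shared = scB.textFile("hdfs://namenode:8020/shared/handoff")
```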
-
I was talking about sharing data between executors run by the same Spark application. – pythonic Oct 22 '16 at 19:31
-
@pythonic: we can't share data between the executors, and they are running in different JVMs. – Kris Oct 22 '16 at 20:15
-
As mentioned earlier, we can't share data between the executors, and the executors are running in different JVMs. Two executor threads always run on the same JVM for an application. – Kris Oct 22 '16 at 20:25
I assume you're asking how executors can share mutable state. If you only need to share immutable data, then you can just refer to @Stanislav's answer.
If you need mutable state between executors, there are quite a few approaches:
- shared external FS/DB
- stateful streaming (Databricks doc)
- mutable distributed shared cache (Ignite RDD; see the sketch after this list)
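For the last option, a hedged sketch using Ignite's Spark integration (the cache name and types are made up; assumes the ignite-spark artifact on the classpath and `sc` as the active SparkContext):

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext

// Wraps the SparkContext and starts (or connects to) Ignite nodes.
val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())

// An IgniteRDD is a live view over an Ignite cache: unlike a plain
// RDD it is mutable and visible to other jobs, applications, workers.
val sharedRDD = igniteContext.fromCache[String, Int]("sharedState")

// One job writes state...
sharedRDD.savePairs(sc.parallelize(1 to 100).map(i => (i.toString, i)))

// ...and another job (even another application) can read it back.
val bigValues = sharedRDD.filter { case (_, v) => v > 50 }.count()
```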
