
I have a file on the master node that should be read by each node. How can I make this possible? In Hadoop's MapReduce I used

DistributedCache.getLocalCacheFiles(context.getConfiguration())

How does Spark handle file sharing between nodes? Do I have to load the file into RAM and broadcast it as a variable? Or can I just indicate the (absolute?) path of the file in the SparkContext configuration so that it becomes instantly available on all nodes?
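
For reference, the broadcast-variable alternative I mention would look roughly like this (a minimal sketch; the file path and RDD are placeholders, and it assumes the file fits in memory):

import scala.io.Source

// Read the whole file on the driver (assumes it fits in memory)
val lines = Source.fromFile("/path/to/file.txt").getLines().toList
// Ship the contents once to every executor
val broadcastLines = sc.broadcast(lines)
// Tasks access the shared data through .value
val sizes = sc.parallelize(1 to 4).map(_ => broadcastLines.value.size)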

Andrean

2 Answers


You can use SparkFiles to read files from the distributed cache.

import org.apache.spark.SparkFiles
import org.apache.hadoop.fs.Path

// Distribute the file to every node in the cluster
sc.addFile("/path/to/file.txt")
// Resolve the local copy of the file on the worker node
val pathOnWorkerNode = new Path(SparkFiles.get("file.txt"))
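
If it helps, a rough usage sketch under the same assumptions (file name as above): once the file has been added with sc.addFile, each task can resolve its own local copy, for example inside a map:

val rdd = sc.parallelize(1 to 4)
val firstLines = rdd.map { _ =>
  // SparkFiles.get returns the local path of the file on this executor
  val localPath = SparkFiles.get("file.txt")
  scala.io.Source.fromFile(localPath).getLines().next()
}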
Sanchit Grover

Look at the spark-submit --files parameter, for example, here:

Running Spark jobs on a YARN cluster with additional files
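
A minimal sketch of that approach (the submit command, class and file names are placeholders): ship the file with --files at submit time, then resolve it from the application:

// Hypothetical submit command (placeholder names):
//   spark-submit --master yarn --files /local/path/lookup.txt --class com.example.App app.jar

import org.apache.spark.SparkFiles

// Files shipped with --files are placed in each executor's working directory
// and can be resolved by their bare file name
val localPath = SparkFiles.get("lookup.txt")
val lookup = scala.io.Source.fromFile(localPath).getLines().toSet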

pasha701