
I have a file on the master node that should be read by each node. How can I make this possible? In Hadoop's MapReduce I used

DistributedCache.getLocalCacheFiles(context.getConfiguration())

How does Spark handle file sharing between nodes? Do I have to load the file into RAM and broadcast it as a variable? Or can I just indicate the (absolute?) path of the file in the SparkContext configuration so that it becomes instantly available on all nodes?
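
For reference, the broadcast-variable alternative I mention would look roughly like this (a minimal sketch; the file path and RDD are placeholders, and it assumes the file fits in memory):

import scala.io.Source

// Read the whole file on the driver (assumes it fits in memory)
val lines = Source.fromFile("/path/to/file.txt").getLines().toList
// Ship the contents once to every executor
val broadcastLines = sc.broadcast(lines)
// Tasks access the shared data through .value
val sizes = sc.parallelize(1 to 4).map(_ => broadcastLines.value.size)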

Andrean

2 Answers


You can use SparkFiles to read files from the distributed cache.

import org.apache.spark.SparkFiles
import org.apache.hadoop.fs.Path

// Distribute the file to every node in the cluster
sc.addFile("/path/to/file.txt")
// Resolve the local copy of the file on the worker node
val pathOnWorkerNode = new Path(SparkFiles.get("file.txt"))
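
If it helps, a rough usage sketch under the same assumptions (file name as above): once the file has been added with sc.addFile, each task can resolve its own local copy, for example inside a map:

val rdd = sc.parallelize(1 to 4)
val firstLines = rdd.map { _ =>
  // SparkFiles.get returns the local path of the file on this executor
  val localPath = SparkFiles.get("file.txt")
  scala.io.Source.fromFile(localPath).getLines().next()
}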
Sanchit Grover

Look at the spark-submit --files parameter, for example, here:

Running Spark jobs on a YARN cluster with additional files
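
A minimal sketch of that approach (the submit command, class and file names are placeholders): ship the file with --files at submit time, then resolve it from the application:

// Hypothetical submit command (placeholder names):
//   spark-submit --master yarn --files /local/path/lookup.txt --class com.example.App app.jar

import org.apache.spark.SparkFiles

// Files shipped with --files are placed in each executor's working directory
// and can be resolved by their bare file name
val localPath = SparkFiles.get("lookup.txt")
val lookup = scala.io.Source.fromFile(localPath).getLines().toSet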

pasha701