I am new to Apache Spark. I have a cluster with one master and one worker, and I am connected to the master with pyspark (all machines are Ubuntu VMs).

I am reading this documentation: RDD external-datasets

In particular, I have executed:

distFile = sc.textFile("data.txt")

I understand that this creates an RDD from the file, which should be managed by the driver, i.e. by the pyspark app. But the docs state:

If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
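If I read that correctly, the two options would look something like this (the mount point and paths below are made-up placeholders, and sc is the context provided by the pyspark shell):

# Option 1: a network-mounted shared filesystem, e.g. an NFS mount that
# every node (driver and workers) sees at the same path.
distFile = sc.textFile("file:///mnt/shared/data.txt")

# Option 2: copy the file beforehand (e.g. with scp) to the identical
# path on every worker, then read it by that common local path.
distFile = sc.textFile("file:///home/user/data.txt")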

My question is: why do the workers need access to the file path if the RDD is created by the driver only (and afterwards distributed to the nodes)?
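
For completeness, here is a minimal sketch of the whole session as I would run it as a standalone script rather than in the shell (the master URL and the count() action are illustrative additions on my part):

from pyspark import SparkContext

# Connect to the standalone master (URL is a placeholder for my setup).
sc = SparkContext("spark://master-vm:7077", "TextFileTest")

# This only declares the RDD; nothing is read at this point.
distFile = sc.textFile("data.txt")

# An action such as count() is what actually triggers reading the file.
print(distFile.count())

sc.stop()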
