I am new to Apache Spark. I have a cluster with a master and one worker, and I am connected to the master with pyspark (all of them run on Ubuntu VMs).
I am reading this documentation: RDD external-datasets
In particular, I have executed:
distFile = sc.textFile("data.txt")
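If I understand Spark's lazy evaluation correctly, this call by itself does not read anything yet; the file is only opened once an action runs. A minimal sketch from the pyspark shell (sc is the SparkContext the shell provides; count() is just an example action):

    distFile = sc.textFile("data.txt")  # returns an RDD immediately; no file I/O yet
    distFile.count()                    # action: only now does Spark try to open data.txt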
I understand that this creates an RDD from the file, and I assumed it would be managed by the driver, hence by the pyspark app. But the docs state:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
My question is: why do the workers need access to the file path at all, if the RDD is created by the driver alone (and only afterwards distributed to the nodes)?
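To make this concrete, here is a minimal standalone sketch of what I am running (the master URL is a placeholder for my setup; the map/reduce is the docs' own line-length example):

    from pyspark import SparkContext

    # connect to my standalone master (placeholder URL)
    sc = SparkContext("spark://master:7077", "textfile-question")

    # data.txt exists only on the master/driver machine
    distFile = sc.textFile("data.txt")

    # I expected the driver to read the file here and ship the data to the workers
    total = distFile.map(lambda line: len(line)).reduce(lambda a, b: a + b)
    print(total)

As far as I can tell from the quoted warning, the reduce step would fail unless data.txt also exists at the same path on each worker, and that is the part I do not understand.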