Say I want to read data from an external HDFS cluster, and I have 3 workers in my own cluster (one of them may be somewhat closer to external_host, but none are on the same host):
sc.textFile("hdfs://external_host/file.txt")
I understand that Spark schedules tasks based on the locality of the underlying RDD. But on which workers (i.e. executors) are the read tasks for .textFile(..) scheduled, given that we do not have an executor running on external_host?
I imagine Spark loads the HDFS blocks into worker memory as partitions, but how does it decide which worker is best for each block? (I would guess it chooses the closest one, based on latency or something else; is that correct?)
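For what it's worth, I tried to inspect the locality hints Spark records per partition with something like the sketch below (run in the spark-shell, using the public RDD.preferredLocations method; external_host is just a placeholder for the real namenode address):

    // Inspect the hosts Spark prefers for each partition of the RDD.
    val rdd = sc.textFile("hdfs://external_host/file.txt")
    rdd.partitions.foreach { p =>
      // preferredLocations exposes the locality hints (e.g. HDFS datanode
      // hosts) that the scheduler consults when placing each read task.
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }

In my case every partition reports hosts from the external cluster, where no executor runs, which is exactly why I am unsure how the scheduler breaks the tie between my 3 workers.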