
Say I want to read data from an external HDFS cluster, and I have 3 workers in my cluster (one of which may be a bit closer to external_host, but not on the same host).

sc.textFile("hdfs://external_host/file.txt")

I understand that Spark schedules tasks based on the locality of the underlying RDD. But on which worker (i.e. executor) are the read tasks for `.textFile(..)` scheduled, given that we do not have an executor running on external_host?

I imagine it loads the HDFS blocks as partitions into worker memory, but how does Spark decide which worker is best? (I would imagine it chooses the closest one based on latency or something else - is this correct?)
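To make this concrete, here is a minimal sketch (spark-shell) of how I would inspect the locations Spark records per partition. `textFile()` is just `hadoopFile(...)` followed by a `map`, so the sketch goes through `hadoopFile` directly to reach the underlying `HadoopRDD`; `external_host` is still the placeholder from above.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// The HadoopRDD returned here gets its getPreferredLocations from the
// HDFS InputSplit block locations (i.e. datanodes of external_host's cluster).
val raw = sc.hadoopFile("hdfs://external_host/file.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

raw.partitions.foreach { p =>
  // preferredLocations() surfaces what the scheduler considers the "best" hosts.
  println(s"partition ${p.index}: preferred hosts = ${raw.preferredLocations(p).mkString(", ")}")
}
```

Since none of those hosts run an executor in my setup, my question is essentially what the scheduler falls back to.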

  • Closely related to [How YARN knows data locality in Apache spark in cluster mode](https://stackoverflow.com/q/49944424/6910411) – zero323 Apr 25 '18 at 14:32
  • Thanks, this talks about `getPreferredLocations`, which I understand is used for launching tasks on an existing RDD. Is this also what is used for the creation of a new (Hadoop)RDD with `textFile()`? And in that case, what happens if there is no executor running on the remote HDFS host? – Joe Apr 25 '18 at 14:38
  • To add, I see that for `HadoopRDD` the `getPreferredLocations` come from the Hadoop `InputSplit`'s `getLocationInfo`, but again, what if we do not have an executor on that host? – Joe Apr 25 '18 at 14:46
  • If you don't have nodes on the same host or rack, the best you can get is ANY. I don't think there is any special optimization involved in that case (but I don't have code or docs to back that up at the moment). It might vary from manager to manager, so you should add a cluster manager tag. – zero323 Apr 25 '18 at 14:49
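Following up on the locality comment above, here is a minimal sketch (assuming the standard `SparkListener` API, with `external_host` still a placeholder) to observe which locality level the tasks actually end up with:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Print the locality level (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY, ...)
// each task actually ran with; register the listener before running the job.
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"task ${taskEnd.taskInfo.taskId}: locality = ${taskEnd.taskInfo.taskLocality}")
})

// With no executor on external_host (or its rack), the expectation from the
// comment above is that these tasks report ANY.
sc.textFile("hdfs://external_host/file.txt").count()
```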

0 Answers