
How can I determine the hostname of the machine holding a particular partition of an RDD?

I realize Spark does not intend to expose this information to casual users, but I'm trying to interface Spark with another system, and knowing the physical locations of the partitions would allow for more efficient transfers.

AatG

2 Answers


You can call foreachPartition on the RDD and read the hostname from inside each partition. Something like this (in pyspark):

import socket

def f(iterator):
    log2file(socket.gethostname())

rdd.foreachPartition(f)

where log2file is some function of yours that writes to a log file, and socket.gethostname() is the standard Python way to get the machine's hostname.

If you want to get the results back you can use mapPartitions as follows:

def f(iterator): yield socket.gethostname()
rdd.mapPartitions(f).collect()
Assaf Mendelson
    I'd say use `mapPartitionsWithIndex()` instead to give you a known partitionID for each partition. – Travis Hegner Dec 02 '16 at 18:10
  • Also, `.collect()` will return an `Array`, local to the driver, rather than an `RDD`. Likely, that is the desired behavior for the OP anyway. – Travis Hegner Dec 02 '16 at 18:12
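
Building on the comments above, a minimal sketch of the mapPartitionsWithIndex() variant (assuming the same rdd; partition_host is just an illustrative name):

import socket

def partition_host(index, iterator):
    # Emit one record per partition: (partition index, worker hostname).
    yield (index, socket.gethostname())

# Collects a local list of (partitionId, hostname) pairs on the driver.
rdd.mapPartitionsWithIndex(partition_host).collect()

Note this only tells you where each partition was computed for this particular job; unless the RDD is cached, Spark may place it elsewhere on the next run.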

Found a solution in another Stack Overflow question, How to get ID of a map task in Spark?. This information is available in the TaskContext object, which you can use like so:

import org.apache.spark.TaskContext

sc.parallelize(1 to 10, 3).foreachPartition(_ => {
    val ctx = TaskContext.get
    val stageId = ctx.stageId
    val partId = ctx.partitionId
    // taskMetrics.hostname is not available in every Spark version
    val hostname = ctx.taskMetrics.hostname
    // prints to the executor's stdout, not the driver console
    println(s"Stage: $stageId, Partition: $partId, Host: $hostname")
})
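
A rough pyspark equivalent, assuming Spark 2.2+ (where TaskContext is exposed to Python); the Python TaskContext doesn't expose the hostname, so socket.gethostname() stands in:

import socket
from pyspark import TaskContext

def report(iterator):
    ctx = TaskContext.get()  # only valid inside a running task
    print("Stage: %d, Partition: %d, Host: %s"
          % (ctx.stageId(), ctx.partitionId(), socket.gethostname()))

sc.parallelize(range(10), 3).foreachPartition(report)

As with the Scala version, the output ends up in each executor's stdout, not on the driver console.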
AatG