How to determine "preferred location" for partitions of PySpark dataframe?

Question

I'm trying to understand how coalesce decides which initial partitions to merge into each final partition, and apparently the "preferred location" has something to do with it.

According to this question, Scala Spark has a function preferredLocations(split: Partition) that can identify this. But I'm not at all familiar with the Scala side of Spark. Is there a way to determine the preferred location of a given row or partition ID at the PySpark level?

score 2 · Accepted Answer · answered Jun 15 '18 at 12:45

Yes, it is theoretically possible. Example data to force some form of preference (there could be a simpler example):

rdd1 = sc.range(10).map(lambda x: (x % 4, None)).partitionBy(8)
rdd2 = sc.range(10).map(lambda x: (x % 4, None)).partitionBy(8)

# Force caching so downstream plan has preferences
rdd1.cache().count()

rdd3 = rdd1.union(rdd2)

Now you can define a helper:

from pyspark import SparkContext

def prefered_locations(rdd):
    def to_py_generator(xs):
        """Convert Scala List to Python generator"""
        j_iter = xs.iterator()
        while j_iter.hasNext():
            yield j_iter.next()

    # Get JVM
    jvm = SparkContext._active_spark_context._jvm
    # Get Scala RDD
    srdd = jvm.org.apache.spark.api.java.JavaRDD.toRDD(rdd._jrdd)
    # Get partitions
    partitions = srdd.partitions()
    return {
        p.index(): list(to_py_generator(srdd.preferredLocations(p)))
        for p in partitions
    }

Applied:

prefered_locations(rdd3)

# {0: ['...'],
#  1: ['...'],
#  2: ['...'],
#  3: ['...'],
#  4: [],
#  5: [],
#  6: [],
#  7: []}
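
Since the helper returns a plain Python dict, it can be inspected without touching the JVM again. Below is a minimal sketch, assuming the result has the shape shown above (partition index mapped to a list of location strings). The name split_by_preference and the host names in the sample are hypothetical, made up for illustration; they are not part of Spark.

```python
def split_by_preference(locations):
    """Split {partition_index: [locations]} into two groups.

    Returns a dict of the partitions that report at least one preferred
    location, and a sorted list of the partition indices that report none.
    """
    with_pref = {i: hosts for i, hosts in locations.items() if hosts}
    without_pref = sorted(i for i, hosts in locations.items() if not hosts)
    return with_pref, without_pref


# Hypothetical sample shaped like the output above; host names are invented.
sample = {0: ["host-a"], 1: ["host-b"], 2: [], 3: []}
with_pref, without_pref = split_by_preference(sample)
print(with_pref)     # {0: ['host-a'], 1: ['host-b']}
print(without_pref)  # [2, 3]
```

If every partition ends up in the second group, the RDD most likely carries no locality information at all, which is a normal outcome rather than an error.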

This code runs without errors on my RDD and returns the expected number of partitions, but all have an empty list. Can I take this to mean that my partitions don't actually have any preferred location information? Or could this be a bug? — abeboparebop, Jun 15 '18 at 13:19
Many RDDs have no preferred locations at all (that's why the example above is so elaborate). Even in that example, only some partitions have preferred locations (I believe this is a result of the partitioner-aware union). If you use sources that support data locality constraints, it should be more obvious. — Alper t. Turker, Jun 15 '18 at 14:27
