
I am using GCP Dataproc for some Spark/GraphFrames calculations.

On my private Spark/Hadoop standalone cluster, I have no issue using functools.partial when defining a PySpark UDF.

But with GCP Dataproc, I run into the issue below.

Here is a basic setup to check whether partial works:

import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial

def power(base, exponent):
    return base ** exponent

In the main function, functools.partial works as expected in ordinary cases:

# see whether partial works as it is
square = partial(power, exponent=2)
print "*** Partial test = ", square(2)

But if I pass this partial(power, exponent=2) function to a PySpark UDF as below,

testSquareUDF = F.udf(partial(power, exponent=2), T.FloatType())
testdf = inputdf.withColumn('pxsquare', testSquareUDF('px'))

I get this error message:

Traceback (most recent call last):
  File "/tmp/bf297080f57a457dba4d3b347ed53ef0/gcloudtest-partial-error.py", line 120, in <module>
    testSquareUDF = F.udf(square,T.FloatType())
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1971, in udf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1955, in _udf
  File "/opt/conda/lib/python2.7/functools.py", line 33, in update_wrapper
    setattr(wrapper, attr, getattr(wrapped, attr))

AttributeError: 'functools.partial' object has no attribute '__module__'

ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [bf297080f57a457dba4d3b347ed53ef0] entered state [ERROR] while waiting for [DONE].

=========

I had no such issue with my standalone cluster. My Spark cluster version is 2.1.1; the GCP Dataproc one is 2.2.x.

Can anyone see what prevents me from passing the partial function to the UDF?

  • I have found that **spark 2.2** has some issue on passing `functools.partial` to UDF. This was fixed with the release of **spark 2.3**. So, avoiding 2.2 if you use `partial` a lot. – Sungryong Hong Aug 02 '18 at 23:35
  • FYI Dataproc supports Spark 2.3 as well: https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions. Just pass `--image-version=1.3` when creating your cluster. – Karthik Palaniappan Aug 03 '18 at 07:24
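
If staying on Spark 2.2 is unavoidable, here is a minimal workaround sketch (it reuses power, the imports, and inputdf from the question; the wrapper name square_fn is just an illustration): hide the partial behind a plain function, since a regular function carries the __module__ attribute that Spark 2.2's internal functools.wraps call tries to copy.

square = partial(power, exponent=2)

# A plain def has __module__/__name__, so F.udf's wrapping succeeds;
# the partial is only called inside the wrapper at execution time.
def square_fn(base):
    return square(base)

testSquareUDF = F.udf(square_fn, T.FloatType())
testdf = inputdf.withColumn('pxsquare', testSquareUDF('px'))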

1 Answer


As discussed in the comments, the issue was with Spark 2.2: its F.udf wraps the given callable with functools.wraps, and a functools.partial object has no __module__ attribute for it to copy, hence the AttributeError above. Spark 2.3 no longer has this problem, and since Dataproc also supports Spark 2.3, passing --image-version=1.3 when creating the cluster fixes it.
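
For example, when creating the cluster (the cluster name here is just a placeholder; add your usual region/zone and other flags):

gcloud dataproc clusters create my-cluster --image-version=1.3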

rilla