
I'm trying to use Pandas UDFs (a.k.a. Vectorized UDFs) in Apache Spark 2.4.0 on macOS 10.14.3 (macOS Mojave).

I installed pandas and pyarrow using pip (and later pip3).

Whenever I execute the sample code from the official Spark SQL documentation, I get an exception.

import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

def multiply_func(a, b):
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

x = pd.Series([1, 2, 3])
print(multiply_func(x, x))
# `spark` is assumed to be the SparkSession that the pyspark shell creates on startup
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

# Execute function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()

The exception is as follows:

objc[97883]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called.
objc[97883]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
19/03/27 15:01:20 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:486)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:475)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:34)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:178)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:98)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
    at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:128)
    ...
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:159)
    ... 28 more
Jacek Laskowski

1 Answer


I found a solution in [Doesn't work on macOS High Sierra #69](https://github.com/rtomayko/shotgun/issues/69) and thought I'd post it on Stack Overflow.


You should make sure that Xcode's command line tools are already installed. If not, execute the following:

xcode-select --install

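The important part turned out to be exporting the OBJC_DISABLE_INITIALIZE_FORK_SAFETY environment variable. It turns off the Objective-C runtime's fork-safety check, which is what crashes the forked Python worker in the log above: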

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

With both steps in place, the code worked fine:

>>> # Execute function as a Spark vectorized UDF
... df.select(multiply(col("x"), col("x"))).show()
[Stage 0:>                                                          (0 + 1) / 1]/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:159: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
+-------------------+
|multiply_func(x, x)|
+-------------------+
|                  1|
|                  4|
|                  9|
+-------------------+
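
If the session is started from a script or an IDE rather than from a shell where `export` was run, the variable can also be set from Python before Spark starts. This is only a minimal sketch, assuming local mode, where the JVM and the Python workers inherit the driver's environment. The names `disable_objc_fork_safety` and `build_session` and the app name are made up for illustration.

```python
import os

FORK_SAFETY_VAR = "OBJC_DISABLE_INITIALIZE_FORK_SAFETY"


def disable_objc_fork_safety(env=None):
    """Set the variable to YES unless a value is already present.

    `env` defaults to this process's environment; passing a plain dict
    is handy for experimenting without touching os.environ.
    """
    if env is None:
        env = os.environ
    env.setdefault(FORK_SAFETY_VAR, "YES")
    return env[FORK_SAFETY_VAR]


# Must run before the SparkSession is created, because child processes
# only see the environment that exists at the moment they are launched.
disable_objc_fork_safety()


def build_session():
    """Hypothetical helper: create a local SparkSession after the variable is set."""
    # Imported here so that pyspark is only needed when a session is requested.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")
        .appName("pandas-udf-demo")  # illustrative name
        .getOrCreate()
    )
```

Calling `spark = build_session()` and then running the `multiply` example should behave like the shell session above. In an IDE, setting the variable in the run configuration achieves the same thing without code changes.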
Jacek Laskowski
  • Perhaps a dumb question, but you link to a Ruby project. Why is it important to set this environment variable on macOS High Sierra when using pandas UDFs? And thumbs up for the answer. – tpain Mar 02 '20 at 16:34
  • May I ask you to point me to "you link to a ruby project"? Where's this "ruby"? – Jacek Laskowski Mar 03 '20 at 09:54
  • The link [Doesn't work on macOS High Sierra #69](https://github.com/rtomayko/shotgun/issues/69) is to a thread in a GitHub project that uses the Ruby programming language. That's what I meant by "Ruby project". I just don't see how you connected that thread with Pandas UDFs :) – tpain Mar 03 '20 at 11:19
  • Ah, right! Because of `objc[97883]: +[__NSPlaceholderDictionary initialize]` in the logs that led to https://github.com/rtomayko/shotgun/issues/69#issuecomment-426839438. Agree? – Jacek Laskowski Mar 03 '20 at 15:08
  • Ahh ok, got it. Thanks again! :) – tpain Mar 03 '20 at 15:31
  • Important to mention: if you're running it from inside an IDE, you have to set the environment variable in the IDE itself. This is what was not working for me. – Felipe Martins Melo Apr 10 '23 at 17:45