15

I have recently started getting a bunch of errors on a number of PySpark jobs running on EMR clusters. The errors are:

java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:98)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:96)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:127)...

They all seem to happen in `apply` functions of a pandas Series. The only change I found is that pyarrow was updated on Saturday (05/10/2019). Tests seem to work with 0.14.1.
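Roughly, the failing jobs use scalar pandas UDFs like this (a minimal sketch; the column and function names are illustrative, and `spark` is an existing SparkSession):

    from pyspark.sql.functions import col, pandas_udf, PandasUDFType

    # Scalar pandas UDF: receives and returns a pandas Series. The batches
    # travel between the JVM and the Python workers as Arrow streams, which
    # is where MessageSerializer.readMessage raises the error above.
    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_one(s):
        return s.apply(lambda x: x + 1.0)

    df = spark.range(10).withColumn("x", col("id").cast("double"))
    df.select(plus_one(col("x"))).show()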

So my question is: does anyone know whether this is a bug in the newly updated pyarrow, or is there some significant change that will make pandas UDFs hard to use in the future?

ilijaluve

2 Answers

24

It's not a bug. We made an important protocol change in 0.15.0 that makes the default behavior of pyarrow incompatible with older versions of Arrow in Java; your Spark environment seems to be using an older version.

Your options are:

  • Set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 from wherever you are using Python (see the sketch after this list)
  • Downgrade to pyarrow < 0.15.0 for now
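For the first option, a minimal sketch of setting the variable in a Python driver script (do this before any Arrow data is exchanged; on a cluster the executors' Python workers need the variable as well, which the second answer's spark-submit flags handle):

    import os

    # Tell pyarrow >= 0.15.0 to keep writing the pre-0.15 Arrow IPC stream
    # format that the older Arrow Java library bundled with Spark understands.
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

For the second option, pinning the package (e.g. pip install "pyarrow<0.15.0") keeps the old format without any environment changes.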

Hopefully the Spark community will be able to upgrade to 0.15.0 in Java soon so this issue goes away.

This is discussed in http://arrow.apache.org/blog/2019/10/06/0.15.0-release/

Wes McKinney
  • Where should I define the environment variable if I am using AWS EMR? I put it in the software configuration Classification `spark-env.export`, but it does not seem to solve the problem. – panc Jun 17 '21 at 22:31
  • So I chose to downgrade pyarrow to 0.14.0, and it seems to work. But I would still prefer to use the first method. – panc Jun 17 '21 at 22:57
2

In Spark, try appending the following configuration flags to your spark-submit command:

spark-submit --deploy-mode cluster \
  --conf spark.yarn.appExecutorEnv.ARROW_PRE_0_15_IPC_FORMAT=1 \
  --conf spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT=1 \
  --conf spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1
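Equivalently, if you construct the SparkSession in driver code, a sketch with the same properties (the app name is illustrative):

    from pyspark.sql import SparkSession

    # Propagate the Arrow compatibility flag to the YARN application master
    # and to every executor, so the Python workers running the pandas UDFs
    # emit the pre-0.15 IPC format.
    spark = (
        SparkSession.builder
        .appName("arrow-compat-example")  # illustrative
        .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
        .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
        .getOrCreate()
    )

Note that in cluster deploy mode the spark.yarn.* properties must be in place before the application is submitted, so the spark-submit flags above are the more reliable route.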
StanislavKo