I am trying to use Arrow by setting spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true"), but I am getting the following error:
/databricks/spark/python/pyspark/sql/pandas/conversion.py:340: UserWarning: createDataFrame
attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to
true; however, failed by the reason below:
[Errno 13] Permission denied: '/local_disk0/spark-0419ce26-a5d1-4c8a-b985-55ca5737a123/pyspark-f272e212-2760-40d2-9e6c-891f858a9a48/tmp92jv6g71'
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to
true.
warnings.warn(msg)
/databricks/spark/python/pyspark/sql/pandas/conversion.py:161: UserWarning: toPandas
attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to
true, but has reached the error below and can not continue. Note that
'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in
the middle of computation.
arrow is not supported when using file-based collect
warnings.warn(msg)
Exception: arrow is not supported when using file-based collect
Our cluster runtime is 10.3 (includes Apache Spark 3.2.1, Scala 2.12), and the driver type is Standard_E32_v3.
Below is the code I have tried, taken from the documentation (documentation link):
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Generate a Pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))
# Create a Spark DataFrame from a Pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)
# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()
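For what it may be worth while diagnosing: the first warning is a plain [Errno 13] write failure in Spark's local scratch directory, so it can help to confirm whether the Python process can actually create temp files in that location. Below is a minimal, self-contained sketch of such a check; the directory names are placeholders (on the cluster you would point it at the actual scratch path from the error, e.g. under /local_disk0), not anything Spark-specific:

```python
import os
import tempfile

def can_write_tmp(directory):
    """Return True if this process can create a temp file in `directory`."""
    if not os.path.isdir(directory):
        return False
    try:
        # NamedTemporaryFile creates the file immediately and deletes it on exit
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:  # includes PermissionError (Errno 13)
        return False

# The system temp dir should normally be writable; a missing dir should not be.
print(can_write_tmp(tempfile.gettempdir()))
print(can_write_tmp("/this-directory-should-not-exist"))
```

If this returns False for Spark's scratch directory, the problem is filesystem permissions rather than Arrow itself, which would also explain why the fallback path then fails separately.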