
I searched for this kind of error and couldn't find any information on how to solve it. This is what I get when I run the two scripts below:

org.apache.arrow.memory.OutOfMemoryException: Failure while allocating memory.

write.py

import pandas as pd
from pyspark.sql import SparkSession
from os.path import abspath

warehouse_location = abspath('spark-warehouse')

booksPD = pd.read_csv('books.csv')

spark = SparkSession.builder \
        .appName("MyApp") \
        .master("local[*]") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .config("spark.driver.maxResultSize", "16g") \
        .config("spark.python.worker.memory", "16g") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .enableHiveSupport() \
        .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

spark.createDataFrame(booksPD).write.saveAsTable("books")
spark.catalog.clearCache()

read.py

from pyspark.sql import SparkSession
from os.path import abspath

warehouse_location = abspath('spark-warehouse')

spark = SparkSession.builder \
        .appName("MyApp") \
        .master("local[*]") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .config("spark.driver.maxResultSize", "16g") \
        .config("spark.python.worker.memory", "16g") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .enableHiveSupport() \
        .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

books = spark.sql("SELECT * FROM books").toPandas()

1 Answer

Most probably, the memory limits need to be increased. Adding the configurations below, which raise the driver and executor memory, solved the problem in my case.

.config("spark.driver.memory", "16g") \
.config("spark.executor.memory", "16g") \

Since the program is configured to run in local mode (.master("local[*]")), everything runs inside the driver process, so the driver itself carries the load and needs enough memory.
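
For reference, a minimal sketch of the builder from write.py/read.py with the two extra settings added; the 16g values are only what worked here, so size them to your machine:

from os.path import abspath
from pyspark.sql import SparkSession

warehouse_location = abspath('spark-warehouse')

# Same builder as in write.py/read.py, with driver and executor memory
# raised so Arrow has room to allocate its buffers.
# The 16g values are illustrative; pick sizes that fit your hardware.
spark = SparkSession.builder \
        .appName("MyApp") \
        .master("local[*]") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .config("spark.driver.memory", "16g") \
        .config("spark.executor.memory", "16g") \
        .config("spark.driver.maxResultSize", "16g") \
        .config("spark.python.worker.memory", "16g") \
        .config("spark.sql.warehouse.dir", warehouse_location) \
        .enableHiveSupport() \
        .getOrCreate()

If you launch the script through spark-submit rather than plain python, pass the same limits on the command line instead (e.g. spark-submit --driver-memory 16g --executor-memory 16g write.py), since the driver JVM is already running by the time the builder configs are read.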
