I am running a PySpark application against a remote cluster with Databricks Connect. I'm running into a problem when trying to retrieve the minimum value of a column over the rows where another column has a certain value. When I run the following snippet:
from pyspark.sql import functions as F

feat_min = df.filter(df['target'] == 1).select(
    F.min(F.col('feat')).alias('temp')).first().temp
I am getting this error:
Exception has occurred: Py4JJavaError
An error occurred while calling o5043.collectToPython.
: java.lang.StackOverflowError
at scala.collection.TraversableLike.builder$1(TraversableLike.scala:233)
at scala.collection.TraversableLike.map(TraversableLike.scala:237)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
The Java stack trace is very long but not informative at all. Similarly, the Python stack trace only points to the line where it fails and doesn't provide anything useful.
The dataframe is very small, 1,000 rows or fewer. The problem doesn't happen when I run the code directly on the same cluster, and it also doesn't happen when I run it locally in a different conda environment with plain PySpark installed.
I saw this question and changed spark.driver.maxResultSize as recommended. I tried both 10g and 0 (unlimited), to no avail.
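In case the way I'm setting it matters, this is roughly how I'm overriding it when building the session (the rest of the session setup is whatever Databricks Connect does by default):

from pyspark.sql import SparkSession

# roughly how I'm setting spark.driver.maxResultSize; I've tried "10g" here as well as "0"
spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "0")
         .getOrCreate())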
I think it must have something to do with the Spark configuration on my local machine, but other than spark.driver.maxResultSize I haven't changed anything from the defaults set up by Databricks Connect. For what it's worth, Databricks Connect is installed in a separate conda environment with no PySpark package present, as per the instructions. I have Python 3.8.10 on both my local machine and the cluster, and the Databricks Connect version installed matches my DBR.
Here is my Spark config if it's any help:
('spark.app.startTime', '1637931933606')
('spark.sql.catalogImplementation', 'in-memory')
('spark.driver.host', '192.168.0.36')
('spark.app.name', 'project')
('spark.executor.id', 'driver')
('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
('spark.sql.warehouse.dir', 'file:/home/user/project/spark-warehouse')
('spark.rdd.compress', 'True')
('spark.app.id', 'local-1637931934443')
('spark.serializer.objectStreamReset', '100')
('spark.driver.maxResultSize', '0')
('spark.master', 'local[*]')
('spark.submit.pyFiles', '')
('spark.submit.deployMode', 'client')
('spark.ui.showConsoleProgress', 'true')
('spark.driver.port', '45897')
('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
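(For reference, the list above is just what I get from printing the conf on the active session, roughly like this:

for entry in spark.sparkContext.getConf().getAll():
    print(entry)

in the Databricks Connect environment.)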
Thanks in advance for any input. I'm still pretty new to Spark, and getting Databricks Connect to work properly would be a godsend.