I am running a PySpark application against a remote cluster with Databricks Connect. I'm running into a problem when trying to retrieve the minimum value of a column over the rows where another column has a certain value. When I run the following snippet:
from pyspark.sql import functions as F

feat_min = df.filter(df['target'] == 1).select(
    F.min(F.col('feat')).alias('temp')).first().temp
I am getting this error:
Exception has occurred: Py4JJavaError
An error occurred while calling o5043.collectToPython.
: java.lang.StackOverflowError
at scala.collection.TraversableLike.builder$1(TraversableLike.scala:233)
at scala.collection.TraversableLike.map(TraversableLike.scala:237)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
The Java stack trace is very long but not informative at all. Similarly, the Python stack trace only points to the line where it fails and doesn't provide anything useful.
The dataframe is very small, 1,000 rows or fewer. The problem doesn't happen when I run the code directly on the same cluster, and it also doesn't happen when I run it locally in a different conda environment with plain PySpark installed.
I saw this question and changed spark.driver.maxResultSize as recommended. I tried both 10g and 0 (unlimited), to no avail.
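In case the way I'm setting it matters, this is roughly how I'm overriding it when building the session (the rest of the session setup is whatever Databricks Connect does by default):

from pyspark.sql import SparkSession

# roughly how I'm setting spark.driver.maxResultSize; I've tried "10g" here as well as "0"
spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "0")
         .getOrCreate())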
I think it must have something to do with the Spark configuration on my local machine, but other than spark.driver.maxResultSize I haven't changed anything from the defaults set up by Databricks Connect. For what it's worth, Databricks Connect is installed in a separate conda environment with no PySpark package present, as per the instructions. I have Python 3.8.10 on both my local machine and the cluster, and the Databricks Connect version installed matches my DBR.
Here is my Spark config if it's any help:
('spark.app.startTime', '1637931933606')
('spark.sql.catalogImplementation', 'in-memory')
('spark.driver.host', '192.168.0.36')
('spark.app.name', 'project')
('spark.executor.id', 'driver')
('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
('spark.sql.warehouse.dir', 'file:/home/user/project/spark-warehouse')
('spark.rdd.compress', 'True')
('spark.app.id', 'local-1637931934443')
('spark.serializer.objectStreamReset', '100')
('spark.driver.maxResultSize', '0')
('spark.master', 'local[*]')
('spark.submit.pyFiles', '')
('spark.submit.deployMode', 'client')
('spark.ui.showConsoleProgress', 'true')
('spark.driver.port', '45897')
('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
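(For reference, the list above is just what I get from printing the conf on the active session, roughly like this:

for entry in spark.sparkContext.getConf().getAll():
    print(entry)

in the Databricks Connect environment.)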
Thanks in advance for any input. I'm still pretty new to Spark, and getting Databricks Connect to work properly would be a godsend.