
I've tried to follow the instructions here to set up databricks-connect with IntelliJ. My understanding is that I can run code from the IDE and it will run on the Databricks cluster.

I added the jar directory from the miniconda environment and moved it above all of the maven dependencies in File -> Project Structure...

However, I think I did something wrong. When I tried to run my module, I got the following error:

21/07/17 22:44:24 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
    at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:221)
    at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:201)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:413)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:262)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:291)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:495)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2834)
    at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1016)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1010)
    at com.*.sitecomStreaming.sitecomStreaming$.main(sitecomStreaming.scala:184)
    at com.*.sitecomStreaming.sitecomStreaming.main(sitecomStreaming.scala)

The system memory being 259 GB makes me think it's trying to run locally on my laptop instead of on the dbx cluster? I'm not sure if this is correct or what I can do to get this up and running properly...

Any help is appreciated!

steven hurwitt
  • Whoops, maybe I should delete this b/c I just remembered one of the limitations is that databricks-connect can't do structured streaming... – steven hurwitt Jul 18 '21 at 04:00

1 Answer


The driver in databricks-connect always runs locally - only the executors run in the cloud. Also, the reported memory is in bytes, so 259522560 is ~260 MB (not 259 GB) - you can increase it using the option the error message reports.
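A minimal sketch of where that setting would go, assuming the job is launched from IntelliJ (so the driver JVM is the local IDE process). The object name `sitecomStreaming` is taken from the stack trace; the `2g` value and the rest are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object sitecomStreaming {
  def main(args: Array[String]): Unit = {
    // The driver JVM is the local IntelliJ process, so its heap must already
    // meet Spark's ~450 MB minimum before SparkContext is created. The most
    // reliable fix in an IDE is adding "-Xmx2g" to the run configuration's
    // VM options; spark.driver.memory set here only helps if the JVM was
    // launched with enough heap (it cannot grow an already-started JVM).
    val spark = SparkSession.builder()
      .appName("sitecomStreaming")
      .config("spark.driver.memory", "2g")
      .getOrCreate()

    // ... job logic goes here; only the executors run on the Databricks cluster ...

    spark.stop()
  }
}
```

Equivalently, when submitting outside the IDE, `--driver-memory 2g` on `spark-submit` sets the same limit at JVM launch time.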

P.S. If you're using structured streaming, then yes - that's a known limitation of databricks-connect.

Alex Ott