
I am using Spark to join a static dataset that I fetch from Azure Storage with a streaming dataset that I get from Event Hubs. I have not used a broadcast join anywhere, and when I run df.explain() after the join it shows that a sort-merge join is happening, so I am not sure why I am getting an error related to Broadcast Hash Join.

java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value
...
...
Exception in thread "spark-listener-group-shared" java.lang.OutOfMemoryError: Java heap space
...
...

Does Spark broadcast everything it gets from Event Hubs?
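For reference, this is roughly how the workaround from the error message could be applied when building the session. The SparkSession builder and app name here are assumptions about how the session is created; spark.driver.memory normally has to be set before the driver JVM starts (e.g. spark-submit --driver-memory 8g), so only the broadcast threshold is shown programmatically.

import org.apache.spark.sql.SparkSession

// Sketch of the workaround suggested in the error message (assumed setup).
// Setting the threshold to -1 disables automatic broadcast joins entirely.
val spark = SparkSession
  .builder()
  .appName("eventhub-join")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()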

This is what my program looks like:

//##read stream from event hub

process(stream)

def process(stream: DataFrame): Unit = {

  val firstDataSet = getFirstDataSet()
  firstDataSet.persist()

  val joined = stream.join(
    firstDataSet,
    stream("joinId") === firstDataSet("joinId")
  )

  //##write joined to event hub
}

def getFirstDataSet(): DataFrame = {
  //##read first from azure storage

  val firstDataSet = first.filter(
    condition1 &&
    condition2
  )

  firstDataSet
}

Update: It looks like this is a JVM out-of-memory error and is not related to broadcast. https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-24912

I checked the driver and executor heap usage after GC using gceasy.io. The executor heap usage after GC looked fine, but the driver memory consumption after GC is constantly increasing.

I analyzed the heap dump; these are the top 15 entries at the time of the OutOfMemoryError: (heap dump screenshot)

It looks like char[] arrays are accumulating on the driver, but I am not sure what is causing them to accumulate.


1 Answer


I ran into this today, and after analyzing a heap dump I found that it was due to the Spark UI holding on to a lot of execution plans in memory: (heap dump screenshot)

Lowering all settings with "retained" in the name solved this issue: https://spark.apache.org/docs/latest/configuration.html#spark-ui

Or, if you don't need it, you can disable the Spark UI entirely.
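A rough sketch of what that could look like, assuming you build the SparkSession yourself; the values are just examples to show the idea, and the config names are the documented "retained" settings from the page linked above.

import org.apache.spark.sql.SparkSession

// Example values only; lower numbers mean fewer UI objects retained in driver memory.
val spark = SparkSession
  .builder()
  .appName("eventhub-join")
  .config("spark.ui.retainedJobs", "50")               // default 1000
  .config("spark.ui.retainedStages", "50")             // default 1000
  .config("spark.ui.retainedTasks", "10000")           // default 100000
  .config("spark.sql.ui.retainedExecutions", "50")     // default 1000
  .config("spark.streaming.ui.retainedBatches", "50")  // default 1000
  // .config("spark.ui.enabled", "false")              // or disable the UI entirely
  .getOrCreate()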

M.Vanderlee
  • I think we don't need the UI; I will check that. Are the strings next to char[] the values stored in the char arrays? Can you please share the steps for how to see those values? Does it require JVM licensing? I have used jmap to get the memory dump. – Nagendra Ghimire May 18 '20 at 06:44
  • @NagendraGhimire I used [VisualVM](https://visualvm.github.io/) to load the heap dump and analyze it as seen in my screenshot. It's an open-source project from Oracle. – M.Vanderlee May 19 '20 at 16:33