Spark dataframe Join issue

Question

The code snippet below works fine (read a CSV file, read a Parquet file, and join the two):

import org.apache.spark.sql.functions.{broadcast, col}

// Read the CSV file: three columns, 1 record
val df1 = spark.read.format("csv").load(filePath)

val df2 = spark.read.parquet(inputFilePath)

// Join with the other table: 30 million records, 15 columns in total
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")

It's weird that the code snippet below doesn't work (read from HBase, read a Parquet file, and join the two). The only difference is that df1 is read from HBase.

// Read from HBase: three columns, 1 record
val df1 = ... // read from HBase (code not shown)
// The HBase read works and show displays the one record.
df1.show

val df2 = spark.read.parquet(inputFilePath)

// Join with the other table: 50 million records, 15 columns in total
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")

Error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 56 tasks (1024.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I then set spark.driver.maxResultSize=5g, after which a different error started occurring: a Java heap space error (run at ThreadPoolExecutor.java). When I watch memory usage in the Manager, it keeps climbing until it reaches ~50 GB, at which point the OOM error occurs. So for whatever reason, the amount of RAM used to perform this operation is ~10x greater than the size of the RDD I'm trying to use.

If I persist df1 in memory and on disk and then call count(), the program works fine. The code snippet is below:

import org.apache.spark.storage.StorageLevel

// Read from HBase: three columns, 1 record
val df1 = ... // read from HBase (code not shown)

// The two added lines: persist df1 and materialize it with count()
df1.persist(StorageLevel.MEMORY_AND_DISK)
val cnt = df1.count()

val df2 = spark.read.parquet(inputFilePath)

// Join with the other table: 50 million records, 15 columns in total
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")

It works with the file, which contains the same data, but not with HBase. I am running this on a cluster of 100 worker nodes with 125 GB of memory each, so memory is not the problem.

My question is this: the file and HBase hold the same data, and both are read successfully and can show() the data. So why does only the HBase version fail? I am struggling to understand what might be going wrong with this code. Any suggestions will be appreciated.

score 2 · Answer 1 · answered Mar 11 '19 at 00:03

When the data is extracted, Spark does not know how many rows are retrieved from HBase, so the strategy it opts for would be a sort-merge join.

It therefore tries to sort and shuffle the data across the executors.

To avoid the problem, we can use a broadcast join, so that df2 does not have to be sorted and shuffled on the key column; that is what the last statement in your code snippet does.

However, since df1 has only one row, you can bypass the join altogether and use a CASE expression to fill in the extra columns.

Example:

// hbaseKey and hbaseValueCol1 are the key and value taken from the single HBase row
df.withColumn(
  "newCol",
  when(col("df2col1") === lit(hbaseKey), lit(hbaseValueCol1))
    .otherwise(lit(null))
)
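
The CASE expression idea can be extended to a handful of lookup rows. The following is only a sketch under stated assumptions: the tiny in-memory `df1` and `df2`, and the column names `df1val` and `amount`, are hypothetical stand-ins for the HBase and Parquet tables. It collects the small table to the driver and folds its rows into one chained `when`/`otherwise` expression. Note that this keeps every row of `df2` and assumes the keys in `df1` are unique, so it is not a drop-in replacement for the "right" join in the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("case-expression-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the real inputs.
val df1 = Seq(("k1", "v1")).toDF("df1col1", "df1val")            // small HBase-backed table
val df2 = Seq(("k1", 10), ("k2", 20)).toDF("df2col1", "amount")  // large Parquet table

// Bring the few lookup rows to the driver as plain Scala values.
val lookup: Array[(String, String)] = df1.as[(String, String)].collect()

// Fold the rows into one CASE WHEN chain; keys without a match stay null.
val noMatch: Column = lit(null).cast("string")
val caseExpr: Column = lookup.foldRight(noMatch) { case ((key, value), otherwise) =>
  when(col("df2col1") === key, value).otherwise(otherwise)
}

val result = df2.withColumn("df1val", caseExpr)
result.show()
```

Because the lookup values become literals in the query plan, no join, broadcast, or shuffle of `df1` is involved. This is only reasonable while the lookup table stays very small (the comment below mentions at most about 100 rows).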

Thanks for replying. Just for info: the HBase table might grow in the future, but not beyond 100 rows. I could not follow the part about hbaseKey. Can you please explain the CASE expression you mention in a bit more detail? — Ansip, Mar 11 '19 at 03:14

Raphael Roth · Answer 2 · 2019-03-11T14:53:21.603

I sometimes struggle with this error too. It often occurs when Spark tries to broadcast a large table during a join (which happens when Spark's optimizer underestimates the size of the table, or when the statistics are incorrect). As there is no hint to force a sort-merge join (see "How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?"), the only option is to disable broadcast joins by setting spark.sql.autoBroadcastJoinThreshold=-1
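
As a minimal sketch of that setting, assuming a local session and two hypothetical in-memory tables (`small` and `large`, with made-up columns) in place of the real data, the threshold can be set on the session and the chosen join strategy inspected with `explain()`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-threshold-sketch")
  .getOrCreate()
import spark.implicits._

// Disable size-based automatic broadcast joins for this session.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical stand-ins for the HBase-backed and Parquet-backed tables.
val small = Seq(("k1", "v1")).toDF("df1col1", "df1val")
val large = Seq(("k1", 10), ("k2", 20)).toDF("df2col1", "amount")

// No broadcast() hint here, so the planner picks the strategy itself.
val joined = large.join(small, large("df2col1") === small("df1col1"), "right")

// Print the physical plan to see which join strategy was selected.
joined.explain()
val plan = joined.queryExecution.executedPlan.toString
```

The threshold only governs automatic, size-based broadcasting. An explicit `broadcast(df1)` hint, as used in the question, still requests a broadcast join, so the hint would presumably have to be dropped for this setting to change the plan.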

score 0 · Answer 3 · answered Mar 11 '19 at 11:33

When I have a memory problem during a join, it usually has one of two causes:

1. You have too few partitions in the dataframes (the partitions are too big).
2. There are many duplicates in the two dataframes on the join key, and the join explodes your memory.

Ad 1. I think you should look at the number of partitions you have in each table before the join. When Spark reads a file, it does not necessarily keep the same number of partitions as the original table (Parquet, CSV or other). Reading from CSV and reading from HBase might create different numbers of partitions, and that is why you see differences in performance. Partitions that are too large become even larger after the join, and this creates the memory problem. Have a look at the Peak Execution Memory per task in the Spark UI. This will give you some idea of your memory usage per task. I found it best to keep it below 1 GB.

Solution: Repartition your tables before the join.

Ad 2. This may not be the case here, but it is worth checking.
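
Both checks can be sketched as follows. This is an illustration under assumptions, not a tuned recipe: the in-memory `df2`, its column names, and the target of 8 partitions are hypothetical, and a suitable partition count depends on the data size and the cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("repartition-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the large table: 1000 rows over 10 distinct keys.
val df2 = (1 to 1000).map(i => (s"k${i % 10}", i)).toDF("df2col1", "amount")

// Ad 1: inspect how many partitions the dataframe has before the join.
println(s"partitions before: ${df2.rdd.getNumPartitions}")

// Repartition on the join key; 8 is an arbitrary illustrative target.
val targetPartitions = 8
val repartitioned = df2.repartition(targetPartitions, col("df2col1"))
println(s"partitions after: ${repartitioned.rdd.getNumPartitions}")

// Ad 2: list join keys that occur more than once.
val duplicateKeys = df2.groupBy("df2col1").count().filter(col("count") > 1)
duplicateKeys.show()
```

Running the duplicate check on both sides of the join shows whether a key appears many times in each table, which is the situation where the join output grows multiplicatively.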

Why should number of partitions matter in a Broadcast join? And what should that have to do with driver memory? — Raphael Roth, Mar 11 '19 at 14:52
