I'm trying to join two large Spark dataframes using Scala and I can't get it to perform well. I really hope someone can help me.
I have the following two text files:
dfPerson.txt (PersonId: String, GroupId: String) 2 million rows (100MB)
dfWorld.txt (PersonId: String, GroupId: String, PersonCharacteristic: String) 30 billion rows (1TB)
First I parse the text files to parquet and partition on GroupId, which has 50 distinct values plus a rest group.
val dfPerson = spark.read.csv("input/dfPerson.txt")
  .toDF("PersonId", "GroupId") // the raw file has no header row, so name the columns explicitly
dfPerson.write.partitionBy("GroupId").parquet("output/dfPerson")

val dfWorld = spark.read.csv("input/dfWorld.txt")
  .toDF("PersonId", "GroupId", "PersonCharacteristic")
dfWorld.write.partitionBy("GroupId").parquet("output/dfWorld")
Note: a GroupId can contain anywhere from 1 PersonId up to 6 billion PersonIds, so the data is heavily skewed; GroupId might not be the best partition column, but it is all I could think of.
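For context, a quick way to see the skew (a minimal sketch using the dfWorld parsed above):

import org.apache.spark.sql.functions.desc

// Count rows per GroupId; the largest group holds billions of rows
// while others hold only a handful, so the partitions are very uneven.
dfWorld
  .groupBy("GroupId")
  .count()
  .orderBy(desc("count"))
  .show(51, false)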
Next I read the parquet files back and join them. I took the following approaches:
Approach 1: Basic Spark join operation
import spark.implicits._ // for the $"..." column syntax

val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")

val joined = dfWorld.as("w")
  .join(
    dfPerson.as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")
This, however, didn't perform well: it shuffled over 1 TB of data.
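To confirm where the shuffle comes from, I print the physical plan of the join result (joined is the val from the snippet above):

// An Exchange node over dfWorld in this plan means the full ~1 TB
// gets shuffled, which matches what the Spark UI shows.
joined.explain()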
Approach 2: Broadcast hash join
Since dfPerson (roughly 100 MB) might be small enough to hold in memory, I thought this approach might solve my problem:
import org.apache.spark.sql.functions.broadcast

val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")

val joinedBroadcast = dfWorld.as("w")
  .join(
    broadcast(dfPerson).as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")
This didn't perform well either and again shuffled over 1 TB of data, which makes me believe the broadcast didn't work?
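To check whether the broadcast was actually applied, the plan can be inspected and the automatic broadcast threshold raised; a sketch, with the 200 MB value picked arbitrarily on my side:

// If the hint took effect the plan shows BroadcastHashJoin;
// if it still shows SortMergeJoin, the broadcast was ignored.
joinedBroadcast.explain()

// The auto-broadcast threshold defaults to 10 MB, well below
// dfPerson's ~100 MB, so raise it explicitly (value in bytes).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200L * 1024 * 1024)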
Approach 3: Bucket and sort the dataframes
Here I bucket and sort the dataframes before writing to parquet, and then join:
val dfPersonInput = spark.read.csv("input/dfPerson.txt")
  .toDF("PersonId", "GroupId") // name the columns explicitly, as above

dfPersonInput
  .write
  .format("parquet")
  .partitionBy("GroupId")
  .bucketBy(4, "PersonId")
  .sortBy("PersonId")
  .mode("overwrite")
  .option("path", "output/dfPerson")
  .saveAsTable("dfPerson")

val dfPerson = spark.table("dfPerson")

val dfWorldInput = spark.read.csv("input/dfWorld.txt")
  .toDF("PersonId", "GroupId", "PersonCharacteristic")

dfWorldInput
  .write
  .format("parquet")
  .partitionBy("GroupId")
  .bucketBy(4, "PersonId")
  .sortBy("PersonId")
  .mode("overwrite")
  .option("path", "output/dfWorld")
  .saveAsTable("dfWorld")

val dfWorld = spark.table("dfWorld")
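To double-check that the bucketing and sorting metadata made it into the catalog, the table details can be inspected (a sketch; the 100-row limit is arbitrary):

// "Num Buckets", "Bucket Columns" and "Sort Columns" should show up
// in the detailed table information if bucketing was recorded.
spark.sql("DESCRIBE FORMATTED dfPerson").show(100, false)
spark.sql("DESCRIBE FORMATTED dfWorld").show(100, false)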
dfWorld.as("w").join(
dfPerson.as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
With the following execution plan:
== Physical Plan ==
*(5) Project [PersonId#743]
+- SortMergeJoin [GroupId#73, PersonId#71], [GroupId#745, PersonId#743], RightOuter
:- *(2) Sort [GroupId#73 ASC NULLS FIRST, PersonId#71 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(GroupId#73, PersonId#71, 200)
: +- *(1) Project [PersonId#71, PersonCharacteristic#72, GroupId#73]
: +- *(1) Filter isnotnull(PersonId#71)
: +- *(1) FileScan parquet default.dfWorld[PersonId#71,PersonCharacteristic#72,GroupId#73] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/F:/Output/dfWorld..., PartitionCount: 52, PartitionFilters: [isnotnull(GroupId#73)], PushedFilters: [IsNotNull(PersonId)], ReadSchema: struct<PersonId:string,PersonCharacteristic:string>, SelectedBucketsCount: 4 out of 4
+- *(4) Sort [GroupId#745 ASC NULLS FIRST, PersonId#743 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(GroupId#745, PersonId#743, 200)
+- *(3) FileScan parquet default.dfPerson[PersonId#743,GroupId#745] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/F:/Output/dfPerson], PartitionCount: 45, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<PersonId:string,GroupId:string>, SelectedBucketsCount: 4 out of 4
This didn't perform well either; notably, the plan still contains an Exchange (shuffle) on both sides of the join, even though both tables are bucketed.
To conclude
All approaches take approximately 150-200 hours (extrapolated from the progress of stages and tasks in the Spark UI after 24 hours), and as far as I can tell they all end up executing essentially the same shuffle-heavy strategy.
I guess there is something I'm missing with the partitioning, the bucketing, the sorting, or all of the above.
Any help would be greatly appreciated.