
I'm trying to join two large Spark dataframes using Scala and I can't get it to perform well. I really hope someone can help me.

I have the following two text files:

dfPerson.txt (PersonId: String, GroupId: String), 2 million rows (100 MB)
dfWorld.txt (PersonId: String, GroupId: String, PersonCharacteristic: String), 30 billion rows (1 TB)

First I parse the text files to parquet and partition on GroupId, which has 50 distinct values and a rest group.

val dfPerson = spark.read.csv("input/dfPerson.txt")
  .toDF("PersonId", "GroupId")  // name the columns so partitionBy("GroupId") resolves
dfPerson.write.partitionBy("GroupId").parquet("output/dfPerson")

val dfWorld = spark.read.csv("input/dfWorld.txt")
  .toDF("PersonId", "GroupId", "PersonCharacteristic")
dfWorld.write.partitionBy("GroupId").parquet("output/dfWorld")

Note: a GroupId can contain anywhere from 1 PersonId up to 6 billion PersonIds, so the data is heavily skewed; it might not be the best partition column, but it is all I could think of.
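
For what it's worth, one way to quantify that skew before settling on the partition column (a sketch against the parsed DataFrames above, assuming the columns are named as described):

import org.apache.spark.sql.functions.desc

// Count rows per GroupId in the large table; with only ~51 distinct values
// the aggregated result is tiny, even though computing it scans the full table.
dfWorld
  .groupBy("GroupId")
  .count()
  .orderBy(desc("count"))
  .show(51, truncate = false)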

Next I read the parquet files back and join them. I took the following approaches:

Approach 1: Basic spark join operation

val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
    dfPerson.as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")

This, however, didn't perform well and shuffled over 1 TB of data.

Approach 2: Broadcast hash join

Since dfPerson might be small enough to hold in memory, I thought this approach might solve my problem:

val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
    broadcast(dfPerson).as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")

This also didn't perform well and again shuffled over 1 TB of data, which makes me believe the broadcast didn't work.
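
One way to check whether the hint actually took effect is to look at the physical plan: a successful broadcast shows a BroadcastHashJoin, while a fallback shows a SortMergeJoin with an Exchange on both sides. A sketch of that check:

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// If the broadcast was applied, the plan contains BroadcastHashJoin;
// otherwise it falls back to SortMergeJoin + Exchange (i.e. a full shuffle).
val dfJoined = dfWorld.as("w").join(
    broadcast(dfPerson).as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
dfJoined.explain()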

Approach 3: Bucket and sort the dataframe

Here I first bucket and sort the dataframes before writing to parquet, and then join:

val dfPersonInput = spark.read.csv("input/dfPerson.txt")
  .toDF("PersonId", "GroupId")
dfPersonInput
  .write
  .format("parquet")
  .partitionBy("GroupId")
  .bucketBy(4,"PersonId")
  .sortBy("PersonId")
  .mode("overwrite")
  .option("path", "output/dfPerson")
  .saveAsTable("dfPerson")
val dfPerson = spark.table("dfPerson")

val dfWorldInput = spark.read.csv("input/dfWorld.txt")
  .toDF("PersonId", "GroupId", "PersonCharacteristic")
dfWorldInput
  .write
  .format("parquet")
  .partitionBy("GroupId")
  .bucketBy(4,"PersonId")
  .sortBy("PersonId")
  .mode("overwrite")
  .option("path", "output/dfWorld")
  .saveAsTable("dfWorld")
val dfWorld = spark.table("dfWorld")

dfWorld.as("w").join(
    dfPerson.as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")

With the following execution plan:

== Physical Plan ==
*(5) Project [PersonId#743]
+- SortMergeJoin [GroupId#73, PersonId#71], [GroupId#745, PersonId#743], RightOuter
   :- *(2) Sort [GroupId#73 ASC NULLS FIRST, PersonId#71 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(GroupId#73, PersonId#71, 200)
   :     +- *(1) Project [PersonId#71, PersonCharacteristic#72, GroupId#73]
   :        +- *(1) Filter isnotnull(PersonId#71)
   :           +- *(1) FileScan parquet default.dfWorld[PersonId#71,PersonCharacteristic#72,GroupId#73] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/F:/Output/dfWorld..., PartitionCount: 52, PartitionFilters: [isnotnull(GroupId#73)], PushedFilters: [IsNotNull(PersonId)], ReadSchema: struct<PersonId:string,PersonCharacteristic:string>, SelectedBucketsCount: 4 out of 4
   +- *(4) Sort [GroupId#745 ASC NULLS FIRST, PersonId#743 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(GroupId#745, PersonId#743, 200)
         +- *(3) FileScan parquet default.dfPerson[PersonId#743,GroupId#745] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/F:/Output/dfPerson], PartitionCount: 45, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<PersonId:string,GroupId:string>, SelectedBucketsCount: 4 out of 4

This didn't perform well either.
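
Looking at the plan above, the Exchange on both sides suggests the buckets aren't actually being used by the join. As far as I understand, a sort-merge join can only skip the shuffle when both tables are bucketed (and ideally sorted) on exactly the join keys, with the same bucket count on both sides. A variant I could try (the table names, the bucket count of 200, and dropping partitionBy are all just for illustration, reusing dfPersonInput/dfWorldInput from above):

// Bucket both tables on the exact join keys (GroupId, PersonId) with the same
// bucket count, so the sort-merge join can read the buckets directly instead
// of inserting an Exchange.
dfWorldInput
  .write
  .format("parquet")
  .bucketBy(200, "GroupId", "PersonId")
  .sortBy("GroupId", "PersonId")
  .mode("overwrite")
  .option("path", "output/dfWorldBucketed")
  .saveAsTable("dfWorldBucketed")

dfPersonInput
  .write
  .format("parquet")
  .bucketBy(200, "GroupId", "PersonId")
  .sortBy("GroupId", "PersonId")
  .mode("overwrite")
  .option("path", "output/dfPersonBucketed")
  .saveAsTable("dfPersonBucketed")

spark.table("dfWorldBucketed").as("w").join(
    spark.table("dfPersonBucketed").as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")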

To conclude

All approaches take approximately 150-200 hours (extrapolated from the progress of the stages and tasks in the Spark jobs after 24 hours) and follow the same strategy:

[DAG visualization]

I guess there is something I'm missing with the partitioning, the bucketing, the sorting of the parquet files, or all of them.

Any help would be greatly appreciated.

  • Hi, when I have such problems I use spark.conf.set("spark.sql.shuffle.partitions", 3200) and spark.conf.set("spark.default.parallelism", 3200), and after that .repartition($"w.GroupId", $"w.PersonId") and .sortWithinPartitions on the input dataframes. Have you checked whether your tasks are hanging or stuck on the last task? – M. Alexandru May 18 '20 at 18:52
  • Hi @M.Alexandru, thank you for your quick response. I will add those to my spark configuration. If I understand correctly you say I should do `val dfWorld = spark.read.parquet("output/dfWorld").repartition($"GroupId", $"PersonId").sortWithinPartitions("PersonId")`, or should I do it when writing: `dfWorld.repartition($"GroupId", $"PersonId").sortWithinPartitions("PersonId").write.parquet("output/dfWorld")`? My tasks are not hanging. They just ran out of disk space when the data flushed to disk went over 1 TB, but I have increased my disk to 2 TB now. – Bob Grijpstra May 18 '20 at 19:13
  • I'm using the first one. I think telling it to repartition by some keys, even if the parquet files are already partitioned by the same columns, gives it a hint about where to get the information from. What Spark version and what executor/driver configuration are you using? – M. Alexandru May 19 '20 at 12:40
  • Ok, I will try it upon read. I'm using Spark 2.4.5 with `val spark = SparkSession.builder.appName("console").master("local[*]").config("spark.sql.shuffle.partitions", 3200).config("spark.default.parallelism", 3200).getOrCreate()` – Bob Grijpstra May 19 '20 at 12:57
  • Executor memory, number of executors, number of cores? – M. Alexandru May 19 '20 at 13:07
  • I'm using an Azure VM D8s_v3 with 8 vCPUs and 32 GB of RAM, and I run the application with -Xmx30G. I didn't touch the executor memory, number of executors, or number of cores for the Spark session, so I guess they are at their defaults. – Bob Grijpstra May 19 '20 at 13:31
  • I've never used Azure (I'm on bare metal), but I think you need to specify the number of executors and executor cores: spark.conf.set("spark.executor.instances", 1), spark.conf.set("spark.executor.cores", 7), spark.conf.set("spark.executor.memory", "28g"). One core is for the driver. If you are using multiple instances you can increase the number of executors. – M. Alexandru May 19 '20 at 13:42
  • Do you really run this kind of code in production with local[*]? That runs your spark-submit on a single machine, so your cluster won't be fully used. You don't have to set the master when running Spark in cluster mode. – gccodec May 20 '20 at 12:46
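
Pulling the configuration and repartition suggestions from the comments together, a rough sketch of what they would look like in code (the 3200 value comes from the comments; whether it helps depends on the cluster):

// More shuffle partitions plus an explicit repartition + sort on the join keys
// before joining; spark.default.parallelism is best set via the SparkSession
// builder or spark-submit rather than at runtime.
spark.conf.set("spark.sql.shuffle.partitions", 3200)

val dfWorld = spark.read.parquet("output/dfWorld")
  .repartition($"GroupId", $"PersonId")
  .sortWithinPartitions("PersonId")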

1 Answer


What is the goal you're trying to achieve? Why do you need to have it joined?

A join for the sake of a join will take you nowhere unless you have enough memory/disk space to hold the result of joining 1 TB with 100 MB of data.

Edited based on response

If you only need records for persons that are present in dfPerson, then you don't need a right/left join; an inner join is what you want.

The broadcast will only work if your DataFrame fits within the broadcast threshold configured in Spark (10 MB by default); otherwise it is ignored.
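
For reference, a sketch of how to check that threshold and, if dfPerson really does fit in memory, raise it (the 200 MB value below is just an example):

// Current threshold in bytes (default is 10485760, i.e. 10 MB)
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

// Raise it above the on-disk size of dfPerson (~100 MB according to the question)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200L * 1024 * 1024)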

dfPerson.as("p").join(
    dfWorld.select(
        $"GroupId", $"PersonId", 
        $"<feature1YouNeed>", $"<feature2YouNeed>" 
    ).as("w"), 
    Seq("GroupId", "PersonId")
)

This should give you the features you're after.

NB: Replace <feature1YouNeed> and <feature2YouNeed> with the actual column names.
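
If dfPerson does fit within that (possibly raised) threshold, the same inner join can also be written with an explicit broadcast hint; a sketch, with <feature1YouNeed>/<feature2YouNeed> again standing in for real column names:

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Broadcast the small person table so the 1 TB world table is never shuffled;
// the inner join keeps only persons present in dfPerson.
val result = broadcast(dfPerson).as("p").join(
    dfWorld.select($"GroupId", $"PersonId", $"<feature1YouNeed>", $"<feature2YouNeed>").as("w"),
    Seq("GroupId", "PersonId")
  )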

ZakukaZ