
I have two large Spark DataFrames. I joined them on one common column:

df_joined = df1.join(df2.select("id",'label'), "id")

I got the result, but when I try to work with df_joined it is far too slow. As far as I know, we need to repartition df1 and df2 to prevent df_joined from ending up with a large number of partitions, so I even changed the number of partitions:

df1r = df1.repartition(1)
df2r = df2.repartition(1)
df_joined = df1r.join(df2r.select("id",'label'), "id")

It is still not working. Any ideas?

Saeid SOHEILY KHAH
  • Have you checked whether you computed a cross product by accident? Is the size of df_joined in its persisted state (as text file, Parquet, ORC, whatever) anywhere close to what you would expect? – Elmar Macek Sep 24 '18 at 07:52
  • The problem is that I cannot even save it as a Parquet file; it takes so long that I cancelled it. – Saeid SOHEILY KHAH Sep 25 '18 at 08:12
  • Try the following: select 3 different rows from df1 (with 3 different ids) and the corresponding join partners, plus 2 or 3 random tuples from df2 without a join partner. Join those and then see if the result is really what you would expect (a sketch of this check follows below). – Elmar Macek Sep 25 '18 at 08:37
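A minimal sketch of the check suggested in the comments, assuming the join key is "id" and the frames are df1 and df2 as in the question (column and variable names are taken from the question, not confirmed elsewhere):

from pyspark.sql import functions as F

# Count rows per join key on each side; keys duplicated on BOTH sides
# multiply in the join and can blow up the result (an accidental cross product).
dup1 = df1.groupBy("id").count().filter(F.col("count") > 1)
dup2 = df2.groupBy("id").count().filter(F.col("count") > 1)
print(dup1.count(), dup2.count())

# Spot-check a handful of keys: join only those rows and compare the
# output row count with what you expect.
sample_ids = [r["id"] for r in df1.select("id").distinct().limit(3).collect()]
small = (df1.filter(F.col("id").isin(sample_ids))
            .join(df2.select("id", "label")
                     .filter(F.col("id").isin(sample_ids)), "id"))
small.show()
print(small.count())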

1 Answer


Spark runs 1 concurrent task for every partition of an RDD / DataFrame (up to the number of cores in the cluster).

If your cluster has 20 cores, you should have at least 20 partitions (in practice 2-3x more). On the other hand, a single partition typically shouldn't contain more than about 128 MB of data.
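As a rough illustration (a sketch, not part of the original answer: spark is assumed to be your SparkSession, and estimated_size_bytes is a placeholder you supply for the approximate size of df1, not a Spark API), you can compare the current partitioning against those rules of thumb:

cores = spark.sparkContext.defaultParallelism
print("cores:", cores)
print("df1 partitions:", df1.rdd.getNumPartitions())
print("df2 partitions:", df2.rdd.getNumPartitions())

# Rule of thumb from above: a few tasks per core, and no partition
# much bigger than ~128 MB. estimated_size_bytes is a placeholder value.
n = max(3 * cores, estimated_size_bytes // (128 * 1024 * 1024))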

So, instead of the two lines below, which repartition your DataFrames into a single partition:

df1r = df1.repartition(1)
df2r = df2.repartition(1)

repartition your data on the 'id' column (the joining key) into n partitions, where n depends on the data size and the number of cores in the cluster:

df1r = df1.repartition(n, "id")
df2r = df2.repartition(n, "id")
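With df1r and df2r partitioned on the join key, the join below (a sketch; getNumPartitions is only used to confirm how the data ended up split) can typically reuse that partitioning instead of shuffling both sides again:

df_joined = df1r.join(df2r.select("id", "label"), "id")

# Both inputs are hash-partitioned on "id" into n partitions, so matching
# keys land in the same partition and df_joined keeps n partitions.
print(df1r.rdd.getNumPartitions(),
      df2r.rdd.getNumPartitions(),
      df_joined.rdd.getNumPartitions())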
Lakshman Battini