
I have two very large tables that I am loading as DataFrames from Parquet, with a single join key. The issues I need help with:

  1. I need to tune the job, as I am getting OOM errors due to Java heap space.
  2. I have to apply a left join.

There will not be any null values in the join key, so null handling might not improve performance.

What should I do to handle this scenario?

FYI: while loading this Parquet data, I have already applied repartitioning based on a column.

  1. I have loaded both df1 and df2.
  2. When I tried caching, it failed; but since the data needs to be used multiple times, caching is required, and persisting is not an option.
  3. I applied repartitioning on both DataFrames to distribute the data evenly (see the sketch below).
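
For concreteness, here is a minimal PySpark sketch of the pipeline described above; the input/output paths and the `join_key` column name are placeholders, not from the question. One detail worth noting: for DataFrames, `cache()` is just shorthand for `persist()` with the `MEMORY_AND_DISK` storage level, so if `cache()` fails with a heap OOM, the pressure is more likely coming from the join shuffle than from the storage level itself.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-left-join").getOrCreate()

# Placeholder paths; "join_key" stands in for the actual join column.
df1 = spark.read.parquet("/data/table1")
df2 = spark.read.parquet("/data/table2")

# Repartition both sides on the join key so matching keys land in the same
# partitions; this is the repartitioning step described in the question.
df1 = df1.repartition("join_key")
df2 = df2.repartition("join_key")

# Explicit disk-backed persist: partitions that do not fit in the heap are
# spilled to local disk instead of triggering an OOM. For DataFrames this is
# the same level cache() uses by default.
df1.persist(StorageLevel.MEMORY_AND_DISK)

# Left join keeps every row of df1, with nulls for unmatched df2 rows.
joined = df1.join(df2, on="join_key", how="left")
joined.write.mode("overwrite").parquet("/data/joined")
```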
  • Can you add more details like the size of the cluster, the instance types, and the Spark configurations used? It's easier to come up with a solution that way. Cheers – Rohit Anil Nov 23 '22 at 14:24
  • 10 nodes, 100 GB, 3 num-executors. – Red Maple Nov 23 '22 at 19:14
  • Also, this tuning was meant to be at the application level, using techniques like repartitioning, salting, etc. – Red Maple Nov 23 '22 at 19:14
  • Could you share the instance type as well? Also, what was the idea behind choosing 3 executors? – Rohit Anil Nov 23 '22 at 19:20
  • That is the predefined cluster configuration. My question is more about what best we can do at the application level. – Red Maple Nov 24 '22 at 07:33
  • I think the best you can do is Spark executor tuning (see the sizing sketch after this thread). Check out this link https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/ – Rohit Anil Nov 24 '22 at 07:36
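
Following up on that last comment, here is a hypothetical spark-submit sizing for the stated 10-node cluster. The instance type was not given, so all numbers below assume roughly 100 GB of memory and 16 vcores per node; they are illustrative only and follow the sizing approach in the linked AWS post (about 5 cores per executor, one core and some memory reserved per node for OS/daemons).

```bash
# Illustrative only: ~5 cores per executor, ~3 executors per node => ~30
# executors cluster-wide, with heap + overhead fitting within each node's share.
spark-submit \
  --num-executors 30 \
  --executor-cores 5 \
  --executor-memory 25g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.shuffle.partitions=600 \
  your_job.py
```

Raising spark.sql.shuffle.partitions above the default of 200 produces smaller per-task shuffle blocks during the join, which is often the first lever to pull for Java heap space errors at the application level.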

0 Answers