
I have two very large tables that I am loading as DataFrames from Parquet, with a single join key. The issues I need help with:

  1. I need to tune the job, as I am getting OOM errors due to Java heap space.
  2. I have to apply a left join.

There will not be any null values in the join key, so null handling might not improve performance.

What should I do to handle this scenario?

FYI: while loading this Parquet data, I have already applied repartitioning based on a column.

  1. I have loaded both df1 and df2.
  2. When I tried caching, it failed; but since the data needs to be used multiple times, caching is required, and persisting is not an option.
  3. I applied repartitioning on both DataFrames to distribute the data evenly (see the sketch below).
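
For concreteness, here is a minimal PySpark sketch of the pipeline described above; the input/output paths and the `join_key` column name are placeholders, not from the question. One detail worth noting: for DataFrames, `cache()` is just shorthand for `persist()` with the `MEMORY_AND_DISK` storage level, so if `cache()` fails with a heap OOM, the pressure is more likely coming from the join shuffle than from the storage level itself.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-left-join").getOrCreate()

# Placeholder paths; "join_key" stands in for the actual join column.
df1 = spark.read.parquet("/data/table1")
df2 = spark.read.parquet("/data/table2")

# Repartition both sides on the join key so matching keys land in the same
# partitions; this is the repartitioning step described in the question.
df1 = df1.repartition("join_key")
df2 = df2.repartition("join_key")

# Explicit disk-backed persist: partitions that do not fit in the heap are
# spilled to local disk instead of triggering an OOM. For DataFrames this is
# the same level cache() uses by default.
df1.persist(StorageLevel.MEMORY_AND_DISK)

# Left join keeps every row of df1, with nulls for unmatched df2 rows.
joined = df1.join(df2, on="join_key", how="left")
joined.write.mode("overwrite").parquet("/data/joined")
```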
  • Can you add more details like the size of the cluster, the instance types, and the Spark configurations used? It's easier to come up with a solution that way. Cheers – Rohit Anil Nov 23 '22 at 14:24
  • 10 nodes, 100 GB, 3 num-executors. – Red Maple Nov 23 '22 at 19:14
  • Also, this tuning was meant to be at the application level, using techniques like repartitioning, salting, etc. – Red Maple Nov 23 '22 at 19:14
  • Could you share the instance type as well? Also, what was the idea behind choosing 3 executors? – Rohit Anil Nov 23 '22 at 19:20
  • That is the predefined cluster configuration. My question is more about what best we can do at the application level. – Red Maple Nov 24 '22 at 07:33
  • I think the best you can do is Spark executor tuning (see the sizing sketch after this thread). Check out this link https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/ – Rohit Anil Nov 24 '22 at 07:36
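
Following up on that last comment, here is a hypothetical spark-submit sizing for the stated 10-node cluster. The instance type was not given, so all numbers below assume roughly 100 GB of memory and 16 vcores per node; they are illustrative only and follow the sizing approach in the linked AWS post (about 5 cores per executor, one core and some memory reserved per node for OS/daemons).

```bash
# Illustrative only: ~5 cores per executor, ~3 executors per node => ~30
# executors cluster-wide, with heap + overhead fitting within each node's share.
spark-submit \
  --num-executors 30 \
  --executor-cores 5 \
  --executor-memory 25g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.shuffle.partitions=600 \
  your_job.py
```

Raising spark.sql.shuffle.partitions above the default of 200 produces smaller per-task shuffle blocks during the join, which is often the first lever to pull for Java heap space errors at the application level.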

0 Answers