
As the title says, is there a way to estimate how long it will take to write a Spark DataFrame to a file format such as Parquet? I don't like waiting indefinitely: I'm using what should be the best instance for the task, and the write has already been running for over an hour.

If anyone knows a way to optimize this, or to get a good estimate of how long it will take, please post an answer below.

  • You have to profile your job and environment. BigQuery, for instance, has a huge variance in query performance. Single- or multi-user environment? What kind of storage are you using? There are a lot of factors affecting performance. – Molotch Jan 09 '20 at 21:31
  • Single-user environment. I'm using an EC2 X1 memory-optimized instance with over 1 TB of RAM. – Ravaal Jan 09 '20 at 21:41
  • So try running a small subset of the full job, then a larger one, then a larger one still, and see whether the increase in processing time is linear or exponential (see the sketch after these comments). It's also a good idea to log the different stages of your processing. – Molotch Jan 09 '20 at 22:17
  • I just asked another question that's relevant to this one. Please have a look and get back to me here if it's relevant. https://stackoverflow.com/questions/59673909/how-do-i-reduce-a-spark-dataframe-to-the-same-amount-of-rows-for-each-value-in-a – Ravaal Jan 10 '20 at 00:17
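
A minimal sketch of the subset-timing approach suggested in the comments, assuming a PySpark session and placeholder S3 paths (both hypothetical, substitute your own DataFrame and output location):

```python
import time

from pyspark.sql import SparkSession

# Hypothetical session and input path -- replace with your own DataFrame.
spark = SparkSession.builder.appName("write-timing").getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")

# Write increasingly large samples and check whether the elapsed time
# grows roughly linearly with the sample fraction. If it does, the time
# for the full write can be extrapolated from the largest sample.
for fraction in (0.01, 0.05, 0.10):
    sample = df.sample(fraction=fraction, seed=42)
    start = time.time()
    sample.write.mode("overwrite").parquet(f"s3://my-bucket/timing-test/{fraction}")
    print(f"fraction={fraction}: {time.time() - start:.1f}s")
```

Because a write triggers the full upstream lineage, each timed run includes the cost of computing the sample as well as the write itself, so treat any extrapolation as rough; the stage timings in the Spark UI give a finer breakdown.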

0 Answers