
Ok, so, I have Auto Loader working in directory listing mode, because the event-driven (file notification) mode requires more elevated permissions than we can get in LIVE.

So, basically, what the Auto Loader job does is: read Parquet files iteratively from many different folders in the landing zone (many small files), write them into a raw container as Delta Lake with schema inference and evolution, create external tables, and run an OPTIMIZE.
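
For context, a minimal sketch of such a job might look like the following (all paths, container names, and table names here are hypothetical placeholders, and `spark` is the session Databricks provides):

```python
# Minimal sketch of the ingestion described above; all paths and
# names are hypothetical placeholders, not our actual setup.
landing = "abfss://landing@<account>.dfs.core.windows.net/source_a/"
raw_path = "abfss://raw@<account>.dfs.core.windows.net/delta/source_a"
checkpoint = "abfss://raw@<account>.dfs.core.windows.net/_checkpoints/source_a"

query = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    # directory listing mode (the default); notification mode would
    # need extra permissions we can't get in LIVE
    .option("cloudFiles.useNotifications", "false")
    # store the inferred schema and let it evolve as new columns appear
    .option("cloudFiles.schemaLocation", checkpoint)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load(landing)
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint)
    .option("mergeSchema", "true")
    # process whatever is currently in the landing zone, then stop
    .trigger(availableNow=True)
    .start(raw_path)
)
query.awaitTermination()

# expose the Delta files as an external table and compact small files
# (assumes a `raw` database already exists)
spark.sql(f"CREATE TABLE IF NOT EXISTS raw.source_a USING DELTA LOCATION '{raw_path}'")
spark.sql("OPTIMIZE raw.source_a")
```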

That's about it.

My question is: for this workload, what is the ideal node type (worker and driver) for my cluster in Azure? Should it be "Compute Optimized", "Storage Optimized", or "Memory Optimized"?

From this link, I could see that "Compute Optimized" would probably be the best choice, but my job spends most of its time reading landing files (many small files) and writing Delta files, checkpoints, and schemas, so shouldn't "Storage Optimized" be the best fit here?

I plan to try all of them out, but if someone already has pointers, they would be appreciated.

By the way, the storage here is Azure Data Lake Storage Gen2.

Saugat Mukherjee

1 Answer


If you don't do too many complex aggregations, then I would recommend "Compute Optimized" or "General Purpose" nodes for this work - the primary load is anyway reading the data from files, combining it, and writing it to ADLS, so the more CPU power you have, the faster the data processing will be.

Only if you have very many small files (think tens or hundreds of thousands) should you consider a bigger node for the driver, whose role is to identify the new files in storage; see the sketch below.
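
For example, a hypothetical cluster spec with compute-optimized workers and a larger, memory-optimized driver might look like this (node types, counts, and the runtime version are illustrative only, not a sizing recommendation):

```python
# Hypothetical new_cluster spec for the Databricks Clusters/Jobs API;
# all values here are illustrative placeholders.
cluster_spec = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_F8s_v2",         # compute-optimized workers
    "driver_node_type_id": "Standard_E8s_v3",  # bigger, memory-optimized driver for file listing
    "num_workers": 4,
}
```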

Alex Ott
  • Upvoted. This is a very good point. I just noticed that my job had been failing because of memory pressure on the driver. I will try assigning a bigger node to the driver, and I will come back and mark this as the answer if it works. – Saugat Mukherjee Dec 19 '22 at 11:26
  • That helped. The job still failed after a while, but the next time it started, it made progress. Since this is the initial run and we have many existing files, that is OK; going forward it will run frequently, so this shouldn't be a problem. One last question: any pointers on how to run directory listing mode efficiently? I was thinking of cleaning up landing zone files as one option. Any others? There's also a setting to tweak the interval at which it does the full listing again (I read that it does so periodically). – Saugat Mukherjee Dec 20 '22 at 04:48
  • If you don't use notification mode, then it does a full listing each time. Regarding optimization - the usual trick is to apply a retention policy to the landing zone so that older files are removed automatically. Otherwise, I would try to use notification mode - just pre-create all the necessary objects yourself rather than having them created automatically while the pipeline is running. There's a sketch of the listing-related options below. – Alex Ott Dec 20 '22 at 08:03
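
As a hedged illustration of the listing-related knobs mentioned in these comments (option names per the Auto Loader docs at the time; paths and interval values are placeholders, not recommendations):

```python
# Sketch of listing-related Auto Loader options discussed above;
# paths and interval values are placeholders.
stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    # reuse completed listings where files arrive in lexical order,
    # instead of re-listing everything on each run
    .option("cloudFiles.useIncrementalListing", "auto")
    # control how often a guaranteed full (backfill) listing happens
    .option("cloudFiles.backfillInterval", "1 day")
    .option("cloudFiles.schemaLocation", "<checkpoint-path>")
    .load("<landing-path>")
)
```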