OK, so I have Auto Loader working in directory listing mode, because the event-driven (file notification) mode requires more elevated permissions than we can get in LIVE.
Basically, the Auto Loader job reads Parquet files (many small files) iteratively from many different folders in the landing zone, writes them into a raw container as Delta Lake with schema inference and evolution, creates external tables, and runs an OPTIMIZE.
That's about it.
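To make the workload concrete, here is a minimal Python sketch of a job like this. It is an illustration only, not the exact code: the helper names `build_autoloader_options` and `run_ingest`, all paths, and the table name are placeholders, and the option values are assumptions about a typical directory-listing setup.

```python
def build_autoloader_options(schema_location: str) -> dict:
    """Auto Loader reader options for directory listing mode (assumed values)."""
    return {
        "cloudFiles.format": "parquet",
        # "false" keeps Auto Loader in directory listing mode (no file events).
        "cloudFiles.useNotifications": "false",
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
        # New columns are added to the schema; the stream stops when they
        # first appear and picks them up on the next run.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }


def run_ingest(spark, source_path, target_path, checkpoint_path, table_name):
    """Ingest one landing folder into a Delta table, then compact it.

    `spark` is the SparkSession that a Databricks notebook or job provides.
    """
    reader_options = build_autoloader_options(checkpoint_path + "/_schema")

    query = (
        spark.readStream.format("cloudFiles")
        .options(**reader_options)
        .load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")  # let the Delta table schema evolve
        .trigger(availableNow=True)     # process the backlog, then stop
        .start(target_path)
    )
    query.awaitTermination()

    # Register the Delta location as an external table, then compact small files.
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{target_path}'"
    )
    spark.sql(f"OPTIMIZE {table_name}")
```

`run_ingest` would be called once per landing folder, which is the iterative part of the job.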
My question is: for this workload, what would be the ideal node type (worker and driver) for my cluster in Azure? That is, should it be "Compute optimized", "Storage optimized" or "Memory optimized"?
From this link, it looks like "Compute optimized" would probably be the best choice. However, my job spends most of its time reading landing files (many small files) and writing Delta files, checkpoints and schemas, so shouldn't "Storage optimized" be the better fit here?
I plan to try all of them out, but if someone already has pointers, they would be appreciated.
By the way, the storage here is Azure Data Lake Storage Gen2.