I need to download about 2 million gzip files from an FTP server (not SFTP), process them, and store the results (JPEG images) on Google Cloud Storage. I have considered spinning up a Dataproc cluster, pulling the files from FTP, and processing them with Spark, but I'm not sure how well Spark will handle these binary files.
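For context, the per-file step I have in mind is simple: fetch one gzip blob over FTP, decompress it, convert it to JPEG, and upload it. A minimal sketch of the decompress stage (names are hypothetical; the FTP fetch, image conversion, and GCS upload are omitted):

```python
import gzip

def process_blob(gz_bytes: bytes) -> bytes:
    """Decompress one downloaded .gz blob; JPEG conversion would follow."""
    return gzip.decompress(gz_bytes)

# Hypothetical demo with in-memory data instead of a real FTP fetch:
payload = b"raw image bytes"
assert process_blob(gzip.compress(payload)) == payload
```

My concern is how to parallelize ~2 million of these small, independent download-and-process tasks efficiently, since each one is I/O-bound on the FTP side rather than a typical Spark dataset operation.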
Could someone please suggest a better approach?

Thanks!