
I need to download about 2 million gzip-compressed files from an FTP server (not SFTP), process them, and store the results (JPEG images) on Google Cloud Storage. I have considered spinning up a Dataproc cluster, fetching the files from FTP and processing them with Spark, but I am not sure how well Spark will handle these binary files.
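
For concreteness, here is roughly the kind of job I had in mind (a rough sketch only: the FTP host, credentials and bucket names are placeholders, the actual JPEG conversion is left out, and I am not sure Hadoop's ftp:// filesystem holds up at this scale):

    import java.io.ByteArrayInputStream
    import java.util.zip.GZIPInputStream

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.{SparkConf, SparkContext}

    object GzipToJpeg {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("gzip-to-jpeg"))

        // Each element is (full file path, whole file contents); files are never split.
        sc.binaryFiles("ftp://user:password@ftp.example.com/incoming/*.gz")
          .foreach { case (inputPath, contents) =>
            // Gunzip the file in memory.
            val gz = new GZIPInputStream(new ByteArrayInputStream(contents.toArray))
            val decompressed =
              Stream.continually(gz.read()).takeWhile(_ != -1).map(_.toByte).toArray
            gz.close()

            // Placeholder: the real work (turning `decompressed` into JPEG bytes)
            // goes here and is application-specific.
            val jpegBytes: Array[Byte] = decompressed

            // Write the result to GCS through the Hadoop FileSystem API;
            // on Dataproc the GCS connector handles gs:// paths.
            val fileName = inputPath.split('/').last.stripSuffix(".gz") + ".jpg"
            val outPath  = new Path(s"gs://my-output-bucket/jpeg/$fileName")
            val out      = outPath.getFileSystem(new Configuration()).create(outPath, true)
            out.write(jpegBytes)
            out.close()
          }

        sc.stop()
      }
    }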

Could someone please suggest a better approach?

Thanks

AlonS
  • How large / small are the gzipped files? Spark has a built-in SparkContext.binaryFiles method, but it is best for smaller files. – Angus Davis Mar 27 '17 at 22:03
  • @AngusDavis files are about 100KB to 10MB. Is that considered small/big? – AlonS Mar 28 '17 at 06:44
  • 1
    10MB is probably getting into 'not quite small' territory, but it's still worth a shot to see what happens (I could see there being excess shuffle / poorly balanced partitions, depending on how things shake out, but should be fairly trivial run a test). Something to consider is breaking this into two steps: step 1) distcp from ftp to GCS and step 2) process files from GCS. It might make experimentation less expensive by reducing FTP operations. – Angus Davis Mar 28 '17 at 20:16
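
A rough sketch of step 2 under that split (the bucket name, partition count and FTP URI are placeholders; step 1 would be something like hadoop distcp ftp://user:password@ftp.example.com/incoming gs://my-staging-bucket/raw/ run against the Dataproc cluster, assuming the usual GCS connector is installed):

    import org.apache.spark.{SparkConf, SparkContext}

    object ProcessStagedFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("process-staged-gzips"))

        // Step 2 only: the .gz files are already staged in GCS by distcp, so the
        // Spark job never talks to the FTP server. Raising minPartitions helps
        // spread ~2M small files more evenly across the executors.
        sc.binaryFiles("gs://my-staging-bucket/raw/*.gz", minPartitions = 1000)
          .foreach { case (path, contents) =>
            // ... gunzip, convert to JPEG and write to gs://, as in the sketch above ...
          }

        sc.stop()
      }
    }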

0 Answers