
I need to download about 2 million gzip-compressed files from an FTP server (not SFTP), process them, and store the results (JPEG images) on Google Cloud Storage. I have considered spinning up a Dataproc cluster, fetching the files from FTP and processing them with Spark, but I am not sure how well Spark will handle these binary files.
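
For concreteness, here is roughly the kind of job I had in mind (a rough sketch only: the FTP host, credentials and bucket names are placeholders, the actual JPEG conversion is left out, and I am not sure Hadoop's ftp:// filesystem holds up at this scale):

    import java.io.ByteArrayInputStream
    import java.util.zip.GZIPInputStream

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.{SparkConf, SparkContext}

    object GzipToJpeg {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("gzip-to-jpeg"))

        // Each element is (full file path, whole file contents); files are never split.
        sc.binaryFiles("ftp://user:password@ftp.example.com/incoming/*.gz")
          .foreach { case (inputPath, contents) =>
            // Gunzip the file in memory.
            val gz = new GZIPInputStream(new ByteArrayInputStream(contents.toArray))
            val decompressed =
              Stream.continually(gz.read()).takeWhile(_ != -1).map(_.toByte).toArray
            gz.close()

            // Placeholder: the real work (turning `decompressed` into JPEG bytes)
            // goes here and is application-specific.
            val jpegBytes: Array[Byte] = decompressed

            // Write the result to GCS through the Hadoop FileSystem API;
            // on Dataproc the GCS connector handles gs:// paths.
            val fileName = inputPath.split('/').last.stripSuffix(".gz") + ".jpg"
            val outPath  = new Path(s"gs://my-output-bucket/jpeg/$fileName")
            val out      = outPath.getFileSystem(new Configuration()).create(outPath, true)
            out.write(jpegBytes)
            out.close()
          }

        sc.stop()
      }
    }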

Could someone please suggest a better approach?

Thanks

AlonS
  • How large / small are the gzipped files? Spark has a built-in SparkContext.binaryFiles method, but it is best for smaller files. – Angus Davis Mar 27 '17 at 22:03
  • @AngusDavis files are about 100KB to 10MB. Is that considered small/big? – AlonS Mar 28 '17 at 06:44
  • 1
    10MB is probably getting into 'not quite small' territory, but it's still worth a shot to see what happens (I could see there being excess shuffle / poorly balanced partitions, depending on how things shake out, but should be fairly trivial run a test). Something to consider is breaking this into two steps: step 1) distcp from ftp to GCS and step 2) process files from GCS. It might make experimentation less expensive by reducing FTP operations. – Angus Davis Mar 28 '17 at 20:16
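
A rough sketch of step 2 under that split (the bucket name, partition count and FTP URI are placeholders; step 1 would be something like hadoop distcp ftp://user:password@ftp.example.com/incoming gs://my-staging-bucket/raw/ run against the Dataproc cluster, assuming the usual GCS connector is installed):

    import org.apache.spark.{SparkConf, SparkContext}

    object ProcessStagedFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("process-staged-gzips"))

        // Step 2 only: the .gz files are already staged in GCS by distcp, so the
        // Spark job never talks to the FTP server. Raising minPartitions helps
        // spread ~2M small files more evenly across the executors.
        sc.binaryFiles("gs://my-staging-bucket/raw/*.gz", minPartitions = 1000)
          .foreach { case (path, contents) =>
            // ... gunzip, convert to JPEG and write to gs://, as in the sketch above ...
          }

        sc.stop()
      }
    }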

0 Answers