
I am considering Google Dataflow as an option for running a pipeline that involves steps like:

  1. Downloading images from the web;
  2. Processing images.

I like that Dataflow manages the lifetime of the VMs required to complete the job, so I don't need to start or stop them myself, but all the examples I've come across use it for data-mining-style tasks. I wonder whether it is also a viable option for other batch tasks like image processing and crawling.

kpax
  • Since you have tagged your question with azure-data-factory, you might be interested to learn you can achieve the same with Azure Data Factory. See https://learn.microsoft.com/en-us/azure/data-factory/data-factory-use-custom-activities. Your code can run on VMs (with the lifecycle managed by Azure Batch) or on managed infrastructure using Azure Data Lake Analytics (see https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity). – Alexandre Gattiker Jun 19 '17 at 07:24
  • Thank you so much for your comments. I will try both options. – kpax Jun 20 '17 at 09:24
  • @Pablo you should write this as an answer – Allan Veloso Jul 09 '17 at 19:48

1 Answer


This use case is a possible application for Dataflow/Beam.

If you want to do this in a streaming fashion, you could have a crawler generate URLs and publish them to a Pub/Sub or Kafka topic, and then write a Beam pipeline to do the following (a rough sketch follows the list):

  1. Read the URLs from Pub/Sub
  2. Download the website content in a ParDo
  3. Parse image URLs from the website in another ParDo*
  4. Download each image and process it, again with a ParDo
  5. Store the results in GCS, BigQuery, or another sink, depending on what information you want from the images.
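
Here is that rough sketch, using the Beam Python SDK. Everything project-specific is a placeholder assumption: the Pub/Sub topic, the BigQuery table, the regex-based image extraction, and the trivial "processing" step (which just records the image size).

    import re
    import urllib.request

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    class DownloadPage(beam.DoFn):
        """Step 2: fetch the HTML for each crawled URL."""
        def process(self, url):
            html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "ignore")
            yield (url, html)


    class ExtractImageUrls(beam.DoFn):
        """Step 3: pull <img src=...> links out of the page."""
        def process(self, page):
            _, html = page
            for img_url in re.findall(r'<img[^>]+src="([^"]+)"', html):
                yield img_url


    class DownloadAndProcessImage(beam.DoFn):
        """Step 4: download the image and 'process' it (here, just record its size)."""
        def process(self, img_url):
            image_bytes = urllib.request.urlopen(img_url, timeout=30).read()
            yield {"image_url": img_url, "size_bytes": len(image_bytes)}


    def run():
        options = PipelineOptions(streaming=True)  # streaming mode, since we read from Pub/Sub
        with beam.Pipeline(options=options) as p:
            (p
             # Step 1: read crawler-produced URLs from a (hypothetical) Pub/Sub topic.
             | "ReadUrls" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/crawl-urls")
             | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
             | "DownloadPage" >> beam.ParDo(DownloadPage())
             | "ExtractImageUrls" >> beam.ParDo(ExtractImageUrls())
             | "ProcessImages" >> beam.ParDo(DownloadAndProcessImage())
             # Step 5: write one row per image to a (hypothetical) BigQuery table.
             | "WriteResults" >> beam.io.WriteToBigQuery(
                   "my-project:crawl.image_stats",
                   schema="image_url:STRING,size_bytes:INTEGER"))


    if __name__ == "__main__":
        run()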

You can do the same with a batch job, just changing the source you're reading the URLs from.
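
For example (again only a sketch, reusing the DoFns from the snippet above and a made-up GCS path), the Pub/Sub read can be swapped for a bounded source such as a text file of URLs:

    def run_batch():
        with beam.Pipeline(options=PipelineOptions()) as p:
            (p
             # Bounded source: one URL per line in a (hypothetical) GCS file.
             | "ReadUrls" >> beam.io.ReadFromText("gs://my-bucket/seed-urls.txt")
             | "DownloadPage" >> beam.ParDo(DownloadPage())
             | "ExtractImageUrls" >> beam.ParDo(ExtractImageUrls())
             | "ProcessImages" >> beam.ParDo(DownloadAndProcessImage())
             | "WriteResults" >> beam.io.WriteToBigQuery(
                   "my-project:crawl.image_stats",
                   schema="image_url:STRING,size_bytes:INTEGER"))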

*After parsing those image URLs, you may also want to reshuffle your data, to gain some parallelism.
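
As a standalone toy illustration (the parsing and processing functions below are stand-ins, not the pipeline above), beam.Reshuffle() can be dropped in right after the step that fans one page out into many image URLs, so the downloads get redistributed across workers rather than staying fused to the page fetch:

    import apache_beam as beam

    def fake_parse(page_url):
        # Stand-in for the URL-parsing ParDo: pretend each page yields a few image URLs.
        return [f"{page_url}/img{i}.png" for i in range(3)]

    with beam.Pipeline() as p:
        (p
         | beam.Create(["http://example.com/a", "http://example.com/b"])
         | "ParseImageUrls" >> beam.FlatMap(fake_parse)
         | "BreakFusion" >> beam.Reshuffle()          # redistribute the fanned-out URLs
         | "ProcessImages" >> beam.Map(lambda u: u)   # stand-in for image download/processing
         | "Print" >> beam.Map(print))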

Pablo