
I am considering Google Dataflow as an option for running a pipeline that involves steps like:

  1. Downloading images from the web;
  2. Processing images.

I like that Dataflow manages the lifetime of the VMs required to complete the job, so I don't need to start or stop them myself, but all the examples I've come across use it for data-mining-style tasks. I wonder whether it is also a viable option for other batch tasks like image processing and crawling.

kpax
  • Since you have tagged your question with azure-data-factory, you might be interested to learn you can achieve the same with Azure Data Factory. See https://learn.microsoft.com/en-us/azure/data-factory/data-factory-use-custom-activities. Your code can run on VMs (with the lifecycle managed by Azure Batch) or on managed infrastructure using Azure Data Lake Analytics (see https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity). – Alexandre Gattiker Jun 19 '17 at 07:24
  • Thank you so much for your comments. I will try both options. – kpax Jun 20 '17 at 09:24
  • @Pablo you should write this as an answer – Allan Veloso Jul 09 '17 at 19:48

1 Answer


This use case is a possible application for Dataflow/Beam.

If you want to do this in a streaming fashion, you could have a crawler generate URLs and publish them to a Pub/Sub or Kafka topic, and then write a Beam pipeline to do the following (a rough sketch follows the list):

  1. Read the URLs from Pub/Sub
  2. Download the website content in a ParDo
  3. Parse image URLs from the website in another ParDo*
  4. Download each image and process it, again with a ParDo
  5. Store the results in GCS, BigQuery, or another sink, depending on what information you want from the images.
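
Here is that rough sketch, using the Beam Python SDK. Everything project-specific is a placeholder assumption: the Pub/Sub topic, the BigQuery table, the regex-based image extraction, and the trivial "processing" step (which just records the image size).

    import re
    import urllib.request

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    class DownloadPage(beam.DoFn):
        """Step 2: fetch the HTML for each crawled URL."""
        def process(self, url):
            html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "ignore")
            yield (url, html)


    class ExtractImageUrls(beam.DoFn):
        """Step 3: pull <img src=...> links out of the page."""
        def process(self, page):
            _, html = page
            for img_url in re.findall(r'<img[^>]+src="([^"]+)"', html):
                yield img_url


    class DownloadAndProcessImage(beam.DoFn):
        """Step 4: download the image and 'process' it (here, just record its size)."""
        def process(self, img_url):
            image_bytes = urllib.request.urlopen(img_url, timeout=30).read()
            yield {"image_url": img_url, "size_bytes": len(image_bytes)}


    def run():
        options = PipelineOptions(streaming=True)  # streaming mode, since we read from Pub/Sub
        with beam.Pipeline(options=options) as p:
            (p
             # Step 1: read crawler-produced URLs from a (hypothetical) Pub/Sub topic.
             | "ReadUrls" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/crawl-urls")
             | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
             | "DownloadPage" >> beam.ParDo(DownloadPage())
             | "ExtractImageUrls" >> beam.ParDo(ExtractImageUrls())
             | "ProcessImages" >> beam.ParDo(DownloadAndProcessImage())
             # Step 5: write one row per image to a (hypothetical) BigQuery table.
             | "WriteResults" >> beam.io.WriteToBigQuery(
                   "my-project:crawl.image_stats",
                   schema="image_url:STRING,size_bytes:INTEGER"))


    if __name__ == "__main__":
        run()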

You can do the same with a batch job, just changing the source you're reading the URLs from.
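
For example (again only a sketch, reusing the DoFns from the snippet above and a made-up GCS path), the Pub/Sub read can be swapped for a bounded source such as a text file of URLs:

    def run_batch():
        with beam.Pipeline(options=PipelineOptions()) as p:
            (p
             # Bounded source: one URL per line in a (hypothetical) GCS file.
             | "ReadUrls" >> beam.io.ReadFromText("gs://my-bucket/seed-urls.txt")
             | "DownloadPage" >> beam.ParDo(DownloadPage())
             | "ExtractImageUrls" >> beam.ParDo(ExtractImageUrls())
             | "ProcessImages" >> beam.ParDo(DownloadAndProcessImage())
             | "WriteResults" >> beam.io.WriteToBigQuery(
                   "my-project:crawl.image_stats",
                   schema="image_url:STRING,size_bytes:INTEGER"))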

*After parsing those image URLs, you may also want to reshuffle your data, to gain some parallelism.
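
As a standalone toy illustration (the parsing and processing functions below are stand-ins, not the pipeline above), beam.Reshuffle() can be dropped in right after the step that fans one page out into many image URLs, so the downloads get redistributed across workers rather than staying fused to the page fetch:

    import apache_beam as beam

    def fake_parse(page_url):
        # Stand-in for the URL-parsing ParDo: pretend each page yields a few image URLs.
        return [f"{page_url}/img{i}.png" for i in range(3)]

    with beam.Pipeline() as p:
        (p
         | beam.Create(["http://example.com/a", "http://example.com/b"])
         | "ParseImageUrls" >> beam.FlatMap(fake_parse)
         | "BreakFusion" >> beam.Reshuffle()          # redistribute the fanned-out URLs
         | "ProcessImages" >> beam.Map(lambda u: u)   # stand-in for image download/processing
         | "Print" >> beam.Map(print))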

Pablo