
I have a pipeline that takes URLs for files, downloads them, and generates a BigQuery table row for each line apart from the header.

To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table.

For this to work I need to store the history either in a database that enforces unique values, or, perhaps more simply, in BigQuery itself; in the latter case access to the table must be strictly serial.

Can I enforce single-thread execution (on a single machine) to satisfy this for part of my pipeline only?

(After this point, each of my hundreds of URLs/files would be suitable for processing on a separate thread; each single file gives rise to 10,000 to 10,000,000 rows, so throttling at that earlier point will almost certainly not cause performance issues.)

nsandersen
  • Hi, can you explain why a mono thread will help you? Dataflow is serial in terms of processing operations (the boxes in the graph), but each box can be parallelized over a subset of the data to process. – guillaume blaquiere Aug 05 '19 at 15:51
  • A mono thread would allow me to not include an additional component (for instance another database) for synchronization. – nsandersen Aug 06 '19 at 07:37

1 Answer


Beam is designed for parallel processing of data, and it deliberately stops you from synchronizing or blocking except through a few built-in primitives, such as Combine.

It sounds like what you want is a filter that emits an element (your URL) only the first time it is seen. You can probably use the built-in Distinct transform for this. This operator uses a Combine per-key to group the elements by key (your URL in this case), then emits each key only the first time it is seen.

Andrew Pilloud
  • Distinct sounds like a plausible approach in that case. If I understand you correctly, I would accumulate over one or a few minutes and then fire at the end of each of these windows. Duplicates within each window would be handled by the Distinct/Combine step, and duplicates across (from later) windows I could handle by checking the history table, since the window should give me enough time to update it without needing additional synchronisation. I will try that and mark as the correct answer if successful! – nsandersen Aug 06 '19 at 09:45
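The scheme in this comment (dedup within a window via Distinct, dedup across windows via the history table) can be sketched outside Beam as plain Python. The history "table" below is just an in-memory set standing in for the real BigQuery table, and `process_window` is a hypothetical helper, so the names and structure are illustrative only:

```python
# Sketch of the per-window logic from the comment above: Distinct handles
# duplicates *within* a window; the history table handles duplicates *across*
# windows. The BigQuery history table is simulated here by a plain set.

def process_window(window_urls, history):
    """Return the URLs from one window that have never been seen before,
    and record them in the history so later windows skip them."""
    deduped = set(window_urls)    # within-window dedup (Distinct's job)
    new_urls = deduped - history  # cross-window dedup (history-table check)
    history |= new_urls           # serial update, one window at a time
    return sorted(new_urls)

history = set()
print(process_window(["u1", "u2", "u1"], history))  # "u1" deduped in-window
print(process_window(["u2", "u3"], history))        # "u2" filtered by history
```

Because windows fire one at a time, the history update is naturally serial at this step, which is what makes the approach work without extra locking.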