I have a pipeline that takes URLs of files, downloads each file, and generates a BigQuery table row for every line except the header.
To avoid duplicate downloads, I want to check each URL against a table of previously downloaded ones, and only go ahead and store the URL if it is not already in this "history" table.
For this to work, I need to either store the history in a database that enforces unique values, or use BigQuery for the history as well. The latter might be easier, but then access to the table must be strictly serial.
Can I enforce single-threaded execution (on a single machine) for just this part of my pipeline to satisfy that requirement?
(After this point, each of my hundreds of URLs/files would be suitable for processing on a separate thread; each file gives rise to between 10,000 and 10,000,000 rows, so throttling at that point will almost certainly not cause performance issues.)
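
To make the "database that enforces unique values" option concrete, here is a minimal sketch of the check-and-record step. It is only an illustration under assumptions: it uses Python's built-in `sqlite3` as a stand-in for whatever database would actually hold the history, and the table name `history` and the helpers `open_history` and `claim_url` are hypothetical names, not part of my pipeline or of any BigQuery or Dataflow API.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) the hypothetical history store.

    The PRIMARY KEY on url is what enforces uniqueness.
    """
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS history (url TEXT PRIMARY KEY)")
    return conn


def claim_url(conn, url):
    """Record url in the history table.

    Returns True if the URL was newly recorded (so it should be downloaded),
    False if it was already present (so the download should be skipped).
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO history (url) VALUES (?)", (url,)
        )
    # rowcount is 1 when the row was inserted, 0 when the insert was ignored
    return cur.rowcount == 1


conn = open_history()
urls = [
    "https://example.com/a.csv",
    "https://example.com/b.csv",
    "https://example.com/a.csv",  # duplicate, should be skipped
]
to_download = [u for u in urls if claim_url(conn, u)]
```

In this sketch the check and the insert are a single statement, so the uniqueness constraint decides which caller "wins" for a given URL. That is the property I would lose with a plain BigQuery table, which is why I am asking about serial execution for that step.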