
We have a single streaming event source producing thousands of events per second. Each event is marked with an id identifying which of our tens of thousands of customers it belongs to. We'd like to use this event source to populate a data warehouse (in streaming mode); however, the event source is not persistent, so we'd also like to archive the raw data in GCS so we can replay it through our data warehouse pipeline if we make a change that requires it. Because of data retention requirements, any raw data we persist needs to be partitioned by customer, so that we can easily delete it.

What would be the simplest way to solve this in Dataflow? Currently we're creating a Dataflow job with a custom sink that writes the data to files per customer on GCS/BigQuery; is that sensible?

Narek

1 Answer


To specify the filename and path, see the TextIO documentation: you provide the filename / path etc. to the output writer.

For your use case of multiple output files, you can use the Partition function to create multiple PCollections out of a single source PCollection.
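
As a rough illustration, here is a minimal sketch of that approach with the Dataflow Java SDK 1.x. The shard count, the GCS paths, and the `extractCustomerId` helper are placeholders rather than anything from the question, and the number of partitions has to be fixed when the pipeline graph is built, which is the limitation raised in the comments below. `TextIO.Write` also only handles bounded collections in that SDK, so this sketch is batch-mode only.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Partition;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

public class PartitionedArchive {
  // The shard count must be known at graph construction time.
  static final int NUM_SHARDS = 10;

  // Hypothetical helper: pull the customer id out of a raw "customerId,payload" line.
  static String extractCustomerId(String event) {
    return event.split(",", 2)[0];
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder bounded source standing in for the raw event archive.
    PCollection<String> events = p.apply(TextIO.Read.from("gs://my-bucket/raw-input/*"));

    // Split the single PCollection into a fixed number of shards keyed by customer id.
    PCollectionList<String> shards = events.apply(
        Partition.of(NUM_SHARDS, new Partition.PartitionFn<String>() {
          @Override
          public int partitionFor(String event, int numPartitions) {
            // Mask off the sign bit so the shard index is never negative.
            return (extractCustomerId(event).hashCode() & Integer.MAX_VALUE) % numPartitions;
          }
        }));

    // Write each shard to its own GCS prefix.
    for (int i = 0; i < NUM_SHARDS; i++) {
      shards.get(i).apply(TextIO.Write.to("gs://my-bucket/archive/shard-" + i + "/events"));
    }

    p.run();
  }
}
```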

Sam McVeety
Nick
  • I don't think I see anything on how to partition the output here, am I missing something? – Narek Jan 15 '16 at 21:52
  • You can construct strings for the filename / path using the data you have available. It's all a matter of properly constructing the pipeline so that the data is available when you want to construct the output dir / filename. – Nick Jan 15 '16 at 21:57
  • The issue is that TextIO.Write only takes a PCollection and within one PCollection, I have rows that correspond to hundreds of partitions. – Narek Jan 15 '16 at 22:02
  • You should [split the PCollection up along the partition lines](https://cloud.google.com/dataflow/model/multiple-pcollections#partition)? – Nick Jan 15 '16 at 22:10
  • The number of partitions there needs to be determinable at graph construction time, so that doesn't seem like it would work – bfabry Jan 16 '16 at 06:13
  • Just a followup, I had a few issues with this solution. The partition structure is not known before the query, and this doesn't seem to scale to more than ~10 partitions. My final result needs to scale to tens-of-thousands of output partitions. – Narek Jan 21 '16 at 00:06
  • There are numerous ways to potentially accomplish the end result you're envisioning, including kicking off new pipelines, being more clever about how you partition the data, using various forms of intermediate storage, etc. I think this might be too broad for stackoverflow, or at the very least you should edit your question to be a lot more concrete about your data and your pipeline. – Nick Jan 21 '16 at 22:29
  • @Nick can you look at the last edit to this question? – Narek Feb 01 '16 at 23:05
  • Architecture advice like this is something capable developers should work out by reading the documentation, potentially in consultation with [Cloud Platform Support through a support package](https://cloud.google.com/support/). It seems this is still too broad for Stack Overflow, and the explanation given in the edit is still quite schematic. A custom sink, on cursory examination, seems a good solution, but whether there are others is hard to evaluate, and again this is somewhat too complex for Stack Overflow, which is properly conceived as a database of decisive, clear Q&A. – Nick Feb 01 '16 at 23:15
  • We are not looking for architectural advice, we are looking for how we would do the equivalent of Cascading's template tap in google dataflow. I apologise if giving so much context was confusing, I was hoping the background would be useful. – bfabry Feb 03 '16 at 08:05
  • From looking into Cascading Template Taps, it seems as though you could accomplish the same by running a pipeline which outputs the data to partitions, and then spawn a pipeline for each partition. – Nick May 20 '16 at 19:50
  • Spawning a pipeline per customer does not sound like a reasonable solution. Nor does it resemble template taps. – bfabry May 23 '16 at 20:46
  • From what I've understood, reading the document you sent, it's merely a question of partitioning data. Like the original question, this is very broad and can be satisfied in numerous ways, some of which have been discussed above. The implementation details of your system were not clear enough from the post to determine what specific requirements you have. You've decided to use a Dataflow sink to write to GCS / BigQuery, partitioning by customer ID. This is very reasonable and effective, and I suggest posting it as a self-answer. Cheers! – Nick May 24 '16 at 14:50
  • As a final note: while custom sinks don't work in streaming mode (yet), it's possible to use streaming mode to write to a single table in BigQuery, using the customer ID as a column. Another process could periodically select from this table, insert into the per-customer table, and then delete the rows that were selected. This could be the basis of a workaround for streaming-mode pipelines to still get their data to the same location as batch-mode pipelines. – Nick May 24 '16 at 16:06
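
For what it's worth, a minimal sketch of the streaming half of that workaround with the Dataflow Java SDK 1.x might look like the following. The Pub/Sub topic, the table name, the schema, and the `customerId,payload` line format are assumptions for illustration only; the periodic copy-and-delete step into per-customer tables is not shown.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

import java.util.Arrays;

public class StreamToSingleTable {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    // One wide table holding every customer's events, keyed by an ordinary column.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("customer_id").setType("STRING"),
        new TableFieldSchema().setName("payload").setType("STRING")));

    // Placeholder unbounded source standing in for the event stream.
    PCollection<String> events =
        p.apply(PubsubIO.Read.topic("projects/my-project/topics/events"));

    events
        .apply(ParDo.of(new DoFn<String, TableRow>() {
          @Override
          public void processElement(ProcessContext c) {
            // Hypothetical parsing: assume "customerId,payload" lines.
            String[] parts = c.element().split(",", 2);
            c.output(new TableRow().set("customer_id", parts[0]).set("payload", parts[1]));
          }
        }))
        .apply(BigQueryIO.Write
            .to("my-project:warehouse.all_events")
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```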