
I am looking to stream events from Pub/Sub into BigQuery using Dataflow. I see that there are two templates for doing this in GCP: one where Dataflow reads messages from a topic, and one where it reads from a subscription.

What are the advantages of using a subscription here, rather than just consuming the events from the topic?

Rich Ashworth

2 Answers


Core concepts

  • Topic: A named resource to which messages are sent by publishers.

  • Subscription: A named resource representing the stream of messages from a single, specific topic, to be delivered to the subscribing application.

According to the core concepts, the difference is rather simple:

  • Use a Topic when you would like to publish messages from Dataflow to Pub/Sub (i.e., write to a given topic).

  • Use a Subscription when you would like to consume messages coming from Pub/Sub in Dataflow.

Thus, in your case, go for a subscription.
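The topic/subscription relationship above can be sketched with a small in-memory model (not the real Pub/Sub client; all names here are illustrative): publishers send to a topic, and each subscription attached to that topic receives its own copy of the message stream.

```python
# Toy model of Pub/Sub fan-out: a topic delivers each published
# message to every attached subscription, and each subscription
# tracks its own undelivered backlog independently.

class Subscription:
    def __init__(self, name):
        self.name = name
        self.backlog = []

    def pull(self):
        # Deliver and acknowledge all buffered messages.
        delivered, self.backlog = self.backlog, []
        return delivered


class Topic:
    def __init__(self, name):
        self.name = name
        self.subscriptions = []

    def subscribe(self, sub_name):
        sub = Subscription(sub_name)
        self.subscriptions.append(sub)
        return sub

    def publish(self, message):
        # Fan out: every attached subscription gets the message.
        for sub in self.subscriptions:
            sub.backlog.append(message)


topic = Topic("events")
dataflow_sub = topic.subscribe("dataflow-to-bq")
audit_sub = topic.subscribe("audit")

topic.publish('{"user": "a"}')
topic.publish('{"user": "b"}')

# Both consumers see the full stream, independently.
print(dataflow_sub.pull())
print(audit_sub.pull())
```

A Dataflow pipeline reading "from a subscription" is just one such consumer; other subscriptions on the same topic are unaffected by it.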

More info:

Keep in mind that Pub/Sub manages topics using its own message store. However, the Cloud Pub/Sub Topic to BigQuery template is particularly useful when you would like to move these messages into BigQuery as well (and possibly perform your own analysis there).

The Cloud Pub/Sub Topic to BigQuery template is a streaming pipeline that reads JSON-formatted messages from a Cloud Pub/Sub topic and writes them to a BigQuery table. You can use the template as a quick solution to move Cloud Pub/Sub data to BigQuery. The template reads JSON-formatted messages from Cloud Pub/Sub and converts them to BigQuery elements.

https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#cloudpubsubtobigquery
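For reference, a provided template like this is typically launched with `gcloud dataflow jobs run`. The sketch below is not from the question; the project, region, subscription, and table names are placeholders, and the template path and parameter names should be verified against the documentation linked above before use.

```shell
# Sketch: launch the Subscription-to-BigQuery provided template.
# All resource names below are hypothetical placeholders.
gcloud dataflow jobs run ps-to-bq-job \
    --region=us-central1 \
    --gcs-location=gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
    --parameters=inputSubscription=projects/my-project/subscriptions/my-sub,outputTableSpec=my-project:my_dataset.my_table
```

The topic variant is launched the same way, swapping the template path and passing an `inputTopic` parameter instead of `inputSubscription`.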


Disclaimer: Comments and opinions are my own and not the views of my employer.

vdenotaris
  • Thanks, @vdenotaris. I'm still not sure why there is a template for consuming messages directly from a topic in Dataflow in that case (see https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming)? – Rich Ashworth May 24 '19 at 15:25
  • 1
    @RichAshworth Pub/Sub uses a message store in order to manage topics. However, the template you mentioned can be useful in case you'd like to keep these messages as well in BigQuery for further analysis, for instance even after an ETL job. – vdenotaris May 24 '19 at 15:30

Both the Topic to BigQuery and Subscription to BigQuery templates consume messages from Pub/Sub and stream them into BigQuery.

If you use the Topic to BigQuery template, Dataflow will create a subscription behind the scenes for you that reads from the specified topic. If you use the Subscription to BigQuery template, you will need to provide your own subscription.

You can emulate the behavior of the Topic to BigQuery template with the Subscription to BigQuery template by creating multiple subscriptions on the same topic and running one Subscription to BigQuery pipeline per subscription.

For new deployments, using the Subscription to BigQuery template is preferred. If you stop and restart a pipeline using the Topic to BigQuery template, a new subscription will be created, which may cause you to miss some messages that were published while the pipeline was down. The Subscription to BigQuery template doesn't have this disadvantage, since it uses the same subscription even after the pipeline is restarted.
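The restart behavior described above can be illustrated with a minimal sketch (an in-memory model, not the real service): with the topic template, the auto-created subscription disappears when the pipeline stops, so messages published during the downtime have nowhere to be buffered; a long-lived, user-managed subscription keeps buffering them.

```python
# Toy model contrasting restart behavior of the two templates.
# A "subscription" here is just a backlog list attached to a topic.

class Topic:
    def __init__(self):
        self.subscriptions = []

    def attach(self):
        sub = []
        self.subscriptions.append(sub)
        return sub

    def detach(self, sub):
        self.subscriptions.remove(sub)

    def publish(self, msg):
        # Only currently attached subscriptions buffer the message.
        for sub in self.subscriptions:
            sub.append(msg)


# Topic-template style: a fresh subscription per pipeline run.
topic = Topic()
run1 = topic.attach()       # pipeline starts; subscription auto-created
topic.publish("m1")
topic.detach(run1)          # pipeline stops; its subscription goes away
topic.publish("m2")         # no subscription attached: "m2" is dropped
run2 = topic.attach()       # restart creates a brand-new subscription
topic.publish("m3")
print(run2)                 # only "m3" arrives; "m2" was lost

# Subscription-template style: one durable, user-managed subscription.
topic2 = Topic()
durable = topic2.attach()   # created once, reused across restarts
topic2.publish("m1")
topic2.publish("m2")        # pipeline is down, but the backlog grows
topic2.publish("m3")        # restarted pipeline drains the full backlog
print(durable)
```

The real service adds acknowledgement deadlines and retention limits on top of this, but the core difference, whether a subscription exists to buffer messages during downtime, is the same.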

Lauren