
Our Python Dataflow pipeline works locally but not when deployed with the Dataflow managed service on Google Cloud Platform. It shows no signs of being connected to the Pub/Sub subscription. We have tried reading from both the subscription and the topic; neither worked. The messages accumulate in the Pub/Sub subscription, and the Dataflow pipeline shows no sign of being invoked at all. We have double-checked that the project is the same.

Any direction on this would be very much appreciated.

Here is the code used to connect to the pull subscription:

with beam.Pipeline(options=options) as p:
    something = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
        subscription="projects/PROJECT_ID/subscriptions/cloudflow"
    )
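
(As a debugging sketch, not part of our original code: ReadFromPubSub on its own produces nothing visible, so attaching a simple logging step downstream shows in the Dataflow worker logs whether any elements arrive. The subscription path is the one from the snippet above.)

import logging

import apache_beam as beam

def log_element(element):
    # By default ReadFromPubSub emits bytes (the message payload only).
    logging.info("Received Pub/Sub message: %s", element)
    return element

with beam.Pipeline(options=options) as p:
    something = (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/PROJECT_ID/subscriptions/cloudflow")
        | "LogMessages" >> beam.Map(log_element)
    )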

Here are the options used:

import sys

from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, SetupOptions, StandardOptions)

options = PipelineOptions()
# FileProcessingOptions is a custom PipelineOptions subclass defined elsewhere in our code.
file_processing_options = PipelineOptions().view_as(FileProcessingOptions)
if options.view_as(GoogleCloudOptions).project is None:
    print(sys.argv[0] + ": error: argument --project is required")
    sys.exit(1)
options.view_as(SetupOptions).save_main_session = True
options.view_as(StandardOptions).streaming = True
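
For reference, a sketch of the same options built with explicit keyword arguments; the runner, region and bucket values below are placeholders rather than our real settings. When running on the managed service, the runner, region and a temp_location normally have to be supplied either this way or as command-line flags.

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; substitute your own project, region and bucket.
options = PipelineOptions(
    runner="DataflowRunner",
    project="PROJECT_ID",
    region="us-central1",
    temp_location="gs://YOUR_BUCKET/temp",
    streaming=True,
    save_main_session=True,
)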

The PubSub subscription has this configuration:

Delivery type: Pull
Subscription expiration: Subscription expires in 31 days if there is no activity.
Acknowledgement deadline: 57 seconds
Subscription filter: —
Message retention duration: 7 days
Retained acknowledged messages: No
Dead lettering: Disabled
Retry policy: Retry immediately
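
(Side note, as a sanity-check sketch: the subscription path and project can also be confirmed with the Pub/Sub client library, using the cloudflow subscription name from the code above.)

from google.cloud import pubsub_v1

# Raises NotFound if the subscription does not exist in this project.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("PROJECT_ID", "cloudflow")
subscription = subscriber.get_subscription(subscription=subscription_path)
print(subscription.name, subscription.topic)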

2 Answers


Very late answer, but it may still help someone else. I had the same problem and solved it like this:

  1. Thanks to user Paramnesia1, who wrote this answer, I figured out that I was not seeing all the logs in Logs Explorer. Some default job_name query filters were hiding them from me. I am quoting and clarifying the steps to follow to be able to see all the logs:

Open the Logs tab in the Dataflow Job UI, section Job Logs

Click the "View in Logs Explorer" button

In the new Logs Explorer screen, in your Query window, remove all the existing "logName" filters and keep only resource.type and resource.labels.job_id (see the example query after this list)

  2. Now you will be able to see all the logs and investigate your error further. In my case, I was getting some 'Syncing Pod' errors, which were due to importing the wrong data file in my setup.py.
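
For reference, after removing the logName filters the remaining query in Logs Explorer looks roughly like this (the job_id value is a placeholder):

resource.type="dataflow_step"
resource.labels.job_id="JOB_ID"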

I think for pulling from a subscription we need to pass the with_attributes parameter as True.

with_attributes – True - output elements will be PubsubMessage objects. False - output elements will be of type bytes (message data only).
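
A sketch of the read with that parameter set, reusing the subscription path from the question:

import apache_beam as beam

# With with_attributes=True each element is a PubsubMessage, so the payload
# is available as message.data and the attributes as message.attributes.
messages = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
    subscription="projects/PROJECT_ID/subscriptions/cloudflow",
    with_attributes=True,
)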

I found a similar question here: When using Beam IO ReadFromPubSub module, can you pull messages with attributes in Python? It's unclear if its supported