I've been struggling with this problem for a while and can't quite find a fix. I'm building a pipeline that takes data from a public Google Cloud Storage bucket and does some transformations on it. The thing I'm struggling with right now is getting Apache Beam to receive Pub/Sub messages whenever a file is uploaded to the bucket. There is a public topic for this called projects/gcp-public-data---goes-16/topics/gcp-public-data-goes-16. The bucket is the public GOES-16 bucket: https://console.cloud.google.com/storage/browser/gcp-public-data-goes-16/. I'm particularly interested in the ABI-L1b-RadC folder, so I created my subscription with this:
gcloud beta pubsub subscriptions create goes16-ABI-data-sub-filtered-test --project my-project --topic projects/gcp-public-data---goes-16/topics/gcp-public-data-goes-16 --message-filter='hasPrefix(attributes.objectId,"ABI-L1b-RadC/")' --enable-message-ordering
This works for the most part: I get ABI-L1b-RadC messages, in order, about every 5 minutes. However, I should be getting 16 messages (one for each band) every 5 or so minutes, since that's when files are published to Cloud Storage. Instead, I always get fewer (anywhere from 2 to 13) messages every 5 minutes. At first I thought maybe Cloud Storage was messing something up, so I checked the bucket every 5 minutes, and there are files in Cloud Storage that I have not received messages for in Apache Beam. Here is the code and output I'm using to debug this problem.
import apache_beam as beam
import os
import datetime
import json

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = my_key_path

options = {
    'streaming': True
}

def print_name(message):
    result = json.loads(message)
    file_string = result['name']
    # Band is the "6Cxx" piece between "M6" and "_G16"; key is the start timestamp after "_s".
    band_id = file_string[file_string.find("M6") + 1:file_string.find("_G16")]
    key = file_string[file_string.find("s") + 1:file_string.find("_e")]
    print(f"Message received at : {datetime.datetime.utcnow()} key : {key} band : {band_id}")

runner = 'DirectRunner'
opts = beam.pipeline.PipelineOptions(flags=[], **options)

with beam.Pipeline(runner, options=opts) as p:
    sub_message = (p | 'Sub' >> beam.io.ReadFromPubSub(subscription='my_sub_path'))
    sub_message | 'print name' >> beam.FlatMap(print_name)

    job = p.run()
    if runner == 'DirectRunner':
        job.wait_until_finish()
output:
Message received at : 2020-08-16 23:19:05.360728 key : 20202292316171 band : 6C04
Message received at : 2020-08-16 23:19:18.464376 key : 20202292316171 band : 6C13
Message received at : 2020-08-16 23:19:18.980477 key : 20202292316171 band : 6C14
Message received at : 2020-08-16 23:19:19.972165 key : 20202292316171 band : 6C03
Message received at : 2020-08-16 23:19:21.116554 key : 20202292316171 band : 6C05
Message received at : 2020-08-16 23:24:03.847833 key : 20202292321171 band : 6C04
Message received at : 2020-08-16 23:24:16.814699 key : 20202292321171 band : 6C06
Message received at : 2020-08-16 23:24:17.393739 key : 20202292321171 band : 6C08
Message received at : 2020-08-16 23:29:07.558796 key : 20202292326171 band : 6C04
Message received at : 2020-08-16 23:29:21.100278 key : 20202292326171 band : 6C13
Message received at : 2020-08-16 23:29:21.771230 key : 20202292326171 band : 6C15
Message received at : 2020-08-16 23:34:15.474699 key : 20202292331171 band : 6C15
Message received at : 2020-08-16 23:34:16.006153 key : 20202292331171 band : 6C12
The key is just the start timestamp of the file (a quick check of the parsing is shown right after this paragraph). So, as you can see, I don't receive 16 messages every 5 minutes, even though there are files in the bucket that I never get messages for. I also tried making new subscriptions without --enable-message-ordering and without the hasPrefix filter, but it doesn't change anything. Any help is appreciated.
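To make the key/band parsing concrete, here is a quick check of the same slicing on a sample object name. The name below is made up to follow the RadC naming pattern (only the key and band values are taken from the first output line above), so treat it as illustrative:

# Illustrative only: a made-up object name following the GOES-16 RadC pattern.
sample = ("ABI-L1b-RadC/2020/229/23/"
          "OR_ABI-L1b-RadC-M6C04_G16_s20202292316171_e20202292318544_c20202292319005.nc")
band_id = sample[sample.find("M6") + 1:sample.find("_G16")]  # -> "6C04"
key = sample[sample.find("s") + 1:sample.find("_e")]         # -> "20202292316171"
print(key, band_id)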
UPDATE 1: So I decided to do another test to see whether it was Apache Beam or whether I had set up my subscription wrong. I used the following code:
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

# TODO(developer)
project_id = "fire-neural-network"
subscription_id = "custom"
# Number of seconds the subscriber should listen for messages
# timeout = 5.0

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    objectId = message.attributes.get('objectId')
    print("Received message: {}".format(objectId))
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print("Listening for messages on {}..\n".format(subscription_path))

# Wrap subscriber in a 'with' block to automatically call close() when done.
with subscriber:
    try:
        # When `timeout` is not set, result() will block indefinitely,
        # unless an exception is encountered first.
        streaming_pull_future.result()
    except TimeoutError:
        streaming_pull_future.cancel()
This was to check whether I got 16 messages every 5 minutes, and indeed I did. So it must be something wrong with my Apache Beam code and not the subscription. Additionally, I noticed that Apache Beam is not acking my messages, whereas the code above does; I think this is the cause of the problem. But I'm not sure how to make Beam actually ack the messages. I've looked here: When does Dataflow acknowledge a message of batched items from PubSubIO?, which says to add a GroupByKey after the Pub/Sub read. I tried that, but it still doesn't work:
import apache_beam as beam
import os
import datetime
import json

# gcloud beta pubsub subscriptions create custom --project fire-neural-network --topic projects/gcp-public-data---goes-16/topics/gcp-public-data-goes-16 --message-filter='hasPrefix(attributes.objectId,"ABI-L1b-RadC/")' --enable-message-ordering
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/home/n/Keys/fire-neural-network-3b8e8eff4400.json"

options = {
    'streaming': True
}

def print_name(message):
    result = json.loads(message)
    if 'ABI-L1b-RadC' in result['name']:
        file_string = result['name']
        band_id = file_string[file_string.find("M6") + 1:file_string.find("_G16")]
        key = file_string[file_string.find("s") + 1:file_string.find("_e")]
        print(f"Message received at : {datetime.datetime.utcnow()} key : {key} band : {band_id}")

output_path = 'gs://fire-neural-network'
runner = 'DirectRunner'
opts = beam.pipeline.PipelineOptions(flags=[], **options)

with beam.Pipeline(runner, options=opts) as p:
    sub_message = (
        p
        | "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription='mysub')
        | "Window into" >> beam.WindowInto(beam.transforms.window.FixedWindows(5))
    )
    grouped_message = (
        sub_message
        | "Add Dummy Key" >> beam.Map(lambda elem: (None, elem))
        | "Groupby" >> beam.GroupByKey()
        | "Abandon Dummy Key" >> beam.MapTuple(lambda _, val: val)
    )
    grouped_message | "Write" >> beam.io.WriteToPubSub('mytopic')
    grouped_message | "Print" >> beam.Map(print_name)

    job = p.run()
    if runner == 'DirectRunner':
        job.wait_until_finish()
UPDATE 2: I made my own separate test topic and subscription to check whether it was the GCS notification subscription interacting with Apache Beam that was causing the problem. When I use my own topic and subscription, all the messages are acked properly; it's only this public bucket combined with Apache Beam that shows the weird behaviour. I may just end up scrapping my Beam pipeline and writing my own unoptimized pipeline with Google's Pub/Sub API, roughly along the lines of the sketch below.
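If I do go that route, a minimal version would look something like this. It just reuses the project and the "custom" filtered subscription from UPDATE 1 and does the same parsing inside the callback; the print is only a placeholder for whatever transformations the real pipeline would do:

import json
from google.cloud import pubsub_v1

project_id = "fire-neural-network"
subscription_id = "custom"  # the filtered subscription from UPDATE 1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    # The notification payload is the GCS object metadata as JSON.
    result = json.loads(message.data.decode("utf-8"))
    file_string = result['name']
    band_id = file_string[file_string.find("M6") + 1:file_string.find("_G16")]
    key = file_string[file_string.find("s") + 1:file_string.find("_e")]
    print(f"key : {key} band : {band_id}")  # real transformations would go here
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    streaming_pull_future.result()  # blocks indefinitely, handling messages as they arrive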