I've been struggling with this problem for a while and can't quite find a fix. I'm building a pipeline that takes data from a public Google Cloud Storage bucket and does some transformations on it. The thing I'm struggling with right now is getting Apache Beam to receive Pub/Sub messages whenever a file is uploaded to the bucket. There is a public topic for this called projects/gcp-public-data---goes-16/topics/gcp-public-data-goes-16. The bucket is the public GOES-16 bucket: https://console.cloud.google.com/storage/browser/gcp-public-data-goes-16/. I'm particularly interested in the ABI-L1b-RadC folder, so I created my subscription with this:
gcloud beta pubsub subscriptions create goes16-ABI-data-sub-filtered-test --project my-project --topic projects/gcp-public-data---goes-16/topics/gcp-public-data-goes-16 --message-filter='hasPrefix(attributes.objectId,"ABI-L1b-RadC/")' --enable-message-ordering
This works for the most part: I get ABI-L1b-RadC messages, in order, about every 5 minutes. However, I should be getting 16 messages (one for each band) every 5 or so minutes, since that's when files are published to Cloud Storage. Instead, I always get fewer (anywhere from 2 to 13) messages every 5 minutes. At first I thought maybe Cloud Storage was messing something up, so I checked the bucket every 5 minutes, and there are files in Cloud Storage that I have not received messages for in Apache Beam. Here is the code and output I'm using to debug this problem.
import apache_beam as beam
import os
import datetime
import json

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = my_key_path

options = {
    'streaming': True
}

def print_name(message):
    result = json.loads(message)
    file_string = result['name']
    # Band is the "6Cxx" piece between "M6" and "_G16"; key is the start timestamp after "_s".
    band_id = file_string[file_string.find("M6") + 1:file_string.find("_G16")]
    key = file_string[file_string.find("s") + 1:file_string.find("_e")]
    print(f"Message received at : {datetime.datetime.utcnow()} key : {key} band : {band_id}")

runner = 'DirectRunner'
opts = beam.pipeline.PipelineOptions(flags=[], **options)

with beam.Pipeline(runner, options=opts) as p:
    sub_message = (p | 'Sub' >> beam.io.ReadFromPubSub(subscription='my_sub_path'))
    sub_message | 'print name' >> beam.FlatMap(print_name)

    job = p.run()
    if runner == 'DirectRunner':
        job.wait_until_finish()
output:
Message received at : 2020-08-16 23:19:05.360728 key : 20202292316171 band : 6C04
Message received at : 2020-08-16 23:19:18.464376 key : 20202292316171 band : 6C13
Message received at : 2020-08-16 23:19:18.980477 key : 20202292316171 band : 6C14
Message received at : 2020-08-16 23:19:19.972165 key : 20202292316171 band : 6C03
Message received at : 2020-08-16 23:19:21.116554 key : 20202292316171 band : 6C05
Message received at : 2020-08-16 23:24:03.847833 key : 20202292321171 band : 6C04
Message received at : 2020-08-16 23:24:16.814699 key : 20202292321171 band : 6C06
Message received at : 2020-08-16 23:24:17.393739 key : 20202292321171 band : 6C08
Message received at : 2020-08-16 23:29:07.558796 key : 20202292326171 band : 6C04
Message received at : 2020-08-16 23:29:21.100278 key : 20202292326171 band : 6C13
Message received at : 2020-08-16 23:29:21.771230 key : 20202292326171 band : 6C15
Message received at : 2020-08-16 23:34:15.474699 key : 20202292331171 band : 6C15
Message received at : 2020-08-16 23:34:16.006153 key : 20202292331171 band : 6C12
The key is just the start timestamp of the file (a quick check of the parsing is shown right after this paragraph). So, as you can see, I don't receive 16 messages every 5 minutes, even though there are files in the bucket that I never get messages for. I also tried making new subscriptions without --enable-message-ordering and without the hasPrefix filter, but it doesn't change anything. Any help is appreciated.
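To make the key/band parsing concrete, here is a quick check of the same slicing on a sample object name. The name below is made up to follow the RadC naming pattern (only the key and band values are taken from the first output line above), so treat it as illustrative:

# Illustrative only: a made-up object name following the GOES-16 RadC pattern.
sample = ("ABI-L1b-RadC/2020/229/23/"
          "OR_ABI-L1b-RadC-M6C04_G16_s20202292316171_e20202292318544_c20202292319005.nc")
band_id = sample[sample.find("M6") + 1:sample.find("_G16")]  # -> "6C04"
key = sample[sample.find("s") + 1:sample.find("_e")]         # -> "20202292316171"
print(key, band_id)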
UPDATE 1: So I decided to do another test to see whether it was Apache Beam or whether I had set up my subscription wrong. I used the following code:
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

# TODO(developer)
project_id = "fire-neural-network"
subscription_id = "custom"
# Number of seconds the subscriber should listen for messages
# timeout = 5.0

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    objectId = message.attributes.get('objectId')
    print("Received message: {}".format(objectId))
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
print("Listening for messages on {}..\n".format(subscription_path))

# Wrap subscriber in a 'with' block to automatically call close() when done.
with subscriber:
    try:
        # When `timeout` is not set, result() will block indefinitely,
        # unless an exception is encountered first.
        streaming_pull_future.result()
    except TimeoutError:
        streaming_pull_future.cancel()
This was to check whether I got 16 messages every 5 minutes, and indeed I did. So it must be something wrong with my Apache Beam code and not the subscription. Additionally, I noticed that Apache Beam is not acking my messages, whereas the code above does; I think this is the cause of the problem. But I'm not sure how to make Beam actually ack the messages. I've looked here: When does Dataflow acknowledge a message of batched items from PubSubIO?, which says to add a GroupByKey after the Pub/Sub read. I tried that, but it still doesn't work:
import apache_beam as beam
import os
import datetime
import json

# gcloud beta pubsub subscriptions create custom --project fire-neural-network --topic projects/gcp-public-data---goes-16/topics/gcp-public-data-goes-16 --message-filter='hasPrefix(attributes.objectId,"ABI-L1b-RadC/")' --enable-message-ordering
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/home/n/Keys/fire-neural-network-3b8e8eff4400.json"

options = {
    'streaming': True
}

def print_name(message):
    result = json.loads(message)
    if 'ABI-L1b-RadC' in result['name']:
        file_string = result['name']
        band_id = file_string[file_string.find("M6") + 1:file_string.find("_G16")]
        key = file_string[file_string.find("s") + 1:file_string.find("_e")]
        print(f"Message received at : {datetime.datetime.utcnow()} key : {key} band : {band_id}")

output_path = 'gs://fire-neural-network'
runner = 'DirectRunner'
opts = beam.pipeline.PipelineOptions(flags=[], **options)

with beam.Pipeline(runner, options=opts) as p:
    sub_message = (
        p
        | "Read PubSub Messages" >> beam.io.ReadFromPubSub(subscription='mysub')
        | "Window into" >> beam.WindowInto(beam.transforms.window.FixedWindows(5))
    )
    grouped_message = (
        sub_message
        | "Add Dummy Key" >> beam.Map(lambda elem: (None, elem))
        | "Groupby" >> beam.GroupByKey()
        | "Abandon Dummy Key" >> beam.MapTuple(lambda _, val: val)
    )
    grouped_message | "Write" >> beam.io.WriteToPubSub('mytopic')
    grouped_message | "Print" >> beam.Map(print_name)

    job = p.run()
    if runner == 'DirectRunner':
        job.wait_until_finish()
UPDATE 2: I made my own separate test topic and subscription to check whether it was the GCS notification subscription interacting with Apache Beam that was causing the problem. When I use my own topic and subscription, all the messages are acked properly; it's only this public bucket combined with Apache Beam that shows the weird behaviour. I may just end up scrapping my Beam pipeline and writing my own unoptimized pipeline with Google's Pub/Sub API, roughly along the lines of the sketch below.
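If I do go that route, a minimal version would look something like this. It just reuses the project and the "custom" filtered subscription from UPDATE 1 and does the same parsing inside the callback; the print is only a placeholder for whatever transformations the real pipeline would do:

import json
from google.cloud import pubsub_v1

project_id = "fire-neural-network"
subscription_id = "custom"  # the filtered subscription from UPDATE 1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    # The notification payload is the GCS object metadata as JSON.
    result = json.loads(message.data.decode("utf-8"))
    file_string = result['name']
    band_id = file_string[file_string.find("M6") + 1:file_string.find("_G16")]
    key = file_string[file_string.find("s") + 1:file_string.find("_e")]
    print(f"key : {key} band : {band_id}")  # real transformations would go here
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    streaming_pull_future.result()  # blocks indefinitely, handling messages as they arrive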