I'm currently running some tests on the latest Pub/Sub release, google-cloud-pubsub==0.35.4. My intention is to process a never-ending stream (varying in load) using a dynamic number of subscriber clients.

However, when I have a queue of, say, 600 messages and 1 client running, and then add additional clients:

  • Expected: All remaining messages get distributed evenly across all clients
  • Observed: Only new messages are distributed across clients; any older messages are sent to pre-existing clients

Below is a simplified version of what I use for my clients (for reference, we'll only be running the low-priority topic). I won't include the publisher, since it has no bearing on the issue.

import datetime
import json
import logging
import queue
import random
import time
from collections import defaultdict

from google.cloud import pubsub_v1

PROJECT = 'my-project'  # placeholder project id
# subscribe() expects a subscription path, despite the constant's name:
BASE_TOPIC_PATH = 'projects/{project}/subscriptions/{topic}'

PRIORITY_HIGH = 1
PRIORITY_MEDIUM = 2
PRIORITY_LOW = 3

MESSAGE_LIMIT = 10
ACKS_PER_MIN = 100.00
ACKS_RATIO = {
    PRIORITY_LOW: 100,
}

PRIORITY_TOPICS = {
    PRIORITY_LOW: 'test_low',
}

PRIORITY_SEQUENCES = {
    PRIORITY_LOW: [PRIORITY_LOW, PRIORITY_MEDIUM, PRIORITY_HIGH],
}


class Subscriber:
    def __init__(self):
        logging.basicConfig()
        self.subscriber_client = pubsub_v1.SubscriberClient()
        self.subscriptions = {}

        # One FIFO buffer per priority; the streaming-pull callback fills
        # these and the main loop drains them.
        self.priority_queue = defaultdict(queue.Queue)

        # Weighted list of priorities so that random.choice approximates
        # the ACKS_RATIO percentages.
        self.priorities = []
        for option, percentage in ACKS_RATIO.items():
            self.priorities += [option] * percentage

    def subscribe_to_topic(self, topic, max_messages=10):
        # Open a streaming pull on the subscription; received messages are
        # dispatched asynchronously to process_message, throttled by the
        # flow-control settings.
        self.subscriptions[topic] = self.subscriber_client.subscribe(
            BASE_TOPIC_PATH.format(project=PROJECT, topic=topic),
            self.process_message,
            flow_control=pubsub_v1.types.FlowControl(
                max_messages=max_messages,
            ),
        )

    def un_subscribe_from_topic(self, topic):
        subscription = self.subscriptions.get(topic)
        if subscription:
            subscription.cancel()
            del self.subscriptions[topic]

    def process_message(self, message):
        # Streaming-pull callback: decode the JSON payload and buffer the
        # message in the queue matching its priority.
        json_message = json.loads(message.data.decode('utf8'))
        self.priority_queue[json_message['priority']].put(message)

    def retrieve_message(self):
        message = None
        # Pick a priority according to the ACKS_RATIO weighting, then fall
        # through its sequence until a non-empty queue is found.
        priority = random.choice(self.priorities)
        ack_priorities = PRIORITY_SEQUENCES[priority]

        for ack_priority in ack_priorities:
            try:
                message = self.priority_queue[ack_priority].get(block=False)
                break
            except queue.Empty:
                pass

        return message


if __name__ == '__main__':
    messages_acked = 0

    pub_sub = Subscriber()
    pub_sub.subscribe_to_topic(PRIORITY_TOPICS[PRIORITY_LOW], MESSAGE_LIMIT * 3)

    while True:
        msg = pub_sub.retrieve_message()
        if msg:
            json_msg = json.loads(msg.data.decode('utf8'))

            msg.ack()
            messages_acked += 1
            print ("%s - Akked Priority %s , High %s, Medium %s, Low %s" % (
                datetime.datetime.now().strftime('%H:%M:%S'),
                json_msg['priority'],
                pub_sub.priority_queue[PRIORITY_HIGH].qsize(),
                pub_sub.priority_queue[PRIORITY_MEDIUM].qsize(),
                pub_sub.priority_queue[PRIORITY_LOW].qsize(),
            ))

        time.sleep(60.0 / ACKS_PER_MIN)  # throttle to ACKS_PER_MIN acks/minute

I'm wondering whether this behaviour is inherent to how streaming pulls function, or whether there are configurations that can alter it.
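
For reference, the only client-side configuration I've found is the flow control passed to subscribe(); below is a minimal sketch of bounding a single client's buffer (max_bytes is a pubsub_v1.types.FlowControl field alongside the max_messages I already use):

from google.cloud import pubsub_v1

# Cap how many messages/bytes this one client may hold outstanding;
# the stream is paused once either limit is reached.
flow_control = pubsub_v1.types.FlowControl(
    max_messages=10,
    max_bytes=1024 * 1024,
)

This is the same flow_control argument used in subscribe_to_topic above; it bounds what a single client buffers, but I haven't found anything that redistributes an already-delivered backlog.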

Cheers!

1 Answer

According to the Cloud Pub/Sub documentation, Cloud Pub/Sub delivers each published message at least once for every subscription; nevertheless, there are some exceptions to this behavior:

  • A message that cannot be delivered within the maximum retention time of 7 days is deleted.
  • A message published before a given subscription was created will not be delivered.

In other words, the service delivers messages only to subscriptions that were created before the message was published; consequently, old messages will not be available to new subscriptions. As far as I know, Cloud Pub/Sub does not offer a feature to change this behavior.
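
To illustrate the second point, here is a minimal sketch (the project id and resource names are placeholders, and the topic is assumed to already exist): the subscription must already exist when a message is published for that message to be retained for it.

from google.cloud import pubsub_v1

PROJECT = 'my-project'  # placeholder project id

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = 'projects/%s/topics/test_low' % PROJECT
sub_path = 'projects/%s/subscriptions/test_low' % PROJECT

# Messages published before this call will never reach the subscription.
subscriber.create_subscription(name=sub_path, topic=topic_path)

# This message is retained for the subscription (up to the 7-day maximum)
# and will be delivered at least once.
publisher.publish(topic_path, b'published after subscription creation')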