I'm trying to write a simple Java consumer program that consumes messages from Google Cloud Pub/Sub and then de-duplicates and orders them.

I couldn't find a simple sample program that does this. I've read the Google documentation, which points users to Apache Beam. However, I'm not familiar with Apache Beam, and I'd like a basic sample program that demonstrates this capability: something that simply takes a comparator, removes duplicate messages, and returns the messages ordered by an attribute.

Can someone provide such a sample Java program?

Eliyahu Machluf

2 Answers

If nothing exists, it's because it's not "really" possible.

First, it's useful to ask when Pub/Sub produces duplicates. That happens only when a message is delivered and the acknowledgement is not received (or is not sent within the expected time frame, 10 s by default), or, in push mode, when the endpoint does not return a success status such as HTTP 200.

Second, what is Beam? Beam is a pipeline engine. You can plug Pub/Sub into it, and your pipeline will read the messages and deduplicate them. Be careful: Beam performs this deduplication only within a window of 10 to 20 minutes.

Third, what does "ordered" mean? Look at the ID of your message. The value is a timestamp in microseconds (that's why Pub/Sub can ingest up to 1M messages per second). Ordering the messages would mean releasing a message only when its ID is the next one in sequence, and otherwise putting it in a buffer and waiting for the gaps to be filled. Of course, the gaps will never be filled...

Back to Beam. Beam lets you define observation windows. For example, you can define sliding windows of 5 minutes, with a new window starting every minute. When a window closes, a PCollection of its messages is emitted and processed by your pipeline. On this finite collection, you can order your messages.

By the same principle, you can remove duplicates manually in this collection.
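
Outside Beam, the same idea (sort and de-duplicate one finite batch) can be sketched in plain Java. This is only an illustration under assumptions: `BatchDeduper` and `sortAndDedupe` are hypothetical names, the batch stands in for the contents of one closed window, and any two items the comparator considers equal are treated as duplicates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper: sorts one finite batch and drops comparator-equal items. */
final class BatchDeduper {
  private BatchDeduper() {}

  static <T> List<T> sortAndDedupe(List<T> batch, Comparator<? super T> comparator) {
    // A TreeSet keeps its elements ordered by the comparator and ignores any
    // element that compares equal to one already present (the first one wins).
    TreeSet<T> ordered = new TreeSet<>(comparator);
    ordered.addAll(batch);
    return new ArrayList<>(ordered);
  }
}
```

For Pub/Sub messages, the comparator could compare a message attribute, for example `Comparator.comparing((PubsubMessage m) -> m.getAttributesOrThrow("seq"))`, where `"seq"` is an assumed attribute name. Keep in mind that two distinct messages carrying the same attribute value would then be collapsed into one.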

One last point: Pub/Sub is a backbone of Google's services. It evolves slowly because it's critical. But maybe your requirement will be released one day!

guillaume blaquiere

Cloud Pub/Sub now supports ordered delivery. The feature has been GA since October 2020. To order messages, set the enable_message_ordering property on the subscription to true and set the ordering_key property on the messages you want ordered (Java sample). All messages with the same ordering key are delivered to subscribers in the order in which they were received by the service. Note that Dataflow cannot yet take advantage of this feature.

Deduplication would still have to be done by the client, though it should be easier with ordered delivery, since you can more easily track which messages have already been delivered. If you have no tolerance for duplicates, you may need to persistently store the list of messages you have processed (or the most recent message you have processed) so that you can detect duplicate messages and discard them.
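
As a rough illustration of tracking processed messages, here is a minimal sketch of a bounded record of message IDs. It is an in-memory stand-in only: `SeenIds`, `markIfNew`, and the capacity limit are hypothetical, and a truly duplicate-intolerant consumer would need to back this with persistent storage, which is not shown.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical bounded, in-memory record of already-processed message IDs. */
final class SeenIds {
  private final Set<String> seen;

  SeenIds(final int capacity) {
    // An insertion-ordered LinkedHashMap that evicts its oldest entry
    // whenever an insert pushes it past the capacity.
    Map<String, Boolean> bounded = new LinkedHashMap<String, Boolean>() {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > capacity;
      }
    };
    // Wrap it as a synchronized Set so concurrent receiver threads can share it.
    this.seen = Collections.synchronizedSet(Collections.newSetFromMap(bounded));
  }

  /** Returns true the first time an ID is seen, false while it is still remembered. */
  boolean markIfNew(String messageId) {
    return seen.add(messageId);
  }
}
```

A receiver could call `markIfNew(message.getMessageId())` and skip processing when it returns false. Once an ID has been evicted, a late duplicate of that message would be processed again, so the capacity is a trade-off between memory and the redelivery horizon you expect.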

A simple class that implements the MessageReceiver interface and does only in-memory deduplication might look something like this:

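import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.protobuf.util.Timestamps;
import com.google.pubsub.v1.PubsubMessage;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DedupingSubscriber implements MessageReceiver {
  // Publish time (in nanoseconds) of the newest message processed for each ordering key.
  private final ConcurrentMap<String, Long> mostRecentPerKey = new ConcurrentHashMap<>();

  @Override
  public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
    Long keyTime = mostRecentPerKey.get(message.getOrderingKey());
    long messageTime = Timestamps.toNanos(message.getPublishTime());

    // Process the first message seen for a key, or any message newer than the last one processed.
    if (keyTime == null || messageTime > keyTime) {
      // processMessage(message); // Do what needs to be done with message.
      mostRecentPerKey.put(message.getOrderingKey(), messageTime);
    }
    // Ack duplicates as well so that they are not redelivered.
    consumer.ack();
  }
}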
Kamal Aboul-Hosn