
I have a Dataflow pipeline that collects user data such as navigation, purchases, CRUD actions, etc. I have a requirement to identify patterns in real time and then dispatch Pub/Sub events that other services can listen to in order to provide the user with real-time tips, offers, or promotions.

I'm thinking of starting by grouping the events by user ID and then, if they match a pattern, creating a PCollection that contains the event names that need to be published via Pub/Sub.

Is this the right approach? Is there a better way?

chchrist

1 Answer


This could certainly work for some use cases.

If you use session-based windowing in combination with early firings (triggering upon the arrival of each element), you have all the data needed to identify patterns each time a new element arrives.
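With the Beam/Dataflow Java SDK, that could look roughly like the sketch below. The `Event` type, the `getUserId()` accessor, and the 30-minute session gap are assumptions for illustration, not part of the question:

```java
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Key each event by user id, then open a session window per user that fires on
// every arriving element and accumulates the pane, so each firing sees all the
// session data collected so far.
PCollection<KV<String, Event>> keyed = events
    .apply(WithKeys.of((Event e) -> e.getUserId())
        .withKeyType(TypeDescriptors.strings()));

PCollection<KV<String, Iterable<Event>>> sessions = keyed
    .apply(Window.<KV<String, Event>>into(
            Sessions.withGapDuration(Duration.standardMinutes(30)))
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(GroupByKey.create());
```

Downstream, a ParDo can run the pattern matching on each fired pane and emit the event names to publish.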

However, depending on the rate at which user data is pushed and the size of the session, this might result in holding a lot of data in the PCollection and re-running the pattern matching many times on the same data, since every firing processes all the data in the session. Furthermore, you cannot use elements that arrived before the session started.

Sometimes you might be better off keeping state for each user (instead of redoing the pattern matching on all of the user's data for the session). Using state would in fact remove the need for windowing. The new process would look like this:

For each element that arrives:

  1. Fetch the current state

  2. Calculate the new state (based on the old state and the new element)

  3. If needed, emit a message to Pub/Sub.

To hold your state, you could use Bigtable or Datastore.
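Sketched as a Beam-style DoFn, that per-element flow might look like the following. UserState, StateStore, matchesPattern(), and the topic name are hypothetical placeholders standing in for your own state model and Bigtable/Datastore client code:

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

// Sketch only: UserState, StateStore and matchesPattern() are hypothetical
// placeholders for your own state model and Bigtable/Datastore client.
class PatternMatchFn extends DoFn<Event, String> {
  private transient StateStore store;

  @Setup
  public void setup() {
    // Open the Bigtable/Datastore connection once per worker.
    store = StateStore.connect();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    Event event = c.element();
    String userId = event.getUserId();

    // 1. Fetch the current state for this user.
    UserState state = store.read(userId);

    // 2. Calculate the new state from the old state and the new element.
    UserState newState = state.update(event);
    store.write(userId, newState);

    // 3. If a pattern is matched, emit the name of the event to publish.
    if (newState.matchesPattern()) {
      c.output("promo_offer");
    }
  }
}

// The emitted event names can then be written to Pub/Sub:
events
    .apply(ParDo.of(new PatternMatchFn()))
    .apply(PubsubIO.writeStrings().to("projects/my-project/topics/user-tips"));
```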

Fematich
  • Yes, session windows will hold lots of data and running the same patterns over and over again might be very costly... For getting the state from Bigtable or Datastore I'm worried about speed, but I think it's worth checking. – chchrist Aug 25 '16 at 21:49
  • Bigtable can scale. Single digit millisecond response time is the norm when operations are all done within the same cloud zone. There are also bulk operations to improve throughput. We can discuss all of that at your leisure. – Solomon Duskis Aug 29 '16 at 19:31
  • haven't played with Bigtable yet. I will give it a go. I'd like to discuss it with you. – chchrist Aug 30 '16 at 09:48
  • Feel free to reach out on the google-cloud-bigtable-discuss google group, as described in: https://cloud.google.com/bigtable/docs/support – Solomon Duskis Aug 30 '16 at 17:01