
I'm looking for an appropriate Google data/storage option to use as a location to stream raw JSON events into.

The events are generated by users in response to very large email broadcasts, so throughput could be very low one moment and up to ~25,000 events per second for short periods of time. The JSON representation for these events will probably only be around 1 KB each.

I want to simply store these events as raw and unprocessed JSON strings, append-only, with a separate sequential numeric identifier for each record inserted. I'm planning to use this identifier as a way for consuming apps to be able to work through the stream sequentially (in a similar manner to the way Kafka consumers track their offset through the stream) - this will allow me to replay the event stream from points of my choosing.
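For illustration, here is a minimal sketch of the consumer pattern I have in mind, assuming for the moment a SQL-style store with an auto-incrementing id column (the table name, columns, and connection details are all hypothetical):

```python
import time

import pymysql  # assumed driver; any SQL client would work the same way


def consume(conn, start_offset=0, batch_size=500):
    """Poll for rows above the last-seen id, Kafka-offset style."""
    offset = start_offset
    while True:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload FROM events "
                "WHERE id > %s ORDER BY id LIMIT %s",
                (offset, batch_size),
            )
            rows = cur.fetchall()
        for row_id, payload in rows:
            handle(payload)   # application-specific processing
            offset = row_id   # persisting this offset allows replay from any point
        if not rows:
            time.sleep(1)     # back off once caught up with the stream


def handle(payload):
    print(payload)


conn = pymysql.connect(host="10.0.0.1", user="app", password="...", db="events_db")
consume(conn, start_offset=0)
```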

I am taking advantage of Google Cloud Logging to aggregate the event stream from Compute Engine nodes; from there I can stream directly into a BigQuery table or a Pub/Sub topic.

BigQuery seems more than capable of handling the streaming inserts; however, it has no concept of auto-incrementing id columns, and its documentation suggests that its query model is best suited to aggregate queries rather than narrow result sets. My requirement to query for the next-highest row would clearly go against this.
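For reference, a rough sketch of what a streaming insert might look like with the google-cloud-bigquery Python client; the project, dataset, table, and column names below are placeholders:

```python
import json

from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client()
table_id = "my-project.events_dataset.raw_events"  # hypothetical table

# Each event is kept as an unprocessed JSON string plus a timestamp;
# note there is no auto-incrementing id column to lean on.
rows = [
    {
        "payload": json.dumps({"type": "open", "campaign": 42}),
        "received_at": "2016-03-29T15:10:00Z",
    },
]

errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert failed:", errors)
```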

The best idea I currently have is to push into Pub/Sub and have it write each event into a Cloud SQL database. That way Pub/Sub could buffer the events if Cloud SQL is unable to keep up. My desire for an auto-incrementing identifier and possibly a datestamp column makes this feel like a 'tabular' use case, so I'm feeling the NoSQL options might be inappropriate.
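A rough sketch of that pipeline, assuming the google-cloud-pubsub subscriber client feeding a hypothetical MySQL-flavoured Cloud SQL table (the names, credentials, and schema are all illustrative):

```python
import pymysql

from google.cloud import pubsub_v1  # assumes google-cloud-pubsub is installed

# Hypothetical append-only table:
#   CREATE TABLE events (
#       id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
#       received_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
#       payload     TEXT NOT NULL
#   );

conn = pymysql.connect(host="10.0.0.2", user="app", password="...",
                       db="events_db", autocommit=True)


def callback(message):
    # Store the raw JSON string untouched; the AUTO_INCREMENT id provides
    # the sequential offset that consuming apps will read from.
    with conn.cursor() as cur:
        cur.execute("INSERT INTO events (payload) VALUES (%s)",
                    (message.data.decode("utf-8"),))
    message.ack()


subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "raw-events-sub")
future = subscriber.subscribe(subscription, callback=callback)
future.result()  # block and process messages until interrupted
```

The idea is that Pub/Sub absorbs the ~25,000 events/second bursts while the writer drains into Cloud SQL at whatever rate it can sustain.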

If anybody has a better suggestion I would love to get some input.

samaspin
  • Keep in mind that Pub/Sub does NOT guarantee that events will be consumed in order. "Also, though Pub/Sub usually delivers messages in order of publication, this is not guaranteed; it is possible for subscriptions to receive messages out of order" (https://cloud.google.com/pubsub/publisher) I can speak from experience that this order is not kept during continuous streaming. What kind of delay can you work with? You could add your ID in the publishing phase, save your results to cloud storage, and then sort your results during the querying phase. – Gil Adirim Mar 29 '16 at 15:10
  • Cheers Gil. The order the items are written to storage in isn't really an issue here provided the window is quite small. If things were getting written an hour or two out of order then this might be an issue. Events being written within a few minutes is the ideal; I could live with an hour or two of delay in reality. You have given me an idea with the cloud storage approach - perhaps I could use a sequential integer id as the last part of the Cloud Storage key. I need to have a think about how that might work and what we would lose. – samaspin Mar 29 '16 at 16:14
  • 1
    Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting. If you could change your key design a bit. Take a look at the two sections here. (key design and avoid hotspotting) https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind – Les Vogel - Google DevRel Mar 29 '16 at 18:22

1 Answer


We know that many customers have had success using BigQuery for this purpose, but it requires some work to choose appropriate identifiers if you want to supply your own. It's not clear to me from your example why you couldn't just use a timestamp as the identifier and use the ingestion-time partitioned table streaming ingestion option:

https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_ingestion-time_partitioned_tables
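As a rough illustration (the table, column, and parameter names are placeholders), consumers could then page through the stream by publish timestamp instead of a synthetic id, using the _PARTITIONTIME pseudo-column of an ingestion-time partitioned table to limit the scan:

```python
import datetime

from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client()

# Hypothetical schema: payload STRING, published_at TIMESTAMP.
query = """
    SELECT payload, published_at
    FROM `my-project.events_dataset.raw_events`
    WHERE _PARTITIONTIME >= TIMESTAMP('2016-03-29')
      AND published_at > @last_seen
    ORDER BY published_at
    LIMIT 500
"""
last_seen = datetime.datetime(2016, 3, 29, 15, 10, tzinfo=datetime.timezone.utc)
job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("last_seen", "TIMESTAMP", last_seen),
        ]
    ),
)
for row in job.result():
    print(row.published_at, row.payload)
```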

As far as Cloud Bigtable, as noted by Les in the comments:

Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting.

See: https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind

You could again use a timestamp as a key here, although you would want to do some work to, e.g., add a hash or other uniquifier to ensure that at your 25k writes/second peak you don't overwhelm a single node (we can generally handle about 10k row modifications per second per node, and if you just use lexicographically sequential IDs like an incrementing number, all your writes would be going to the same server).
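A minimal sketch of one such key scheme, with a small fixed number of hash-derived buckets prefixed onto the timestamp so that writes arriving at the same instant spread across the keyspace (the bucket count and field layout are illustrative, not a recommendation for your exact workload):

```python
import hashlib
import time

NUM_BUCKETS = 32  # illustrative; tune to cluster size and write rate


def row_key(event_id, event_time=None):
    """Build a row key of the form '<bucket>#<timestamp_ms>#<event_id>'.

    The bucket prefix comes from a hash of the event id, so concurrent
    writes land on different parts of the keyspace instead of hotspotting
    a single node the way a purely sequential key would.
    """
    if event_time is None:
        event_time = time.time()
    bucket = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    timestamp_ms = int(event_time * 1000)
    return f"{bucket:02d}#{timestamp_ms:013d}#{event_id}".encode()


# Two events written at the same millisecond get different key prefixes:
print(row_key("msg-0001", 1459264200.0))
print(row_key("msg-0002", 1459264200.0))
```

The trade-off is that a time-range read then has to fan out across the bucket prefixes and merge the results, which is essentially the salting pattern described in the schema-design page above.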

At any rate, it does seem like BigQuery is probably what you want to use. You could also refer to this blog post for an example of event tracking via BigQuery: https://medium.com/streak-developer-blog/using-google-bigquery-for-event-tracking-23316e187cbd

Ramesh Dharan