3

Bigtable row key scenario to avoid hotspotting?

A company needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?

  • A. Rowkey: date#device_id, Column data: data_point
  • B. Rowkey: date, Column data: device_id, data_point
  • C. Rowkey: device_id, Column data: date, data_point
  • D. Rowkey: data_point, Column data: device_id, date
  • E. Rowkey: date#data_point, Column data: device_id

What would be the best option in above?

Misha Brukman
  • 12,938
  • 4
  • 61
  • 78
Roshan Fernando
  • 493
  • 11
  • 31
  • Is this a homework question? Please see [How do I ask and answer homework questions?](https://meta.stackoverflow.com/q/334822) for more information. If so, please annotate it as such. What do you think it should be and why? Have you looked at the docs on Bigtable schema design? – Misha Brukman Dec 22 '18 at 23:31
  • Thank you. Yes, I am aware of the norms as per the Big table documentation. But this question's answer options were confusing. – Roshan Fernando Dec 23 '18 at 01:49
  • 1
    If you've read the schema design documentation, you should be able to eliminate some obviously bad choices from the list, leaving a few, and then you can decide which is the best from the remaining set, given the most frequent types of queries you want to run — for each schema and for each query (here there's only 1 such query), consider how you can implement such a query, and see which schema provides the most efficient implementation of such a query. Hope this helps. – Misha Brukman Dec 23 '18 at 20:50
  • 1
    Yes, it cannot be A, B and E due to it starting with Date, which could cause lexicographical sort filling each node. I am confused with C or D option here. Just deviceid could cause multiple versions of Column data for date and data point. Having data_point as row key, i am not sure if that could be really be queryable. Can you provide inputs on which out of C or D would be a best option. [I am aware a row key of deviceid#date would be best option here, but would like to know the best from the available options above] – Roshan Fernando Jan 30 '19 at 05:17
  • Bigtable has built-in cell versionining: every value in a column has an associated 64-bit timestamp, so you can write multiple data points, keyed by time, within a single row, for a single column. Does that help? – Misha Brukman Jan 30 '19 at 17:47
  • Note that there are additional (more advanced considerations) here, e.g., you don't want to store an unbounded amount of data in a single row, and split large amounts of data across multiple rows. If you ignore that consideration, there's a single best answer among these. If you include that consideration, there's a better answer than any listed here. Please connect me with your professor if they're interested in discussing this further; we're happy to help further with this! – Misha Brukman Jan 30 '19 at 17:47
  • @RoshanFernando I believe A is the right option. It has a good row key design. Maybe not the best as it starts with the timestamp. But seems better than the other row keys. I don't ilke option C as it could lead to a large row size as you don't know how many dates the measuring will be running for. Also, C gives little flexibility for new queries. At last, is mentioned as anti pattern (https://cloud.google.com/bigtable/docs/schema-design-time-series#use_tall_and_narrow_tables) – Gabriel Feb 27 '19 at 17:40
  • This could be usefull. It is a very similar problem. https://cloud.google.com/bigtable/docs/schema-design-time-series#intelligent_energy_meters_internet_of_things – Gabriel Feb 27 '19 at 20:52

2 Answers2

5

According to the Bigtable schema documentation:

Rows are sorted lexicographically by row key.

This means that in order to avoid hotspotting, common queries should return row results that sequential.

Essentially, you want to be querying rows with a given date and device id. Google Cloud Bigtable allows you query rows by a certain row key prefix. Since the most common queries all the data for a given device and date, the device and date need to be part of the row prefix query, and must be the first two entries in a row key.

rohanphadte
  • 978
  • 6
  • 19
  • `the device and data need to be part of the row prefix query,` should that be `the device and date`, not `data`? I think if you're querying you have the device and date so you can compute the row key. – nomadic_squirrel Nov 30 '20 at 00:22
  • you're correct - I had meant to write date. Just edited the post. :) – rohanphadte Nov 30 '20 at 00:37
0

You have 2 kind of solution. Big Table make a Lexigoraphy dictionary using the rowkeys enforcing the organization about

1 - Add before each rowkey(prefix) a letter to force the Big Table make a lexicography index spreading your rows across the alphabet letters and avoid a colision during i/o. This technique is called Salted Table.

Ex.

123 456 789 101112 131415

a123 a456 b789 b101112 c131415

2- You can use a MD5 Hash, avoiding repeat the prefix before hasing and this way garantee a variety of prefix and this way The Big Table spreads the rowkeys across the instance´s disk.