4

I would like to design an optimal row key in Bigtable. I know that key design is crucial for query speed and optimization. My case involves time-series information from a network machine. It is a tall and narrow table with 3 columns: id, info and datetime.

My most frequent query is to get all info for each id for each day.

How should the key be designed to obtain the best performance? id#date?

Misha Brukman
lordcenzin

2 Answers

7

Disclosure: I lead product management for Google Cloud Bigtable.

My case involves time-series information from a network machine. It is a tall and narrow table with 3 columns: id, info and datetime.

Given that the id is in the row key, I am not sure if you need a separate id column.

Similarly, can you please clarify why you need to have datetime as a separate column? Note that each value in Cloud Bigtable has an associated timestamp, so you don't need to store the date/time in a separate column.

My most frequent query is to get all info for each id for each day.

How should the key be designed to obtain the best performance? id#date?

My recommendation would be to do as you suggested: use id#date as the row key, and store all the data for that date within a single row, using the timestamp of each cell to differentiate readings, so that you can recover the exact timestamp of each one.

As per above, I think you can drop both the id and datetime columns, and accomplish this use case with just a single column for the table.
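For illustration, here is a minimal sketch of that layout using the Python client (google-cloud-bigtable); the project, instance, table, column family, and sample values (my-project, my-instance, metrics, cf1, machine-42, etc.) are placeholders, not anything from the question:

```python
from datetime import datetime, timezone

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("metrics")

# Row key is id#date; every reading for that day is written as a cell
# version in a single column, distinguished by its cell timestamp.
machine_id = "machine-42"   # placeholder id
day = "2015-10-19"          # placeholder date
row = table.direct_row(f"{machine_id}#{day}".encode())
row.set_cell("cf1", "info", b"reading-value",
             timestamp=datetime.now(timezone.utc))
row.commit()

# Reading the row back returns every timestamped cell for that day.
result = table.read_row(f"{machine_id}#{day}".encode())
for cell in result.cells["cf1"][b"info"]:
    print(cell.timestamp, cell.value)
```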

Best of luck with your project; please let us know how it goes!

Misha Brukman
  • Let's say I have used `id#date` as the row key, and `hoursWorked` as the only other column. I could easily get "Number of hours worked on a particular day". How can I meet future requirements like "Number of hrs worked over the last 7 days/30 days/lifetime" while keeping latency and performance in check? Any recommendations on implementing these kinds of scenarios would be appreciated. – tezz Jun 30 '20 at 06:10
  • @tezz – You're asking a different question than the one I answered; please create a new question. Comments are for clarification of existing answer, not for follow-up / additional questions or conversation. You can refer to this question or answer with a link from your question for context if needed. Thanks and all the best! – Misha Brukman Jun 30 '20 at 20:39
1

This is a complex topic that depends on the amount of data you expect to receive, the structure of the ids, the number of ids, etc.

The Cloud Bigtable documentation has a page covering this at length:

https://cloud.google.com/bigtable/docs/schema-design-time-series

which covers the topic in more depth than an answer here can.

How should the key be designed to obtain the best performance? id#date?

Assuming you have a lot of machines, and their ids are reasonably well distributed (so they do not hotspot), then something like id#date#time may work. You can then query for an id and a date using a row-range scan, which should be very efficient.
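As a rough sketch (Python client, with placeholder project/instance/table names and sample ids, none of which come from the question), a day's worth of readings for one machine can then be fetched with a single row-range scan over the id#date# prefix:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("metrics")

machine_id = "machine-42"  # placeholder id
day = "2015-10-19"         # placeholder date

# With row keys of the form id#date#time, all readings for one id on one
# day share the prefix "id#date#", so a contiguous range scan returns them.
start_key = f"{machine_id}#{day}#".encode()
end_key = f"{machine_id}#{day}$".encode()  # '$' is the byte right after '#'

for row in table.read_rows(start_key=start_key, end_key=end_key):
    print(row.row_key)
```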

coryan