Use case:
I am using Google BigTable to store counts like this:
| rowkey | columnfamily:col1 | columnfamily:col2 | columnfamily:col3 |
|--------|-------------------|-------------------|-------------------|
| row1   | 1                 | 2                 | 3                 |
| row2   | 2                 | 4                 | 8                 |
| row3   | 3                 | 3                 | 3                 |
I want to read all rows for a given range of row keys (let's assume all in this case) and aggregate the values per column.
A naive implementation queries the rows and iterates over them while aggregating the counts, like this:
from google.cloud.bigtable import Client

# Connect to the instance and table
instance = Client(project='project').instance('my-instance')
table = instance.table('mytable')

col1_sum = 0
col2_sum = 0
col3_max = 0

# Read all rows and pull them into memory
row_data = table.read_rows()
row_data.consume_all()

# row_data.rows maps row key -> PartialRowData; cell values are raw bytes
for row in row_data.rows.values():
    cells = row.cells['columnfamily']
    col1_sum += int.from_bytes(cells[b'col1'][0].value, byteorder='big')
    col2_sum += int.from_bytes(cells[b'col2'][0].value, byteorder='big')
    col3_value = int.from_bytes(cells[b'col3'][0].value, byteorder='big')
    col3_max = max(col3_max, col3_value)
Question:
Is there a way to efficiently load the resulting rows into a pandas DataFrame and leverage pandas' performance to do the aggregation?
I would like to avoid the for loop for computing the aggregates, since iterating over rows one by one in Python is known to be very inefficient.
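To make the goal concrete, the end state I have in mind is roughly the following: once the values are in a DataFrame (index = row keys, one column per qualifier), the aggregation itself becomes a vectorized one-liner. The DataFrame below is hand-built from the sample table above purely for illustration; the part I am missing is how to get from the Bigtable read to this point efficiently.

```python
import pandas as pd

# Hand-built stand-in for the sample table above; in reality these values
# would have to come from the Bigtable read.
df = pd.DataFrame(
    {'col1': [1, 2, 3], 'col2': [2, 4, 3], 'col3': [3, 8, 3]},
    index=['row1', 'row2', 'row3'],
)

# Vectorized aggregation instead of the Python for loop
print(df.agg({'col1': 'sum', 'col2': 'sum', 'col3': 'max'}))
# col1    6
# col2    9
# col3    8
```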
I am aware of the Apache Arrow project and its Python bindings (pyarrow), and although HBase is mentioned as a backing project (and Google BigTable is advertised as being very similar to HBase), I can't seem to find a way to use it for the use case described here.
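For what it's worth, the kind of hand-off I was hoping to find is sketched below: a pyarrow Table (hard-coded here, since I can't find a Bigtable-to-Arrow path) converted to pandas in a single step.

```python
import pyarrow as pa

# Hard-coded Arrow table standing in for what I hoped a Bigtable read could
# produce directly; column names match the sample data above.
arrow_table = pa.Table.from_pydict({
    'col1': [1, 2, 3],
    'col2': [2, 4, 3],
    'col3': [3, 8, 3],
})

# One conversion step into pandas, after which the aggregation from the
# previous snippet applies unchanged.
df = arrow_table.to_pandas()
```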