
Use case:

I am using Google BigTable to store counts like this:

| rowkey | columnfamily:col1 | columnfamily:col2 | columnfamily:col3 |
|--------|-------------------|-------------------|-------------------|
| row1   | 1                 | 2                 | 3                 |
| row2   | 2                 | 4                 | 8                 |
| row3   | 3                 | 3                 | 3                 |

I want to read all rows for a given range of row keys (let's assume all in this case) and aggregate the values per column.

A naive implementation would query the rows and iterate over them while aggregating the counts, like this:

from google.cloud.bigtable import Client

instance = Client(project='project').instance('my-instance')
table = instance.table('mytable')

col1_sum = 0
col2_sum = 0
col3_max = 0

row_data = table.read_rows()
row_data.consume_all()

for row in row_data.rows.values():
    cells = row.cells['columnfamily']
    col1_sum += int.from_bytes(cells[b'col1'][0].value, byteorder='big')
    col2_sum += int.from_bytes(cells[b'col2'][0].value, byteorder='big')
    col3_value = int.from_bytes(cells[b'col3'][0].value, byteorder='big')
    col3_max = max(col3_max, col3_value)

Question:

Is there a way to efficiently load the resulting rows into a pandas DataFrame and leverage pandas' performance to do the aggregation?

I would like to avoid the Python for loop for computing the aggregates, as it is known to be very inefficient.

I am aware of the Apache Arrow project and its Python bindings, and although HBase is mentioned as a backing project (and Google BigTable is advertised as being very similar to HBase), I can't seem to find a way to use it for the use case described here.

bartaelterman

4 Answers


I don't believe there is an existing pandas interface to Cloud Bigtable, but this would be a nice project to build, similar to the BigQuery interface in https://github.com/pydata/pandas-gbq.
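
In the meantime, a rough sketch of what such a helper could look like, assuming the google.cloud.bigtable client from the question and that every cell stores a big-endian integer (both assumptions taken from the question, not from any released library):

import pandas as pd
from google.cloud.bigtable import Client


def read_rows_to_dataframe(table, column_family):
    # Hypothetical helper, not part of any library: materialise a
    # Bigtable row range as a DataFrame, assuming every cell in
    # `column_family` holds a big-endian integer.
    row_data = table.read_rows()
    row_data.consume_all()

    records = {}
    for row_key, row in row_data.rows.items():
        cells = row.cells[column_family]
        records[row_key.decode('utf-8')] = {
            qualifier.decode('utf-8'): int.from_bytes(cell_list[0].value, byteorder='big')
            for qualifier, cell_list in cells.items()
        }
    return pd.DataFrame.from_dict(records, orient='index')


table = Client(project='project').instance('my-instance').table('mytable')
df = read_rows_to_dataframe(table, 'columnfamily')
print(df[['col1', 'col2']].sum())  # vectorised column sums
print(df['col3'].max())            # vectorised column max

The per-row decoding still happens in Python, but the aggregation itself is handed to pandas.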

Wes McKinney

After diving deeper into the BigTable mechanics, it seems that the Python client executes a gRPC ReadRows call when you call table.read_rows(). That gRPC call returns a streaming response with the rows in key order over HTTP/2 (see the docs).

If the API returns data per row, it seems to me that the only sensible way to consume that response is row by row. There seems to be little use in trying to load the data in a columnar format just to avoid looping over the rows.
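
As an illustration of that row-based consumption, a sketch (assuming a client version where the result of table.read_rows() is directly iterable; otherwise use consume_all() as in the question):

col1_sum = 0
col2_sum = 0
col3_max = 0

# Rows arrive from the stream in key order; consume them one by one.
for row in table.read_rows():
    cells = row.cells['columnfamily']
    col1_sum += int.from_bytes(cells[b'col1'][0].value, byteorder='big')
    col2_sum += int.from_bytes(cells[b'col2'][0].value, byteorder='big')
    col3_max = max(col3_max, int.from_bytes(cells[b'col3'][0].value, byteorder='big'))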

bartaelterman

You might be able to use pdhbase with google-cloud-happybase. If that doesn't work out of the box, you can perhaps take inspiration from it on how to perform the integration.

There is a Cloud Bigtable / BigQuery integration as well, which you might be able to integrate with https://github.com/pydata/pandas-gbq (thanks to Wes McKinney for that tip).
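
If you try the happybase route, a rough sketch of scanning a table into a DataFrame (assuming google-cloud-happybase's happybase-style Connection and scan() API and the big-endian integer encoding from the question; pdhbase may wrap similar logic):

import pandas as pd
from google.cloud import bigtable
from google.cloud import happybase

client = bigtable.Client(project='project', admin=True)
instance = client.instance('my-instance')
connection = happybase.Connection(instance=instance)
table = connection.table('mytable')

# scan() yields (row_key, {b'family:qualifier': value, ...}) pairs.
records = []
for row_key, data in table.scan():
    records.append({
        col.decode('utf-8').split(':', 1)[1]: int.from_bytes(value, byteorder='big')
        for col, value in data.items()
    })
connection.close()

df = pd.DataFrame.from_records(records)
print(df[['col1', 'col2']].sum(), df['col3'].max())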

Solomon Duskis

You can iterate through the BigTable rows, store the values in a dictionary, and then merge that into a DataFrame. Below is an example for fetching a single row.

import pandas as pd
from google.cloud import bigtable

client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_id)
row_key = "1234"
row = table.read_row(row_key)

dct = {}
print("Reading data for {}:".format(row.row_key.decode("utf-8")))
for cf, cols in sorted(row.cells.items()):
    print("Column Family {}".format(cf))
    for col, cells in sorted(cols.items()):
        # cells are ordered newest first; keep the most recent value per column
        dct[col.decode("utf-8")] = cells[0].value.decode("utf-8")

df = pd.DataFrame([dct])
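
A possible extension of the same idea to a range of rows, with one dict per row (this assumes a client version where read_rows() can be iterated directly and that cell values decode as UTF-8, as in the single-row example above):

records = []
for row in table.read_rows():  # whole table, or pass start/end row keys
    rec = {"row_key": row.row_key.decode("utf-8")}
    for cf, cols in sorted(row.cells.items()):
        for col, cells in sorted(cols.items()):
            rec[col.decode("utf-8")] = cells[0].value.decode("utf-8")
    records.append(rec)

df = pd.DataFrame(records)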
RDP