
I am developing an IoT data pipeline using Python and Bigtable, and writes are desperately slow.

I have tried both Python client libraries offered by Google. The native API implements a Row class with a commit method. Committing rows iteratively in that way from my local development machine, I get roughly 15 writes / 70 KB per second against a production instance with 3 nodes. Granted, the writes are hitting a single node because of the way my test data is batched, and the data is being uploaded from a local network. Still, Google advertises 10,000 writes per second per node and the upload speed from my machine is 30 MB/s, so clearly the gap lies elsewhere.
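
For concreteness, the per-row commit pattern I'm describing looks roughly like this (a simplified sketch; the project, instance, table and column family names are placeholders, and the readings list stands in for my real sensor payloads):

```python
# Simplified sketch of the iterative per-row commit approach.
# "my-project", "my-instance", "iot-data" and "cf1" are placeholder names.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("iot-data")

readings = [b"23.5", b"23.7", b"24.1"]  # stand-in for the real sensor payloads

for i, payload in enumerate(readings):
    row = table.row("sensor#{:08d}".format(i))
    row.set_cell("cf1", "value", payload)
    row.commit()  # one RPC round trip per row
```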

I subsequently tried the happybase API with high hopes, since its interface provides a Batch class for inserting data. However, after disappointingly hitting the same performance ceiling, I realized that the happybase API is nothing more than a wrapper around the native API, and that its Batch class simply commits rows iteratively, in much the same way as my original implementation.
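
The happybase version looks roughly like this (a sketch assuming the google-cloud-happybase package; names are again placeholders):

```python
# Rough sketch of the happybase Batch usage (placeholder names throughout).
from google.cloud import bigtable
from google.cloud import happybase

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
connection = happybase.Connection(instance=instance)
table = connection.table("iot-data")

readings = [b"23.5", b"23.7", b"24.1"]  # stand-in for the real sensor payloads

batch = table.batch()
for i, payload in enumerate(readings):
    batch.put("sensor#{:08d}".format(i), {b"cf1:value": payload})
batch.send()  # under the hood, rows are still committed one at a time
```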

What am I missing?

JD Margulici
  • You're really not missing anything. There is work underway to support Cloud Bigtable's bulk mutation API in the Python client: https://github.com/GoogleCloudPlatform/google-cloud-python/issues/2411. The other advice I can give you is to do as much work in parallel as possible. Multiple threads/processes will let you scale linearly for quite a while given the performance you're seeing so far. – Gary Elliott Apr 06 '17 at 20:42
  • @GaryElliott thank you for the reassurance and guidance! I've implemented a thread pool and I do get linear improvements, but it tapers off at ~15 threads, yielding 200 writes/second. Beyond that there is no improvement. Is that what you would expect, and if so, why is there still such a gap from the purported performance? – JD Margulici Apr 09 '17 at 18:02
  • No, that load wouldn't tax Bigtable much; I would start to suspect a client-side/application bottleneck at this point (locking?), unless all your writes are going to the same row, in which case they would be serialized in Bigtable. – Gary Elliott Apr 11 '17 at 19:01
  • Are you directly writing to Bigtable or using something like [OpenTSDB](http://opentsdb.net/) to abstract time series for you? If the former, please read the [time series schema design docs](https://cloud.google.com/bigtable/docs/schema-design-time-series) and post more information about your schema; you could be using a row key schema which puts all writes on a single node. Also, if you are doing a large batch load into an empty table, you should pre-split your table for faster performance; Bigtable will take over from there, but you need initial splits to distribute key ranges. – Misha Brukman Apr 30 '17 at 18:51

1 Answer


I know I'm late to this question, but for anyone else who comes across this: the Google Cloud client library for Python now allows bulk writes with mutations_batcher. See the client library documentation for details.

You can use batcher.mutate_rows and then batcher.flush to send all of the rows to be updated in batched network calls, avoiding the iterative per-row commits.
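
A minimal sketch of that pattern (the project, instance, table and column family names are placeholders, and flush_count is an arbitrary choice):

```python
# Minimal sketch of bulk writes with mutations_batcher.
# "my-project", "my-instance", "iot-data" and "cf1" are placeholder names.
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("iot-data")

# The batcher queues row mutations and sends them in bulk MutateRows requests.
batcher = table.mutations_batcher(flush_count=1000)

rows = []
for i in range(10000):
    row = table.direct_row("sensor#{:08d}".format(i))
    row.set_cell(
        "cf1",
        "reading",
        str(i).encode("utf-8"),
        timestamp=datetime.datetime.utcnow(),
    )
    rows.append(row)

batcher.mutate_rows(rows)  # queued and flushed in batches of flush_count rows
batcher.flush()            # send any mutations still waiting in the queue
```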

rohanphadte
  • I have to write around 50 million rows with 50-odd columns. It is taking around 5 minutes for 10K rows. How can I speed up this process? Please help me out, I have just started my programming career. – SK Singh Nov 30 '20 at 06:30
  • Bigtable should handle about 10K rows/s if each row has ~1 KB of data... the Google docs provide a list of possible causes of slower performance here: https://cloud.google.com/bigtable/docs/performance#slower-perf – rohanphadte Nov 30 '20 at 06:39