
I am developing an IoT data pipeline using Python and Bigtable, and writes are desperately slow.

I have tried both Python client libraries offered by Google. The native API implements a Row class with a commit method. Committing rows iteratively in that way from my local development machine, I get roughly 15 writes / 70 KB per second against a production instance with 3 nodes. Granted, the writes are hitting a single node because of the way my test data is batched, and the data is being uploaded from a local network. Still, Google advertises 10,000 writes per second per node and the upload speed from my machine is 30 MB/s, so clearly the gap lies elsewhere.
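
For concreteness, the per-row commit pattern I'm describing looks roughly like this (a simplified sketch; the project, instance, table and column family names are placeholders, and the readings list stands in for my real sensor payloads):

```python
# Simplified sketch of the iterative per-row commit approach.
# "my-project", "my-instance", "iot-data" and "cf1" are placeholder names.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("iot-data")

readings = [b"23.5", b"23.7", b"24.1"]  # stand-in for the real sensor payloads

for i, payload in enumerate(readings):
    row = table.row("sensor#{:08d}".format(i))
    row.set_cell("cf1", "value", payload)
    row.commit()  # one RPC round trip per row
```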

I subsequently tried the happybase API with high hopes, since its interface provides a Batch class for inserting data. However, after disappointingly hitting the same performance ceiling, I realized that the happybase API is nothing more than a wrapper around the native API, and that its Batch class simply commits rows iteratively, in much the same way as my original implementation.
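
The happybase version looks roughly like this (a sketch assuming the google-cloud-happybase package; names are again placeholders):

```python
# Rough sketch of the happybase Batch usage (placeholder names throughout).
from google.cloud import bigtable
from google.cloud import happybase

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
connection = happybase.Connection(instance=instance)
table = connection.table("iot-data")

readings = [b"23.5", b"23.7", b"24.1"]  # stand-in for the real sensor payloads

batch = table.batch()
for i, payload in enumerate(readings):
    batch.put("sensor#{:08d}".format(i), {b"cf1:value": payload})
batch.send()  # under the hood, rows are still committed one at a time
```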

What am I missing?

JD Margulici
  • You're really not missing anything. There is work underway to support Cloud Bigtable's bulk mutation API in the Python client: https://github.com/GoogleCloudPlatform/google-cloud-python/issues/2411. The other advice I can give you is to do as much work in parallel as possible. Multiple threads/processes will let you scale linearly for quite a while given the performance you're seeing so far. – Gary Elliott Apr 06 '17 at 20:42
  • @GaryElliott thank you for the reassurance and guidance! I've implemented a thread pool and I do get linear improvements, but it tapers off at ~15 threads, yielding 200 writes/second. Beyond that there is no improvement. Is that what you would expect, and if so, why is there still such a gap from the purported performance? – JD Margulici Apr 09 '17 at 18:02
  • No, that load wouldn't tax Bigtable much; I would start to suspect a client-side/application bottleneck at this point (locking?), unless all your writes are going to the same row, in which case they would be serialized in Bigtable. – Gary Elliott Apr 11 '17 at 19:01
  • Are you directly writing to Bigtable or using something like [OpenTSDB](http://opentsdb.net/) to abstract time series for you? If the former, please read the [time series schema design docs](https://cloud.google.com/bigtable/docs/schema-design-time-series) and post more information about your schema; you could be using a row key schema which puts all writes on a single node. Also, if you are doing a large batch load into an empty table, you should pre-split your table for faster performance; Bigtable will take over from there, but you need initial splits to distribute key ranges. – Misha Brukman Apr 30 '17 at 18:51

1 Answer


I know I'm late to this question, but for anyone else who comes across this: the Google Cloud client library for Python now allows bulk writes with mutations_batcher. See the client library documentation for details.

You can use batcher.mutate_rows and then batcher.flush to send all of the rows to be updated in batched network calls, avoiding the iterative per-row commits.
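
A minimal sketch of that pattern (the project, instance, table and column family names are placeholders, and flush_count is an arbitrary choice):

```python
# Minimal sketch of bulk writes with mutations_batcher.
# "my-project", "my-instance", "iot-data" and "cf1" are placeholder names.
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("iot-data")

# The batcher queues row mutations and sends them in bulk MutateRows requests.
batcher = table.mutations_batcher(flush_count=1000)

rows = []
for i in range(10000):
    row = table.direct_row("sensor#{:08d}".format(i))
    row.set_cell(
        "cf1",
        "reading",
        str(i).encode("utf-8"),
        timestamp=datetime.datetime.utcnow(),
    )
    rows.append(row)

batcher.mutate_rows(rows)  # queued and flushed in batches of flush_count rows
batcher.flush()            # send any mutations still waiting in the queue
```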

rohanphadte
  • I have to write around 50 million rows with 50-odd columns. It is taking around 5 minutes for 10K rows. How can I speed up this process? Please help me out, I have just started my programming career. – SK Singh Nov 30 '20 at 06:30
  • Bigtable should handle about 10K rows/s if each row has ~1 KB of data... the Google docs provide a list of possible causes of slower performance here: https://cloud.google.com/bigtable/docs/performance#slower-perf – rohanphadte Nov 30 '20 at 06:39