
I have a large table (external to BigQuery, as the data sits in Google Cloud Storage). I want to scan the table with the BigQuery Storage API and pull the rows down to a client machine. For throughput, I fetch multiple streams concurrently, in multiple threads.

As far as I can tell, the concurrency is not working. There's actually some penalty when using multiple threads.


import concurrent.futures
import logging
import queue
import threading
import time

from google.cloud.bigquery_storage import types
from google.cloud import bigquery_storage

PROJECT_ID = 'abc'
CREDENTIALS = {....}


def main():
    table = "projects/{}/datasets/{}/tables/{}".format(PROJECT_ID, 'db', 'tb')

    requested_session = types.ReadSession()
    requested_session.table = table
    requested_session.data_format = types.DataFormat.AVRO
    requested_session.read_options.selected_fields = ["a", "b"]

    client = bigquery_storage.BigQueryReadClient(credentials=CREDENTIALS)
    session = client.create_read_session(
        parent="projects/{}".format(PROJECT_ID),
        read_session=requested_session,
        max_stream_count=0,
    )

    if not session.streams:
        return

    n_streams = len(session.streams)
    print("Total streams", n_streams)  # this prints 1000

    q_out = queue.Queue(1024)
    concurrency = 4

    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        tasks = [
            pool.submit(download_row,
                        client._transport.__class__,
                        client._transport._grpc_channel,
                        s.name,
                        q_out)
            for s in session.streams
        ]

        t0 = time.perf_counter()
        ntotal = 0
        ndone = 0
        while True:
            page = q_out.get()
            if page is None:
                ndone += 1
                if ndone == len(tasks):
                    break
            else:
                for row in page:
                    ntotal += 1
                    if ntotal % 10000 == 0:
                        qps = int(ntotal / (time.perf_counter() - t0))
                        print(f'QPS so far:  {qps}')

        for t in tasks:
            t.result()


def download_row(transport_cls, channel, stream_name, q_out):
    try:
        transport = transport_cls(channel=channel)
        client = bigquery_storage.BigQueryReadClient(
            transport=transport,
            )
        reader = client.read_rows(stream_name)
        for page in reader.rows().pages:
            q_out.put(page)
    finally:
        q_out.put(None)  # sentinel: this stream is done


if __name__ == '__main__':
    main()

The Google BigQuery Storage API docs and multiple sources claim one can fetch multiple "streams" concurrently for higher throughput, yet I couldn't find a functional example. I've followed the advice to share a gRPC "channel" across the threads.

The data items are large. The QPS I got is roughly:

150 with concurrency=1
120 with concurrency=2
140 with concurrency=4

Each "page" contains about 200 rows.
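As an aside, the QPS loop above iterates every row in Python, which is itself slow and holds the GIL; counting per page keeps the measurement loop cheap. A minimal sketch, assuming each page exposes a `row_count`-style size attribute (the exact attribute name depends on the client version):

```python
import time

def measure_throughput(pages, report_every=10_000):
    # Count rows per page instead of iterating rows one by one;
    # `page.row_count` stands in for whatever per-page size
    # attribute the library exposes (assumed name).
    t0 = time.perf_counter()
    ntotal = 0
    last_report = 0
    for page in pages:
        ntotal += page.row_count
        if ntotal - last_report >= report_every:
            last_report = ntotal
            qps = int(ntotal / (time.perf_counter() - t0))
            print(f"QPS so far: {qps}")
    return ntotal
```

This keeps row parsing (and its GIL contention) out of the measurement entirely.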

Thoughts:

  1. BigQuery quota? I only saw a request-rate limit and no limit on the volume of data traffic per second. The quotas do not appear to be limiting in my case.

  2. BigQuery server-side options? Doesn't seem relevant. BigQuery should have enough capacity to accept concurrent requests.

  3. gRPC usage? I think this is the main direction to dig into, but I don't know what's wrong in my code.

Can anyone shed some light on this? Thanks.

Veysel Olgun
zpz
  • Do your CPU or network utilization metrics change with concurrency changes? Is either of them maxed out? One reason to consider multiprocessing is to open up more channels to the server. Last time I [experimented with this more channels helped](https://github.com/googleapis/python-bigquery-storage/pull/47). I would guess your primary problem is the GIL/Python slowness as noted below. The part that holds the GIL and is generally slow is the 'for row in page' loop. It would be better to measure throughput via page size, or at least bulk-convert to Arrow and measure size that way – Micah Kornfield Aug 18 '22 at 07:43
  • I did use spawned mp with multiple channels, and async as well, and got similar results. I also skipped unpacking page to rows, and got slightly different results. – zpz Aug 18 '22 at 13:45
  • Network maxing out is indeed something I should check. I did not, simply because I'm not familiar with that. Also, I wasn't eager to check it because another application got much higher throughput (not totally comparable, so I should check further as well). – zpz Aug 18 '22 at 13:52
  • I also experimented with fetching a much smaller field, and got a much higher row throughput. Concurrency didn't help there either; it hurt. The puzzle is why concurrency doesn't help. A natural hypothesis is that concurrency does not happen. It's hard to believe that this single-threaded, sequential grpc use maxes out my network capacity and that concurrency therefore does not help. – zpz Aug 18 '22 at 14:00
  • I think sequential fetch can't max out network capacity, almost "by definition". – zpz Aug 18 '22 at 14:16
  • What is your link capacity? I made changes to your script: in the download function the loop is "for page in reader", and I use a new client per download (this still implicitly shares a channel). In the main loop, change the increment of ntotal to "ntotal += page.row_count". When I run against the Wikipedia public dataset: "projects/{}/datasets/{}/tables/{}".format('bigquery-public-data', 'samples', 'wikipedia'), I get 60 MB/s for a single stream; for two streams it more or less doubles to 120 MB/s on a GCP e2-medium host. At 120 MB/s you are pretty much maxing out a gigabit link. – Micah Kornfield Aug 18 '22 at 16:40
  • `for page in reader` seems to skip row parsing, so the user needs to do their own Avro parsing. Your change to the `ntotal` increment is what I meant by "skip unpacking the page", which I tried. What's your exact code for "use a new client"? I guess it's one of the ways I tried: passing in a transport/channel to construct the client; passing in the client directly (sharing even the stub; not sure about its impact); constructing the client from scratch (I believe this does not share the channel). – zpz Aug 19 '22 at 03:58
  • It does skip row parsing. Row parsing will cause GIL contention, and Avro parsing in Python is typically slower than the service can ship data to the client (Arrow is much faster, but converting to rows is slow). There is a channel cache (that is what the linked PR tries to work around), so new clients are a red herring; I shouldn't have mentioned it – Micah Kornfield Aug 19 '22 at 15:38
  • A short summary: there are a few bottlenecks you could be hitting here, and only seeing metrics on your box can determine which: 1. network saturation; 2. channel saturation; 3. GIL contention; 4. CPU saturation. If you aren't seeing improved throughput on a 1 Gbps link when moving from 1 thread to 2 threads, it is likely the GIL causing you issues. Past 2 threads it is more likely one of the first two – Micah Kornfield Aug 19 '22 at 15:45
  • Previous tests were on a laptop from home, behind a VPN. Now I logged into a cloud VM from the same laptop. It definitely has better network bandwidth. Some behavior has changed; will report details later. Summary: 1. Recreating `client` from scratch in threads brings about concurrency, whereas explicitly sharing the "channel" disables it. I don't know the grpc trickery here. 2. Page parsing is significant, but not the dominant overhead – zpz Aug 20 '22 at 17:26
  • I'm trying to test multiprocessing to see how it interacts with the page parse overhead. I "spawn" processes and create "client" in each process using credentials---no apparent sharing of transport or channel. However it is stuck with no feedback. I remember reading about some "fork" hack about grpc but I suppose spawned processes are completely clean slates. Must be wrong. This kind of lib behavior is annoying. – zpz Aug 22 '22 at 21:44
  • https://github.com/grpc/grpc/blob/master/doc/fork_support.md might be helpful – Micah Kornfield Aug 23 '22 at 03:37
  • Yeah, this thing is... not pretty. I did not go down that route. I figured out that if you close up the grpc machinery cleanly in the parent process before creating child processes, it can work; in this case, I use the context manager of `client` to do that. Then somehow `concurrent.futures.ProcessPoolExecutor` does not work, while raw mp processes do. Within one process, across threads, just create `client` from scratch every time. Don't explicitly share the channel and such; gRPC will handle the sharing behind your back, it seems. – zpz Aug 23 '22 at 16:53
  • I think I got the hang of concurrency now. With threading, queue.Queue, and skipping page parsing (just get the page size and fake the rows), I get 10000+ rows per second. But if I parse the page into rows, it brings QPS down to maybe 2000. I tried to use mp to separate out some computations, but the overhead of mp, the mp queue, and Avro parsing is so high that it's still not usable for me yet. – zpz Aug 23 '22 at 17:57
  • Depending on what your final goal is using Arrow as a serialization format can help alot. Converting arrow to python rows is still pretty slow, it has been something that I wanted to improve on. – Micah Kornfield Aug 23 '22 at 20:58
  • I will look into Arrow. I believe eventually I need rows; I'm not using the row batches in a columnar way. I don't know Arrow, hence can't say whether it will meet my need without being turned into rows. I wonder whether Arrow can get the page across process boundaries with zero copy and no queue (hence no pickling). I'm thinking of fetching the data in one process with as little other work as possible, and doing parsing and other conversions in other processes. Currently, putting the page in a queue in the data-fetching process seems to interfere with fetching, besides its own large overhead. – zpz Aug 24 '22 at 02:30
  • It might be best to summarize findings for this question and maybe open up a more detailed YAQs on the best way to use multiprocessing to achieve your goal. I think the original question on throughput has been answered at this point? It might be possible to use shared memory to reduce pickle overhead, but it isn't clear without seeing code that that is the issue. – Micah Kornfield Aug 24 '22 at 06:06

1 Answer


Python threads do not run in parallel because of the GIL.

You are creating threads, not processes; because of the GIL, only one thread executes Python bytecode at a time, so CPU-bound work is effectively single-core.

ThreadPoolExecutor has been available since Python 3.2, yet it is not widely used, perhaps because of misunderstandings about the capabilities and limitations of threads in Python, which are constrained by the Global Interpreter Lock ("GIL"). More

Look into using the multiprocessing module; a good read is here.
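A minimal sketch of the fan-out, with the stream read stubbed out so the process topology runs anywhere. A real worker would construct its own BigQueryReadClient inside the worker function and drain its stream there (gRPC channels do not survive fork, so either create the client only after the processes start or use the "spawn" start method):

```python
from multiprocessing import get_context

def read_one_stream(stream_id):
    # Stand-in for the real work: a real worker would build its own
    # BigQueryReadClient *inside* this function (never before forking)
    # and drain stream `stream_id`, returning its row count.
    # Here we just return a deterministic fake count.
    return 200 * (stream_id + 1)

def fan_out(stream_ids, processes=4):
    # One OS process per worker; each process has its own GIL,
    # so row parsing can run in parallel.
    with get_context("fork").Pool(processes) as pool:
        return pool.map(read_one_stream, stream_ids)
```

With gRPC in the picture, "spawn" is the safer start method, at the cost of re-importing the module in each child.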

UPDATE:

Also, your code needs one more parameter: requested_streams

n_streams = 2
session = client.create_read_session(
    table_ref,
    parent,
    requested_streams=n_streams,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)
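Note the snippet above is the older v1beta1 surface. On the v1 client used in the question, the closest knob appears to be max_stream_count on create_read_session (a sketch of the call shape, not verified against every library version):

```python
# Ask the server for at most 2 streams (0 lets the server decide).
session = client.create_read_session(
    parent="projects/{}".format(PROJECT_ID),
    read_session=requested_session,
    max_stream_count=2,
)
```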
Pentium10
  • Thanks for looking into my question, but this is not the problem. You need to read up on Python concurrency: threading, multiprocessing, the GIL, I/O-bound vs CPU-bound. `requested_streams` is a parameter in older versions. My post says it returned 1000 streams. – zpz Aug 17 '22 at 22:54
  • requested_streams acts as a maximum number of streams; if it is unset, the API will select a number that makes sense automatically based on table size. – Micah Kornfield Aug 18 '22 at 15:48