
I am using the Python BigQuery API to list the table content and then process the JSON response.

  • 100,000 records with the main thread take approximately 30 seconds
  • 10,000 records with the main thread take approximately 4 seconds
  • 100,000 records with 10 threads (or 5) take approximately 20 seconds

Other useful information:

For 100,000 records with the main thread:

  • Fetching the result (REST calls) - 25 seconds
  • Parsing the result - 5 seconds
  • Writing the result - 2 seconds

Shouldn't it take less time?

Would someone please let me know what is causing this performance lag?
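
The breakdown above was measured roughly along the lines of the sketch below; the explicit fetch/parse/write split, the `timing_check.json` file name, and the `dict(row.items())` / `default=str` shortcut are illustrative rather than the exact code, which follows after it.

import json
import time

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table(
    client.dataset('hacker_news', project='bigquery-public-data').table('comments'))

t0 = time.time()
rows = list(client.list_rows(table, max_results=100000))  # fetch (REST calls)
t1 = time.time()
parsed = [dict(row.items()) for row in rows]              # parse
t2 = time.time()
with open("timing_check.json", "w") as out:
    for row in parsed:
        out.write(json.dumps(row, default=str))           # write (default=str stringifies datetimes)
t3 = time.time()

print "fetch %.1fs, parse %.1fs, write %.1fs" % (t1 - t0, t2 - t1, t3 - t2)

The full code: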

import datetime
import json
import threading
import time

from google.cloud import bigquery

client = bigquery.Client()


def _get_values(val):
    # json.dumps cannot serialize datetime objects, so stringify them
    if isinstance(val, datetime.datetime):
        return str(val)
    else:
        return val

def map_schema(row):
    row_dict = {}
    values = row.values()
    # private mapping of field name -> index in values()
    field_to_index = row._xxx_field_to_index
    for field, index in field_to_index.iteritems():
        row_dict[str(field)] = _get_values(values[index])
    return row_dict

def write_json(file, row):
    file.write(json.dumps(row))


def _save_rows(table, start_index, max_row, file):
    # each call gets its own row iterator (paged REST requests under the hood)
    rows = client.list_rows(table, max_results=max_row, start_index=start_index)
    for row in rows:
        processedRow = map_schema(row)
        write_json(file, processedRow)

def run():
    threads = []
    dataset_ref = client.dataset('hacker_news', project='bigquery-public-data')
    table_ref = dataset_ref.table('comments')
    table = client.get_table(table_ref)  # API call
    start = time.time()
    output_file = open("temp_t.json", "a")

    total_rows = 100000
    total_threads = 10
    max_row = total_rows // total_threads  # integer division so range() gets an int

    # 10 threads takes ~ 20 seconds
    # 5 threads takes the same ~ 20 seconds
    files = []
    for index in range(0, total_rows, max_row):
        file_name = "%s.json" % index
        files.append(open(file_name, "a"))
        # give each thread its own output file and its own row range
        threads.append(threading.Thread(target=_save_rows, args=(table, index, max_row, files[-1])))

    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    for file in files:
        file.close()


    # takes ~ 30 seconds
    # _save_rows(table, 0, 100000, output_file)

    # takes ~ 4 seconds
    # _save_rows(table, 0, 10000, output_file)

    output_file.close()
    print "total time = %f" % (time.time() - start)

run()
Gaurang Shah
  • There are lots of things that could be causing this, but read up on the GIL before looking at anything else: https://wiki.python.org/moin/GlobalInterpreterLock – cdarke Jun 27 '18 at 12:50
  • @cdarke thanks, the page says `Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL`, so writing to the file should not be the bottleneck. Could you be more specific about which part could be an issue? – Gaurang Shah Jun 27 '18 at 12:56
  • Anything that executes Python byte code will have to grab the GIL. Of course that is not the only possible bottleneck; disk queueing can also be an issue - anything of which there is only "one". Try fewer threads, you might get better results. I have had situations where a single thread is faster than even two because of the locking overhead, but it depends on the detail of exactly what the code is doing. – cdarke Jun 27 '18 at 13:06
  • `client.list_rows` is going to send a REST call to get rows from the Google BigQuery table; the rest is writing to disk. – Gaurang Shah Jun 27 '18 at 13:14
  • Did you measure single-thread execution for 10K records? As well as a detailed split between the actual read and the write to file? – Alexey Maloletkin Jun 27 '18 at 13:46
  • @AlexeyMaloletkin for `10000` rows with a `single thread` it takes `448 ms` to parse the result, `125 ms` to write the result to JSON and `4488 ms` to fetch the result from BigQuery. Total time is `5068 ms`. – Gaurang Shah Jun 27 '18 at 14:15
  • Interesting - what I noticed is that you write results to the same file, so a file system lock is potentially in play. You may try to experiment a little here - option 1: run this code with the file write commented out and see the timing; option 2: write to different files and combine them into one as a final step; option 3: get the results into memory from all threads and dump them at the end when all threads have completed – Alexey Maloletkin Jun 27 '18 at 15:02
  • @AlexeyMaloletkin I tried creating a separate file for each thread, however that's not helping either. I could not store the result in memory as it's really big. – Gaurang Shah Jun 27 '18 at 15:10
  • Hm - could you try to change this code: `for row in rows: processedRow = map_schema(row) write_json(file, processedRow)` and write to the file not on every record but every, say, 100 or 1000 records - I still think the issue is not with BQ but with the file writing – Alexey Maloletkin Jun 27 '18 at 16:23
  • @AlexeyMaloletkin thanks for the suggestion. I checked with a buffer of `100 and 1000` rows; however, there is only a `2 sec` improvement, both in the main thread and with 10 threads. – Gaurang Shah Jun 27 '18 at 16:52
  • @AlexeyMaloletkin most of the time is taken iterating over the rows. However, the `rows` iterator is separate per thread, so I am not able to understand why it takes so much time with 10 threads. – Gaurang Shah Jun 27 '18 at 16:55
  • @AlexeyMaloletkin is this a Python thing? If so, I could give it a try in Java. – Gaurang Shah Jun 28 '18 at 19:42
  • I tried to replicate your code but it seems it's not complete; it's missing `_get_values`. Is there any chance to get the full code? I found in a BQ book (Google BigQuery Analytics) and its [repository](https://code.google.com/archive/p/bigquery-e2e/source/default/source) (ch12/table_reader.py) a piece of code that may be of help. Either way, I have the feeling that the sample size (100k) is not enough to see a huge difference (30s seems quite low), and maybe the threads take more time communicating than actually executing the code. Can you try it with a bigger sample? – Iñigo Jul 17 '18 at 11:25
  • @Iñigo I have updated the question. – Gaurang Shah Jul 17 '18 at 14:35
  • I've run some tests with your code using different computer settings and numbers of rows/threads (up to 8M) and they seem to confirm what I thought: if you increase the number of rows, the performance gets better (percentage-wise). You can see a [table with my tests](https://i.stack.imgur.com/eP1UR.png). In any case, I recommend you take a look at the code found in the book I told you about; I would say it's better optimised. – Iñigo Jul 18 '18 at 15:30
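
Following up on the GIL comments above: a multiprocessing variant sidesteps the interpreter lock by giving each chunk its own process and its own client. This is only a sketch under those assumptions - the chunk file names, worker count, and the `dict(row.items())` / `default=str` shortcut for schema mapping are illustrative, and it has not been benchmarked here.

import json
import multiprocessing

from google.cloud import bigquery


def save_chunk(args):
    start_index, max_rows = args
    client = bigquery.Client()  # one client per process
    table = client.get_table(
        client.dataset('hacker_news', project='bigquery-public-data').table('comments'))
    rows = client.list_rows(table, max_results=max_rows, start_index=start_index)
    with open("chunk_%d.json" % start_index, "w") as out:
        for row in rows:
            # default=str stringifies datetime values, like _get_values above
            out.write(json.dumps(dict(row.items()), default=str))


if __name__ == "__main__":
    total_rows, workers = 100000, 5
    chunk = total_rows // workers
    pool = multiprocessing.Pool(workers)
    pool.map(save_chunk, [(i, chunk) for i in range(0, total_rows, chunk)])
    pool.close()
    pool.join()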

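Likewise, the buffered-write experiment suggested in the comments (flushing every N rows instead of every row) can look roughly like the sketch below. The batch size of 1000 is an arbitrary example, and the `page_size` argument, which controls how many rows each underlying REST request returns, assumes a client library version that supports it.

import json


def save_rows_buffered(client, table, start_index, max_rows, path, batch=1000):
    # page_size controls how many rows each underlying REST request returns
    rows = client.list_rows(table, max_results=max_rows,
                            start_index=start_index, page_size=batch)
    buf = []
    with open(path, "w") as out:
        for row in rows:
            buf.append(json.dumps(dict(row.items()), default=str))
            if len(buf) >= batch:  # flush every `batch` rows instead of every row
                out.write("\n".join(buf) + "\n")
                buf = []
        if buf:
            out.write("\n".join(buf) + "\n")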