
I am using the Python BigQuery API to list the table content and then process the JSON response.

  • 100,000 records with the main thread take approximately 30 seconds
  • 10,000 records with the main thread take approximately 4 seconds
  • 100,000 records with 10 threads (or 5) take approximately 20 seconds

Other useful information:

For 100,000 records with the main thread:

  • Fetching the result (REST calls) - 25 seconds
  • Parsing the result - 5 seconds
  • Writing the result - 2 seconds

Shouldn't it take less time?

Would someone please let me know what is causing this performance lag?
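
The breakdown above was measured roughly along the lines of the sketch below; the explicit fetch/parse/write split, the `timing_check.json` file name, and the `dict(row.items())` / `default=str` shortcut are illustrative rather than the exact code, which follows after it.

import json
import time

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table(
    client.dataset('hacker_news', project='bigquery-public-data').table('comments'))

t0 = time.time()
rows = list(client.list_rows(table, max_results=100000))  # fetch (REST calls)
t1 = time.time()
parsed = [dict(row.items()) for row in rows]              # parse
t2 = time.time()
with open("timing_check.json", "w") as out:
    for row in parsed:
        out.write(json.dumps(row, default=str))           # write (default=str stringifies datetimes)
t3 = time.time()

print "fetch %.1fs, parse %.1fs, write %.1fs" % (t1 - t0, t2 - t1, t3 - t2)

The full code: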

import datetime
import json
import threading
import time

from google.cloud import bigquery

client = bigquery.Client()


def _get_values(val):
    # json.dumps cannot serialize datetime objects, so stringify them
    if isinstance(val, datetime.datetime):
        return str(val)
    else:
        return val

def map_schema(row):
    row_dict = {}
    values = row.values()
    # private mapping of field name -> index in values()
    field_to_index = row._xxx_field_to_index
    for field, index in field_to_index.iteritems():
        row_dict[str(field)] = _get_values(values[index])
    return row_dict

def write_json(file, row):
    file.write(json.dumps(row))


def _save_rows(table, start_index, max_row, file):
    # each call gets its own row iterator (paged REST requests under the hood)
    rows = client.list_rows(table, max_results=max_row, start_index=start_index)
    for row in rows:
        processedRow = map_schema(row)
        write_json(file, processedRow)

def run():
    threads = []
    dataset_ref = client.dataset('hacker_news', project='bigquery-public-data')
    table_ref = dataset_ref.table('comments')
    table = client.get_table(table_ref)  # API call
    start = time.time()
    output_file = open("temp_t.json", "a")

    total_rows = 100000
    total_threads = 10
    max_row = total_rows // total_threads  # integer division so range() gets an int

    # 10 threads takes ~ 20 seconds
    # 5 threads takes the same ~ 20 seconds
    files = []
    for index in range(0, total_rows, max_row):
        file_name = "%s.json" % index
        files.append(open(file_name, "a"))
        # give each thread its own output file and its own row range
        threads.append(threading.Thread(target=_save_rows, args=(table, index, max_row, files[-1])))

    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    for file in files:
        file.close()


    # takes ~ 30 seconds
    # _save_rows(table, 0, 100000, output_file)

    # takes ~ 4 seconds
    # _save_rows(table, 0, 10000, output_file)

    output_file.close()
    print "total time = %f" % (time.time() - start)

run()
Gaurang Shah
  • There are lots of things that could be causing this, but read up on the GIL before looking at anything else: https://wiki.python.org/moin/GlobalInterpreterLock – cdarke Jun 27 '18 at 12:50
  • @cdarke thanks, the page says `Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL`, so writing to the file should not be the bottleneck. Could you be more specific about which part could be an issue? – Gaurang Shah Jun 27 '18 at 12:56
  • Anything that executes Python byte code will have to grab the GIL. Of course that is not the only possible bottleneck; disk queueing can also be an issue - anything of which there is only "one". Try fewer threads, you might get better results. I have had situations where a single thread is faster than even two because of the locking overhead, but it depends on the detail of exactly what the code is doing. – cdarke Jun 27 '18 at 13:06
  • `client.list_rows` is going to send a REST call to get rows from the Google BigQuery table; the rest is writing to disk. – Gaurang Shah Jun 27 '18 at 13:14
  • Did you measure single-thread execution for 10K records? As well as a detailed split between the actual read and the write to file? – Alexey Maloletkin Jun 27 '18 at 13:46
  • @AlexeyMaloletkin for `10000` rows with a `single thread` it takes `448 ms` to parse the result, `125 ms` to write the result to JSON and `4488 ms` to fetch the result from BigQuery. Total time is `5068 ms`. – Gaurang Shah Jun 27 '18 at 14:15
  • Interesting - what I noticed is that you write results to the same file, so a file system lock is potentially in play. You may try to experiment a little here - option 1: run this code with the file write commented out and see the timing; option 2: write to different files and combine them into one as a final step; option 3: get the results into memory from all threads and dump them at the end when all threads have completed – Alexey Maloletkin Jun 27 '18 at 15:02
  • @AlexeyMaloletkin I tried creating a separate file for each thread, however that's not helping either. I could not store the result in memory as it's really big. – Gaurang Shah Jun 27 '18 at 15:10
  • Hm - could you try to change this code: `for row in rows: processedRow = map_schema(row) write_json(file, processedRow)` and write to the file not on every record but every, say, 100 or 1000 records - I still think the issue is not with BQ but with the file writing – Alexey Maloletkin Jun 27 '18 at 16:23
  • @AlexeyMaloletkin thanks for the suggestion. I checked with a buffer of `100 and 1000` rows; however, there is only a `2 sec` improvement, both in the main thread and with 10 threads. – Gaurang Shah Jun 27 '18 at 16:52
  • @AlexeyMaloletkin most of the time is taken iterating over the rows. However, the `rows` iterator is separate per thread, so I am not able to understand why it takes so much time with 10 threads. – Gaurang Shah Jun 27 '18 at 16:55
  • @AlexeyMaloletkin is this a Python thing? If so, I could give it a try in Java. – Gaurang Shah Jun 28 '18 at 19:42
  • I tried to replicate your code but it seems it's not complete; it's missing `_get_values`. Is there any chance to get the full code? I found in a BQ book (Google BigQuery Analytics) and its [repository](https://code.google.com/archive/p/bigquery-e2e/source/default/source) (ch12/table_reader.py) a piece of code that may be of help. Either way, I have the feeling that the sample size (100k) is not enough to see a huge difference (30s seems quite low), and maybe the threads take more time communicating than actually executing the code. Can you try it with a bigger sample? – Iñigo Jul 17 '18 at 11:25
  • @Iñigo I have updated the question. – Gaurang Shah Jul 17 '18 at 14:35
  • I've run some tests with your code using different computer settings and numbers of rows/threads (up to 8M) and they seem to confirm what I thought: if you increase the number of rows, the performance gets better (percentage-wise). You can see a [table with my tests](https://i.stack.imgur.com/eP1UR.png). In any case, I recommend you take a look at the code found in the book I told you about; I would say it's better optimised. – Iñigo Jul 18 '18 at 15:30
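
Following up on the GIL comments above: a multiprocessing variant sidesteps the interpreter lock by giving each chunk its own process and its own client. This is only a sketch under those assumptions - the chunk file names, worker count, and the `dict(row.items())` / `default=str` shortcut for schema mapping are illustrative, and it has not been benchmarked here.

import json
import multiprocessing

from google.cloud import bigquery


def save_chunk(args):
    start_index, max_rows = args
    client = bigquery.Client()  # one client per process
    table = client.get_table(
        client.dataset('hacker_news', project='bigquery-public-data').table('comments'))
    rows = client.list_rows(table, max_results=max_rows, start_index=start_index)
    with open("chunk_%d.json" % start_index, "w") as out:
        for row in rows:
            # default=str stringifies datetime values, like _get_values above
            out.write(json.dumps(dict(row.items()), default=str))


if __name__ == "__main__":
    total_rows, workers = 100000, 5
    chunk = total_rows // workers
    pool = multiprocessing.Pool(workers)
    pool.map(save_chunk, [(i, chunk) for i in range(0, total_rows, chunk)])
    pool.close()
    pool.join()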

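Likewise, the buffered-write experiment suggested in the comments (flushing every N rows instead of every row) can look roughly like the sketch below. The batch size of 1000 is an arbitrary example, and the `page_size` argument, which controls how many rows each underlying REST request returns, assumes a client library version that supports it.

import json


def save_rows_buffered(client, table, start_index, max_rows, path, batch=1000):
    # page_size controls how many rows each underlying REST request returns
    rows = client.list_rows(table, max_results=max_rows,
                            start_index=start_index, page_size=batch)
    buf = []
    with open(path, "w") as out:
        for row in rows:
            buf.append(json.dumps(dict(row.items()), default=str))
            if len(buf) >= batch:  # flush every `batch` rows instead of every row
                out.write("\n".join(buf) + "\n")
                buf = []
        if buf:
            out.write("\n".join(buf) + "\n")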