
I am a newbie. I just ran a query on BigQuery that returns ~1 million rows across 25 columns. The result has the type RowIterator.

I wrote a Python script to loop over the rows and process the data. I used:

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query(query)
rows = query_job.result()                   # ~1 million records
df = rows.to_dataframe()                    # (*)
dict_rows = df.to_dict(orient="records")
for row in dict_rows:
    # process data

(*) This step alone takes around 5-6 minutes, which is too slow for me.

Any suggestions on how I can process it faster? Thanks.

D9SeveN

Looping in Python is really slow. Can you give more detail on what you mean by processing your data? There may be several ways, but if you have a large dataset and want to loop in Python, the important thing is to pre-process the data smartly or restructure it for faster execution (not necessarily in every case). Use comprehensions where you can, as they are faster. – Sam Aug 21 '23 at 04:01

You created a DataFrame, which is optimized for bulk operations. For instance, if you had a column named "foo" and a strong desire to add `1` to each value, it would be `df["foo"] + 1`. No need to create records or iterate. The comment above is a good one. What would you like to do? – tdelaney Aug 21 '23 at 04:25
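
A minimal sketch of the vectorized approach tdelaney's comment describes, using a toy DataFrame (the column name "foo" is just the comment's example):

import pandas as pd

# toy stand-in for the DataFrame returned by rows.to_dataframe()
df = pd.DataFrame({"foo": [1, 2, 3]})
df["foo"] = df["foo"] + 1    # one whole-column operation, no Python-level loop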

1 Answer


How you process each result row is relevant, but without those details, not loading into a DataFrame and converting to a dictionary is a start, if the raw rows can be processed directly from the iterator.

for row in query_job.result():
    # process data
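
Each yielded row is a `bigquery.table.Row`, which can be read by column name, attribute, or index. A minimal sketch, assuming the result has a numeric column named `amount` (a hypothetical name):

total = 0
for row in query_job.result():
    total += row["amount"]    # row.amount and row[0] also work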

Or with a generator expression or a generator function:

# `process` is a placeholder for whatever per-row work you need
generator_expression_result = (process(row) for row in query_job.result())

def process_data(data):
    for row in data:
        modified = process(row)    # placeholder per-row transformation
        yield modified

generator_function_result = process_data(query_job.result())
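
Either way, the result is lazy: rows stream from BigQuery page by page as they are consumed, so the full million-row result never has to sit in memory at once. A minimal usage sketch:

count = 0
for item in generator_function_result:
    count += 1    # each processed row is consumed here, one at a time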