I have some category / group label columns in BigQuery tables stored as STRING fields. Unfortunately, some of these have fairly long (>20 character) string values.
My question is: what's the most memory and io efficient way to read these fields from bigquery to a pandas.DataFrame (or pyarrow.Table)?
Should I change the schema of my BigQuery tables? (Is there a categorical data type for this use case? The cardinality of the labels is between 1,000 and 20,000.)
Or should I change how I call `client.query(...).result().to_dataframe(...)`? Currently I have `client.query(query_text).result().to_dataframe(dtypes={"column_name": "category"})`, but passing `dtypes` to `to_dataframe` doesn't seem to help memory usage.
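For context, here is roughly what the full call looks like (a minimal sketch; the project, dataset, table, and column names below are placeholders, not my real schema):

```python
from google.cloud import bigquery

# Placeholder project name.
client = bigquery.Client(project="my-project")

# Placeholder query: two label columns plus an integer column.
query_text = """
    SELECT label_a, label_b, value
    FROM `my-project.my_dataset.my_table`
"""

# Current approach: ask for pandas' categorical dtype on the label columns.
# As far as I can tell the cast happens only after the result is downloaded,
# which would explain why peak memory doesn't improve.
df = (
    client.query(query_text)
    .result()
    .to_dataframe(dtypes={"label_a": "category", "label_b": "category"})
)
```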
Some additional details:
- the table:
  - one STRING field with ~20,000 distinct values
  - one STRING field with ~10,000 distinct values
  - an integer field
  - a row for every combination of the two STRING fields
  - around 200,000,000 rows
  - "table size" is reported as 7.24 GB
- querying in Python with `client.query(...).result().to_dataframe(dtypes=...)`:
  - takes 1m30s to execute
  - memory usage peaks at 18 GB
  - the resulting DataFrame is 3 GB
- fun fact: a CSV file with this data is only 300 MB! (the data are pivoted)
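In case it helps frame answers: I'd also be open to something along these lines, reading the result as a `pyarrow.Table` and dictionary-encoding the label columns (an untested sketch with the same placeholder names as above; I haven't verified whether it actually lowers peak memory):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

query_text = """
    SELECT label_a, label_b, value
    FROM `my-project.my_dataset.my_table`
"""  # placeholder table and columns

# Read the query result as a pyarrow.Table instead of a DataFrame.
table = client.query(query_text).result().to_arrow()

# Dictionary-encode the label columns so each distinct string is stored once.
for name in ("label_a", "label_b"):
    i = table.schema.get_field_index(name)
    table = table.set_column(i, name, table.column(name).dictionary_encode())

# Dictionary-encoded columns come back as pandas' category dtype.
df = table.to_pandas()
```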