I have some category / group label columns in BigQuery tables stored as STRING fields. Unfortunately, some of these have fairly long (>20 character) string values.
My question is: what's the most memory and io efficient way to read these fields from bigquery to a pandas.DataFrame (or pyarrow.Table)?
Should I change the schema of my BigQuery tables? (Is there a categorical data type for this use case? The cardinality of the labels is between 1,000 and 20,000.)
Or should I change how I call `client.query(...).result().to_dataframe(...)`? Currently I have `client.query(query_text).result().to_dataframe(dtypes={"column_name": "category"})`, but passing `dtypes` to `to_dataframe` doesn't seem to help memory usage.
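For context, here is roughly what the full call looks like (a minimal sketch; the project, dataset, table, and column names below are placeholders, not my real schema):

```python
from google.cloud import bigquery

# Placeholder project name.
client = bigquery.Client(project="my-project")

# Placeholder query: two label columns plus an integer column.
query_text = """
    SELECT label_a, label_b, value
    FROM `my-project.my_dataset.my_table`
"""

# Current approach: ask for pandas' categorical dtype on the label columns.
# As far as I can tell the cast happens only after the result is downloaded,
# which would explain why peak memory doesn't improve.
df = (
    client.query(query_text)
    .result()
    .to_dataframe(dtypes={"label_a": "category", "label_b": "category"})
)
```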
Some additional details:
- the table:
  - one STRING field with ~20,000 distinct values
  - one STRING field with ~10,000 distinct values
  - an integer field
  - a row for every combination of the two STRING fields
  - around 200,000,000 rows
  - "table size" is reported as 7.24 GB
- querying in Python with `client.query(...).result().to_dataframe(dtypes=...)`:
  - takes 1m30s to execute
  - memory usage peaks at 18 GB
  - the resulting DataFrame is 3 GB
- fun fact: a CSV file with this data is only 300 MB! (the data are pivoted)
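In case it helps frame answers: I'd also be open to something along these lines, reading the result as a `pyarrow.Table` and dictionary-encoding the label columns (an untested sketch with the same placeholder names as above; I haven't verified whether it actually lowers peak memory):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

query_text = """
    SELECT label_a, label_b, value
    FROM `my-project.my_dataset.my_table`
"""  # placeholder table and columns

# Read the query result as a pyarrow.Table instead of a DataFrame.
table = client.query(query_text).result().to_arrow()

# Dictionary-encode the label columns so each distinct string is stored once.
for name in ("label_a", "label_b"):
    i = table.schema.get_field_index(name)
    table = table.set_column(i, name, table.column(name).dictionary_encode())

# Dictionary-encoded columns come back as pandas' category dtype.
df = table.to_pandas()
```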