I have a pandas DataFrame that I want to query often (in Ray, via an API). I'm trying to speed up loading it, but it takes significant time (3+ seconds) to cast it into pandas. Most of my datasets convert quickly, but this one doesn't. My guess is that it's slow because around 90% of the data is strings.
The DataFrame is [742461 rows x 248 columns], which is about 137MB on disk. To eliminate disk speed as a factor, I've placed the .parq file in a tmpfs mount.
So far I've tried:
- Reading the Parquet file using PyArrow (pyarrow.parquet.read_table) and then casting it to pandas (reading into a Table is immediate, but to_pandas takes ~3 s)
- Playing around with pretty much every setting of to_pandas I can think of in pyarrow/parquet, as sketched after this list
- Reading it using pd.read_parquet
- Reading it from the Plasma in-memory object store (https://arrow.apache.org/docs/python/plasma.html) and converting to pandas (second sketch below). Again, reading is immediate, but to_pandas takes time.
- Casting all string columns to categories
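For concreteness, here's roughly what the Parquet-side attempts look like. The path is made up, and the split_blocks/self_destruct options assume pyarrow >= 0.17:

```python
import pandas as pd
import pyarrow.parquet as pq

PATH = "/mnt/ramdisk/data.parq"  # placeholder for my tmpfs-mounted file

# 1. read_table is near-instant; the to_pandas call is where the ~3 s go
table = pq.read_table(PATH)
df = table.to_pandas()

# 2. variations on to_pandas settings; none helped noticeably
df = table.to_pandas(
    use_threads=True,           # multi-threaded conversion
    split_blocks=True,          # one internal block per column, avoids consolidation
    self_destruct=True,         # free Arrow memory as columns are converted
    deduplicate_objects=False,  # skip Python-string deduplication
)  # note: self_destruct invalidates `table` afterwards

# 3. plain pandas (delegates to pyarrow under the hood)
df = pd.read_parquet(PATH)

# Casting all strings to categoricals during the conversion
df = pq.read_table(PATH).to_pandas(strings_to_categorical=True)
```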
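And a sketch of the Plasma attempt, following the linked docs (this assumes an older pyarrow, since Plasma has since been deprecated and removed; the object ID is a placeholder for the one produced by whichever process stored the table):

```python
import pyarrow as pa
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")   # Plasma store socket path
object_id = plasma.ObjectID(20 * b"0")   # placeholder; real ID comes from the writer

# Fetching the buffer out of shared memory is immediate...
[buf] = client.get_buffers([object_id])
reader = pa.RecordBatchStreamReader(pa.BufferReader(buf))
table = reader.read_all()

# ...but this conversion still takes ~3 s
df = table.to_pandas()
```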
Does anyone have any good tips on how to speed up the pandas conversion when dealing with strings? I have plenty of cores and RAM.
My end result needs to be a pandas DataFrame, so I'm not bound to the Parquet file format, although it's generally my favourite.
Regards, Niklas