
I have a pandas DataFrame I want to query often (in Ray via an API). I'm trying to speed up loading it, but it takes significant time (3+ s) to cast it into pandas. For most of my datasets this is fast, but this one is not. My guess is that it's because 90% of the columns are strings.

[742461 rows x 248 columns]

That comes to about 137 MB on disk. To rule out disk speed as a factor, I've placed the .parq file on a tmpfs mount.

Now I've tried:

  • Reading the parquet file with PyArrow's parquet read_table and then casting it to pandas (reading into an Arrow Table is immediate, but to_pandas takes 3 s); a rough sketch of this route is below the list
  • Playing around with pretty much every to_pandas option I can think of in pyarrow/parquet
  • Reading it using pd.read_parquet
  • Reading it from the Plasma in-memory object store (https://arrow.apache.org/docs/python/plasma.html) and converting it to pandas. Again, reading is immediate but to_pandas takes time.
  • Casting all string columns to categories
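
For reference, the PyArrow route looks roughly like this; the file name is a placeholder, and the split_blocks/self_destruct options need a reasonably recent pyarrow:

    import pyarrow.parquet as pq

    # Reading the file into an Arrow Table is essentially instant
    table = pq.read_table("data.parq")

    # The 3+ s is spent here; these are the options I've been experimenting with
    df = table.to_pandas(
        use_threads=True,    # convert columns in parallel
        split_blocks=True,   # one block per column instead of consolidated blocks
        self_destruct=True,  # release Arrow memory as columns are converted
    )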

Does anyone have good tips on how to speed up the pandas conversion when dealing with strings? I have plenty of cores and RAM.

My end result needs to be a pandas DataFrame, so I'm not bound to the parquet file format, although it's generally my favourite.

Regards, Niklas


1 Answer


In the end I reduced the time by handling the data more carefully, mainly by removing blank values, making sure we had as many NA values as possible (instead of blank strings etc.), and converting all text columns with less than 50% unique values to categories.
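
The clean-up looked roughly like the sketch below; the column handling is illustrative, not the exact code I run:

    import numpy as np
    import pandas as pd

    def tidy_strings(df: pd.DataFrame) -> pd.DataFrame:
        for col in df.select_dtypes(include="object").columns:
            # Blank / whitespace-only strings become proper NA values
            df[col] = df[col].replace(r"^\s*$", np.nan, regex=True)
            # Only categorise columns where less than 50% of the values are unique
            if df[col].nunique(dropna=True) / len(df) < 0.5:
                df[col] = df[col].astype("category")
        return df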

I ended up generating the schemas via PyArrow so I could create categorical columns with a custom index size (int64 instead of int16), so my categories could hold more values. The data size was reduced by 50% in the end.
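
The schema generation looked roughly like this; the column names are made up, the point is the dictionary type with an int64 index:

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([
        ("id", pa.int64()),
        # int64 dictionary indices can address far more categories than int16
        ("status", pa.dictionary(pa.int64(), pa.string())),
        ("comment", pa.string()),  # high-cardinality text stays as plain strings
    ])

    # df is the cleaned DataFrame from the step above
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "data.parq")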
