I have a parquet file (~1.5 GB) which I want to process with polars. The resulting dataframe has 250k rows and 10 columns. One column contains large chunks of text.
I have just started using polars because I heard many good things about it, one of them being that it is significantly faster than pandas.
Here is my issue / question:
The preprocessing of the dataframe is rather slow, so I started comparing it to pandas. Am I doing something wrong, or is polars just slower for this particular use case? If so, is there a way to speed it up?
Here is my code in polars:
import polars as pl

df = (
    pl.scan_parquet("folder/myfile.parquet")  # lazy scan, nothing is read yet
    .filter((pl.col("type") == "Urteil") | (pl.col("type") == "Beschluss"))
    .collect()  # execute the query and materialize the dataframe
)
df.head()
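While reading the polars docs I saw that the same filter can apparently also be written with is_in. I assume this is equivalent to the two == checks, though I don't know whether it changes anything performance-wise:

df = (
    pl.scan_parquet("folder/myfile.parquet")
    .filter(pl.col("type").is_in(["Urteil", "Beschluss"]))  # presumably same semantics as the == version
    .collect()
)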
The entire code takes roughly 1 minute whereas just the filtering part takes around 13 seconds.
My code in pandas:
import pandas as pd

df = (
    pd.read_parquet("folder/myfile.parquet")  # eager read of the whole file
    .query("type == 'Urteil' | type == 'Beschluss'")
)
df.head()
The entire code also takes roughly 1 minute whereas just the querying part takes <1 second.
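In case it helps to reproduce: the comparison boils down to something like this (a sketch with simple wall-clock timing, not my exact harness):

import time

import pandas as pd
import polars as pl

t0 = time.perf_counter()
df_pl = (
    pl.scan_parquet("folder/myfile.parquet")
    .filter((pl.col("type") == "Urteil") | (pl.col("type") == "Beschluss"))
    .collect()
)
print("polars:", time.perf_counter() - t0, "s")

t0 = time.perf_counter()
df_pd = (
    pd.read_parquet("folder/myfile.parquet")
    .query("type == 'Urteil' | type == 'Beschluss'")
)
print("pandas:", time.perf_counter() - t0, "s")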
The dataframe has the following types for the 10 columns:
- i64
- str
- struct[7]
- str (for all remaining)
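(For completeness, the schema can be inspected without loading the data; depending on the polars version this is read_parquet_schema, or collect_schema on a lazy frame:)

import polars as pl

# inspect column names and dtypes without reading the data itself
schema = pl.read_parquet_schema("folder/myfile.parquet")
for name, dtype in schema.items():
    print(name, dtype)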
As mentioned, a column "content" stores large texts (1 to 20 pages each) which I need to preprocess and then, I guess, store differently.
EDIT: removed the size part of the original post, as the comparison was not like-for-like and does not appear to be related to my question.