
I have a parquet file (~1.5 GB) which I want to process with polars. The resulting dataframe has 250k rows and 10 columns. One column contains large chunks of text.

I have just started using polars because I heard many good things about it, one of which is that it is significantly faster than pandas.

Here is my issue / question:
The preprocessing of the dataframe is rather slow, so I started comparing to pandas. Am I doing something wrong, or is polars just slower for this particular use case? If so, is there a way to speed this up?

Here is my code in polars:

import polars as pl

df = (pl.scan_parquet("folder/myfile.parquet")
      .filter((pl.col("type")=="Urteil") | (pl.col("type")=="Beschluss"))
      .collect()
     )
df.head()

The entire code takes roughly 1 minute whereas just the filtering part takes around 13 seconds.

My code in pandas:

import pandas as pd 

df = (pd.read_parquet("folder/myfile.parquet")
    .query("type == 'Urteil' | type == 'Beschluss'") )
df.head()

The entire code also takes roughly 1 minute whereas just the querying part takes <1 second.

The dataframe has the following types for the 10 columns:

  • i64
  • str
  • struct[7]
  • str (for all remaining)

As mentioned, a column "content" stores large texts (1 to 20 pages each) which I need to preprocess and then store differently, I guess.

EDIT: removed the size part of the original post as the comparison was not like for like and does not appear to be related to my question.

FredMaster
  • Whenever the record size goes over a million I use Pyspark, in other cases always Pandas. – Tornike Kharitonishvili Feb 22 '23 at 09:12
  • What does `df.memory_usage(deep=True)` say in pandas? – Michel de Ruiter Feb 22 '23 at 09:17
  • `df.memory_usage(deep=True).sum()` results in `9710410625`. So I guess `.info()` was not the equivalent. But still, pandas is so much faster. – FredMaster Feb 22 '23 at 09:19
  • Did you include the time needed to load the data in memory? Polars is using lazy evaluation and loads data only when needed. Try using `scan_parquet(..).filter(..)` and compare the time needed with the full loading and processing time in pandas. – Panagiotis Kanavos Feb 22 '23 at 09:53
  • The time incl. loading into memory is approximately the same for both (ca. 1 minute). I use `pl.scan_parquet(...).filter(...).collect()` for polars. – FredMaster Feb 22 '23 at 10:01
  • Hi @FredMaster, what about if you set the column as categorical before running the filter? `df = df.with_columns(pl.col('type').cast(pl.Categorical)) ` – Luca Feb 22 '23 at 10:54
  • Hi @Luca. I had a very similar thought a couple minutes ago. Unfortunately, it doesn't really have an impact on computation time. I was actually wondering if polars is not suited for columns with large chunks of text? – FredMaster Feb 22 '23 at 11:12
  • Can you show your code for polars and pandas? You might be comparing apples with oranges. – ritchie46 Feb 22 '23 at 13:25
  • Hi @ritchie46. I have just updated the original post. Does this help? – FredMaster Feb 22 '23 at 14:53
  • Categorical will help if you initially save it that way. Converting to categorical on the fly just moves the expense to that conversion. – Dean MacGregor Feb 23 '23 at 13:58

2 Answers

5

As mentioned, a column "content" stores large texts (1 to 20 pages each) which I need to preprocess and then store differently, I guess.

This is where polars must do much more work than pandas. Polars uses the Arrow memory format for string data. When you filter your DataFrame, all columns are recreated for the rows where the mask evaluates to true.

That means that all the text bytes in the string columns need to be moved around, whereas pandas can just move the pointers to the Python objects around, i.e. a few bytes per row.

This only hurts if you have really large string values, e.g. when you are storing whole web pages. You can speed this up by converting to categoricals.
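
A minimal sketch of that suggestion, assuming the heavy column is called "content" and using the file path from the question (the variable names are mine):

import polars as pl

# Sketch: cast the page-sized "content" column to Categorical once, so that a
# filter only moves the small integer codes around instead of the full text
# bytes. As a comment above notes, the cast itself is not free, so this mainly
# pays off if you filter the frame repeatedly or store the data that way to
# begin with.
df = pl.read_parquet("folder/myfile.parquet").with_columns(
    pl.col("content").cast(pl.Categorical)
)

filtered = df.filter(
    (pl.col("type") == "Urteil") | (pl.col("type") == "Beschluss")
)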

ritchie46
  • Ok, understood. So I will keep this in mind when using polars. Overall, I am really impressed with the library. Great work! – FredMaster Feb 22 '23 at 15:44
0

My first suspicion was this, which was written by the author of Polars:

Polars does extra work in filtering string data that is not worth it in this case. Polars uses arrow large-utf8 buffers for their string data. This makes filtering more expensive than filtering python strings/chars (e.g. pointers or u8 bytes).

Then I did the following...

import numpy as np
import polars as pl

num_rows = 20_000_000
df = pl.DataFrame({
    'a': np.random.choice(['apple', 'banana', 'carrot', 'date', 'eggplant'], num_rows),
    'b': np.random.rand(num_rows),
    'c': np.random.rand(num_rows),
    'd': np.random.rand(num_rows),
})


%%timeit
df.filter((pl.col('a')=='apple') | (pl.col('a')=='banana'))

# 453 ms ± 39.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



import pandas as pd
dfpd=df.to_pandas()

%%timeit
dfpd.query("a=='apple' | a=='banana'")

# 2.25 s ± 64.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
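
For what it's worth, a hedged variant of the same benchmark that pads each row with a page-sized string to mimic the question's "content" column (the row count and padding length are assumptions chosen to roughly match the question's data size); with values like this, the filter has to copy the full text bytes for every matching row, which is the effect the other answer describes:

# Variant with long string values (sizes are assumptions, meant to roughly
# mimic 250k rows of page-sized text). Here the filter must move the text
# bytes of every matching row, so the polars timing should grow accordingly,
# whereas pandas only moves the pointers to its Python string objects.
num_rows = 250_000
long_text = 'x' * 5_000  # stand-in for one to twenty pages of text
df_long = pl.DataFrame({
    'a': np.random.choice(['apple', 'banana', 'carrot', 'date', 'eggplant'], num_rows),
    'content': [long_text] * num_rows,
})

# In a notebook cell:
# %%timeit
# df_long.filter((pl.col('a') == 'apple') | (pl.col('a') == 'banana'))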

My pl.show_versions:

---Version info---
Polars: 0.16.2
Index type: UInt32
Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python: 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC 11.3.0]
---Optional dependencies---
pyarrow: 11.0.0
pandas: 1.5.3
numpy: 1.23.5
fsspec: 2023.1.0
connectorx: 0.3.1
xlsx2csv: 0.8.1
deltalake: <not installed>
matplotlib: 3.6.3
Dean MacGregor