
I was thinking about using polars in place of numpy in a parsing problem where I turn a structured text file into a character table and operate on different columns. However, it seems that polars is about 5 times slower than numpy in most operations I'm performing. I was wondering why that's the case and whether I'm doing something wrong given that polars is supposed to be faster.

Example:

import requests
import numpy as np
import polars as pl

# Download the text file
text = requests.get("https://files.rcsb.org/download/3w32.pdb").text

# Turn it into a 2D array of characters
char_tab_np = np.array(text.splitlines()).view(dtype=(str, 1)).reshape(-1, 80)

# Create a polars DataFrame from the numpy array
char_tab_pl = pl.DataFrame(char_tab_np)

# Sort by first column with numpy
char_tab_np[np.argsort(char_tab_np[:,0])]

# Sort by first column with polars
char_tab_pl.sort(by="column_0")

Using %%timeit in Jupyter, the numpy sort takes about 320 microseconds, whereas the polars sort takes about 1.3 milliseconds, i.e. roughly four times slower.

I also tried char_tab_pl.lazy().sort(by="column_0").collect(), but it had no effect on the duration.

Another example (take all rows where the first column equals 'A'):

# with numpy
%%timeit
char_tab_np[char_tab_np[:, 0] == "A"]
# with polars
%%timeit
char_tab_pl.filter(pl.col("column_0") == "A")

Again, numpy takes 226 microseconds, whereas polars takes 673 microseconds, about three times slower.

Update

Based on the comments, I tried two other things:

1. Making the file 1000 times larger to see whether polars performs better on larger data.

Results: numpy was still faster, by a factor of about 1.6 (1.3 ms vs. 2.1 ms). In addition, creating the character array took numpy about 2 seconds, whereas polars needed about 2 minutes to create the dataframe, i.e. about 60 times slower.

To reproduce, just add text *= 1000 before creating the numpy array in the code above (also shown in the sketch after point 2 below).

2. Casting to integer.

For the original (smaller) file, casting to int sped things up for both numpy and polars. Filtering in numpy was still about 4 times faster than in polars (30 microseconds vs. 120), whereas the sorting times became more similar (150 microseconds for numpy vs. 200 for polars).

However, for the large file, polars was marginally faster than numpy, but the huge instantiation time makes it worthwhile only if the dataframe is to be queried thousands of times.
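For reference, here is a minimal sketch of the integer-casting variant, based on the snippet in the comments below. The uint8 code-point view comes from that comment; the comparison against ord("A") and the _int variable names are only for illustration.

import requests
import numpy as np
import polars as pl

text = requests.get("https://files.rcsb.org/download/3w32.pdb").text
# text *= 1000  # uncomment to reproduce the "1000 times larger" experiment

# view each one-character string as its 32-bit code point, then narrow to uint8
char_tab_np_int = (
    np.array(text.splitlines())
    .view(dtype="|U1")
    .view(np.int32)
    .astype(np.uint8)
    .reshape(-1, 80)
)
char_tab_pl_int = pl.DataFrame(char_tab_np_int)

# the benchmarks now compare small integers instead of strings
char_tab_np_int[char_tab_np_int[:, 0] == ord("A")]
char_tab_pl_int.filter(pl.col("column_0") == ord("A"))
char_tab_pl_int.sort(by="column_0")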

  • I guess that's just the performance hit you take for the added convenience of dataframes over regular arrays. In fact, on my computer, numpy is 5 times faster than polars, and polars itself is 3x faster than pandas – Pranav Hosangadi Jan 19 '23 at 05:25
  • I'm not sure why you expect polars to be faster than a numpy array though. It is faster than pandas, which is what it's supposed to be – Pranav Hosangadi Jan 19 '23 at 05:37
  • @PranavHosangadi well I thought polars can run parallel on multiple cores whereas numpy is single-threaded, so shouldn't it be approximately {number of cores} times faster than numpy? – Qunatized Jan 19 '23 at 05:52
  • Note that polars is handling this using its Utf8 data type, which supports arbitrary length unicode strings. That likely creates some overhead. I find that casting each character to an int increases the polars speed by 5x: `char_tab_np = np.array(text.splitlines()).view(dtype='|U1').view(np.int32).astype(np.uint8).reshape(-1,80)` – Nick ODell Jan 19 '23 at 05:57
  • There is always an overhead to invoking a multithreaded program. There is the cost of starting a thread, waiting for it to be scheduled, and waiting for it to finish. For long enough programs, that overhead is outweighed by parallelism. But for short programs, the overhead dominates. – Nick ODell Jan 19 '23 at 06:02
  • @NickODell You are right; casting to int does speed up the process, but it does for both numpy and polars, so polars is still about 5 times slower than numpy. On my pc numpy takes about 30 microseconds after casting, and polars takes 135. – Qunatized Jan 19 '23 at 06:10
  • Polars comes with a query optimizer. Polars does a lot more work which normally gets amortized over a longer query and more data. How many rows does your benchmark have? Try at least 10M rows or so. – ritchie46 Jan 19 '23 at 06:41
  • @ritchie46 I just updated my question with more data; the original file had 5k rows, so I made it 1000 times bigger, i.e. 5M rows. The speeds became more similar, but polars was still slower than numpy for string data, and in addition, it was about 60 times slower in instantiating the dataframe. – Qunatized Jan 19 '23 at 07:00

1 Answer


Polars does extra work when filtering string data that is not worth it in this case. Polars uses Arrow large-utf8 buffers for its string data. This makes filtering more expensive than filtering python strings/chars (e.g. pointers or u8 bytes).

Sometimes it is worth it, sometimes not. If you have homogeneous data, numpy is a better fit than polars. If you have heterogeneous data, polars will likely be faster, especially if you consider your whole query instead of these micro-benchmarks.
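A minimal sketch of what "considering the whole query" can look like (not part of the original answer; sorting by column_1 is just an arbitrary second step): the filter and sort from the question expressed as a single lazy query, so the optimizer plans both steps together instead of materializing an intermediate result for each.

result = (
    char_tab_pl
    .lazy()
    .filter(pl.col("column_0") == "A")  # keep rows whose first character is "A"
    .sort(by="column_1")                # then order the remaining rows by the second character
    .collect()                          # executed as one query plan
)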

ritchie46