
A simple row-wise shuffle in Polars with

df = df.sample(fraction=1.0, shuffle=True)

has a peak memory usage of about 2x the size of the dataframe (profiled with mprof).

Is there any fast way to perform a row-wise shuffle in Polars while keeping memory usage as low as possible? Shuffling column by column (or a batch of columns at a time) with the same seed, or using .take with a random index, does the trick but is quite slow.

Danny Friar

1 Answer


A shuffle cannot be done in place: Polars memory is often shared between columns, series, and Arrow buffers.

A shuffle therefore has to allocate new memory buffers. If we shuffle the whole DataFrame in parallel (which `sample` does), we allocate new buffers for every column at once and write the shuffled data into them, hence the 2x memory usage.

ritchie46