
A simple row-wise shuffle in Polars with

df = df.sample(fraction=1.0, shuffle=True)

has a peak memory usage of about 2x the size of the dataframe (profiled with mprof).

Is there any fast way to perform a row-wise shuffle in Polars while keeping memory usage as low as possible? Shuffling column by column (or a batch of columns at a time) with the same seed, or using .take with a random index, does the trick but is quite slow.

Danny Friar

1 Answer


A shuffle cannot be done in place: Polars memory is often shared between columns, series, and Arrow buffers.

A shuffle therefore has to allocate new memory buffers. If we shuffle the whole DataFrame in parallel (which `sample` does), we allocate new buffers for every column at once and write the shuffled data into them, hence the 2x memory usage.

ritchie46