
Sorting by particular columns brings together all rows that share the same tuple of values in those columns. I want to cluster rows in the same way, but keep the groups in the order in which their first members appeared rather than in sorted order.

Something like this:

import polars as pl

df = pl.DataFrame(dict(x=[1,0,1,0], y=[3,1,2,4]))

df.cluster('x')  # hypothetical method showing the desired behavior
# shape: (4, 2)
# ┌─────┬─────┐
# │ x   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 3   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1   ┆ 2   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 0   ┆ 1   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 0   ┆ 4   │
# └─────┴─────┘
drhagen

1 Answer


This can be done by:

  1. Storing the row index temporarily
  2. Setting the row index to the lowest value within a window over the columns of interest
  3. Sorting by that minimum index
  4. Deleting the temporary row index column

import polars as pl

df = pl.DataFrame(dict(x=[1,0,1,0], y=[3,1,2,4]))

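(
    df
    # with_columns is used here because newer Polars releases no longer provide with_column
    .with_columns(pl.arange(0, pl.count()).alias('_index'))  # 1. temporary row index
    .with_columns(pl.min('_index').over('x'))  # 2. lowest index within each group of x
    .sort('_index')  # 3. order the groups by first appearance
    .drop('_index')  # 4. remove the temporary column
)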
# shape: (4, 2)
# ┌─────┬─────┐
# │ x   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 3   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1   ┆ 2   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 0   ┆ 1   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 0   ┆ 4   │
# └─────┴─────┘
drhagen