Edit: As of Polars 0.14.1, we can use the //
operator as floor division instead of using floor
, so that the algorithm becomes:
step_x = 2
step_y = 2
(
df.with_columns(
[
((pl.col("x") // step_x) * step_x).alias("cell_x"),
((pl.col("y") // step_y) * step_y).alias("cell_y"),
]
)
.groupby(["cell_x", "cell_y"])
.agg(
[
pl.col("val").sum().alias("agg_value"),
]
)
.sort(["cell_x", "cell_y"]).collect()
)
One easy (and performant) way to solve this is to use the floor
function to calculate the grid coordinates. We can easily accomplish all of this in Lazy mode, and using only Polars Expressions. (And best of all, no slow nested for loops.)
Let's start with your data. We'll put the DataFrame in Lazy mode.
import polars as pl
df = (
pl.DataFrame(
{
"x": [0.5, 2, 2.5, 5.5],
"y": [1.5, 2.5, 3.5, 3.5],
"val": [1, 1, 1, 1],
}
)
).lazy()
df.collect()
shape: (4, 3)
┌─────┬─────┬─────┐
│ x ┆ y ┆ val │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0.5 ┆ 1.5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.0 ┆ 2.5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2.5 ┆ 3.5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5.5 ┆ 3.5 ┆ 1 │
└─────┴─────┴─────┘
We can aggregate the values into grid cells as follows:
step_x = 2
step_y = 2
(
df.with_columns(
[
((pl.col("x") / step_x).floor() * step_x).alias("cell_x"),
((pl.col("y") / step_y).floor() * step_y).alias("cell_y"),
]
)
.groupby(["cell_x", "cell_y"])
.agg(
[
pl.col("val").sum().alias("agg_value"),
]
)
.sort(["cell_x", "cell_y"])
.collect()
)
shape: (3, 3)
┌────────┬────────┬───────────┐
│ cell_x ┆ cell_y ┆ agg_value │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 │
╞════════╪════════╪═══════════╡
│ 0.0 ┆ 0.0 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.0 ┆ 2.0 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0 ┆ 2.0 ┆ 1 │
└────────┴────────┴───────────┘
Since the calculation of the coordinates is in a with_columns
context, they will run in parallel.
Other Notes
To trace how the algorithm maps each point to the lower-left coordinate of the grid that contains it, just comment out the groupby
and agg
methods.
step_x = 2
step_y = 2
(
df.with_columns(
[
((pl.col("x") / step_x).floor() * step_x).alias("cell_x"),
((pl.col("y") / step_y).floor() * step_y).alias("cell_y"),
]
)
.sort(["cell_x", "cell_y"])
.collect()
)
shape: (4, 5)
┌─────┬─────┬─────┬────────┬────────┐
│ x ┆ y ┆ val ┆ cell_x ┆ cell_y │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════╪════════╪════════╡
│ 0.5 ┆ 1.5 ┆ 1 ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2.0 ┆ 2.5 ┆ 1 ┆ 2.0 ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2.5 ┆ 3.5 ┆ 1 ┆ 2.0 ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5.5 ┆ 3.5 ┆ 1 ┆ 4.0 ┆ 2.0 │
└─────┴─────┴─────┴────────┴────────┘
Also, I was careful to design the algorithm to work with negative grid coordinates (if you need that). In general, you have to be careful to distinguish between "truncating" and "casting" which converts -2.5 to -2, versus "floor" which converts -2.5 to -3. (We want "floor" in this case.)
For example:
import numpy as np
rng = np.random.default_rng(1)
nbr_rows = 10
df = (
pl.DataFrame(
{
"x": rng.uniform(-10, 10, nbr_rows),
"y": rng.uniform(-10, 10, nbr_rows),
"val": rng.integers(1, 10, nbr_rows),
}
)
.with_columns([pl.col(["x", "y"]).round(1).keep_name()])
.sort(["x", "y"])
.lazy()
)
df.collect()
shape: (10, 3)
┌──────┬──────┬─────┐
│ x ┆ y ┆ val │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 │
╞══════╪══════╪═════╡
│ -9.4 ┆ -4.8 ┆ 9 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ -7.1 ┆ -3.4 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ -3.8 ┆ -3.9 ┆ 5 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ -1.8 ┆ -1.9 ┆ 9 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ -1.5 ┆ -0.9 ┆ 5 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 0.2 ┆ 5.1 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1.0 ┆ -5.9 ┆ 7 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 6.6 ┆ -7.3 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 9.0 ┆ 0.8 ┆ 7 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 9.0 ┆ 5.8 ┆ 3 │
└──────┴──────┴─────┘
The algorithm maps each point to lower-left grid coordinates as follows:
shape: (10, 5)
┌──────┬──────┬─────┬────────┬────────┐
│ x ┆ y ┆ val ┆ cell_x ┆ cell_y │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 ┆ f64 ┆ f64 │
╞══════╪══════╪═════╪════════╪════════╡
│ -9.4 ┆ -4.8 ┆ 9 ┆ -10.0 ┆ -6.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -7.1 ┆ -3.4 ┆ 1 ┆ -8.0 ┆ -4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -3.8 ┆ -3.9 ┆ 5 ┆ -4.0 ┆ -4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -1.8 ┆ -1.9 ┆ 9 ┆ -2.0 ┆ -2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ -1.5 ┆ -0.9 ┆ 5 ┆ -2.0 ┆ -2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1.0 ┆ -5.9 ┆ 7 ┆ 0.0 ┆ -6.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0.2 ┆ 5.1 ┆ 1 ┆ 0.0 ┆ 4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.6 ┆ -7.3 ┆ 2 ┆ 6.0 ┆ -8.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 9.0 ┆ 0.8 ┆ 7 ┆ 8.0 ┆ 0.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 9.0 ┆ 5.8 ┆ 3 ┆ 8.0 ┆ 4.0 │
└──────┴──────┴─────┴────────┴────────┘