Lazy mode and query optimization in Polars are great tools for saving on memory allocations and CPU usage for a single data frame. I wonder if there is a way to do the same across multiple lazy frames, for example:

lpdf1 = pdf1.lazy()
lpdf2 = pdf2.lazy()

result_lpdf = -lpdf1/lpdf2
result_pdf = result_lpdf.collect()

The above code will not run, as division and negation are not implemented for LazyFrame. My aim is to create the new result_pdf frame without creating one temporary frame for the division and then another for the negation (as would be the case in pandas and numpy).

I'm trying to get some performance improvement relative to -pdf1/pdf2, on frames of size (283681, 93). Any suggestions are welcome.

Mark Horvath

1 Answer

You can use .with_context() to make the columns of a second LazyFrame available to expressions on the first.

Adding a suffix to one set of columns lets you tell the two sets apart.

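import polars as pl

left = pl.DataFrame(dict(a=[-16, -12, -9], b=[20, 12, 10])).lazy()
right = pl.DataFrame(dict(a=[4, 3, 3], b=[10, 2, 5])).lazy()
(
   left
   # expose right's columns to left's expressions as "a_right", "b_right"
   .with_context(right.select(pl.all().suffix("_right")))
   .select(
      # -left / right, column by column; the result keeps left's column name
      pl.col(name) * -1 / pl.col(f"{name}_right")
      for name in left.columns
   )
   .collect()
)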
shape: (3, 2)
┌─────┬──────┐
│ a   │ b    │
│ --- │ ---  │
│ f64 │ f64  │
╞═════╪══════╡
│ 4.0 │ -2.0 │
├─────┼──────┤
│ 4.0 │ -6.0 │
├─────┼──────┤
│ 3.0 │ -2.0 │
└─────┴──────┘
jqurious
  • Great trick! Seems like what I was looking for, yet there is one last bit I still don't follow. Would you expect this to be faster than simply doing -pdf1/pdf2 in eager mode? I'm finding the two have very similar performance. Could it be that this still allocates temporary frames (or series)? I'm editing the question to include my frame shapes. – Mark Horvath Dec 17 '22 at 17:03