
I've written a lazy data-processing function with polars to process a large parquet dataset. Is there a way I can select N rows from the parquet file and still get a LazyFrame back? I notice that both .fetch(N) and .head(N) return DataFrames, not LazyFrames. Do I have to do e.g. pl.scan_parquet(filename).fetch(100_000).lazy()?

My dataset does not have a monotonically increasing id column.

The intention is to see if my function finishes in reasonable time on a large slice of the dataset.
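For context, a minimal sketch of the workaround described above and of how the pipeline gets timed on a slice; my_pipeline, the data.parquet path, and the "value" column are placeholders, not part of the real code:

import time
import polars as pl

def my_pipeline(lf: pl.LazyFrame) -> pl.LazyFrame:
    # placeholder for the real lazy processing function
    return lf.filter(pl.col("value") > 0)

# current workaround: fetch N rows eagerly, then wrap them back into a LazyFrame
lf_slice = pl.scan_parquet("data.parquet").fetch(100_000).lazy()

start = time.perf_counter()
result = my_pipeline(lf_slice).collect()
print(f"processed {result.height} rows in {time.perf_counter() - start:.2f}s")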

TomNorway

1 Answer


I had simply overlooked .limit(). Usage is then:

pl.scan_parquet(filename).limit(n=N)

It looks to me like the .fetch() operation recursively pushes a .limit through the lazy query, which allows fast debugging.
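For illustration, a minimal sketch of the difference between the two, assuming a placeholder data.parquet file:

import polars as pl

# .limit() keeps the query lazy; the row cap is applied when the query is run
lazy_slice = pl.scan_parquet("data.parquet").limit(n=100_000)
print(type(lazy_slice).__name__)   # LazyFrame

# .fetch() runs the query eagerly on a bounded number of rows and returns a DataFrame
eager_slice = pl.scan_parquet("data.parquet").fetch(100_000)
print(type(eager_slice).__name__)  # DataFrame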

TomNorway