
I'm writing a Lambda to read records stored in Parquet files, restructure them into a `partition_key: {json_record}` format, and submit the records to a Kafka topic. I'm wondering if there's any way to do this without reading the entire table into memory at once.

I've tried using the `iter_row_groups` method from the `fastparquet` library, but my files only have one row group, so I'm still loading the entire table into memory. I also noticed that pyarrow's `BufferReader` has a `readlines` method, but it isn't implemented. Is true line-by-line reading of Parquet simply not possible?

It might be worth pointing out that I'm working with Parquet files stored in S3, so ideally a solution would be able to read from an S3 `StreamingBody`.

    Currently playing around with the `ParquetFile.iter_batches` method from `pyarrow`, realizing that I was probably too fixated on reading line-by-line, whereas reading in batches should still be very memory efficient (see the sketch after these comments) – James Kelleher Aug 04 '22 at 23:36
    If data is in S3 and you're writing to Kafka, I would use Pyspark in EMR, not a lambda – OneCricketeer Aug 09 '22 at 14:24
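
For reference, a minimal sketch of the `iter_batches` approach from the comment above, assuming the file is opened from S3 via `pyarrow.fs.S3FileSystem` and the records are produced with `kafka-python`; the bucket/key, topic name, batch size, and the `partition_key` column are all placeholders:

```python
# Sketch only: stream a Parquet file from S3 in record batches and produce
# each row to Kafka. Bucket/key, topic, and column names are assumptions.
import json

import pyarrow.fs as pafs
import pyarrow.parquet as pq
from kafka import KafkaProducer  # kafka-python

s3 = pafs.S3FileSystem(region="us-east-1")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

with s3.open_input_file("my-bucket/path/to/records.parquet") as f:
    pf = pq.ParquetFile(f)
    # iter_batches yields pyarrow RecordBatches of at most batch_size rows
    for batch in pf.iter_batches(batch_size=1_000):
        for record in batch.to_pylist():
            key = str(record["partition_key"]).encode()
            value = json.dumps(record, default=str).encode()
            producer.send("my-topic", key=key, value=value)

producer.flush()
```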

1 Answer


I suggest looking into DuckDB and Polars:

With DuckDB, one can certainly limit the query to, say, the top 1000 results. If you have some row index, iterating through the whole Parquet file with DuckDB and `SELECT ... WHERE` clauses should be easy.
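
A minimal sketch of that chunked-read idea with DuckDB, assuming a local file named `records.parquet` (for S3 you would first download the object or configure DuckDB's `httpfs` extension), paging with `LIMIT`/`OFFSET` rather than a pre-existing row-index column:

```python
# Sketch only: page through a Parquet file with DuckDB in fixed-size chunks.
import duckdb

CHUNK = 1000  # rows per query; tune to your memory budget
con = duckdb.connect()
total = con.execute(
    "SELECT count(*) FROM read_parquet('records.parquet')"
).fetchone()[0]

for offset in range(0, total, CHUNK):
    rows = con.execute(
        f"SELECT * FROM read_parquet('records.parquet') "
        f"LIMIT {CHUNK} OFFSET {offset}"
    ).fetchall()
    for row in rows:
        ...  # restructure into partition_key: {json_record} and send to Kafka
```

If the file already contains a row-index column, a `WHERE row_idx >= x AND row_idx < y` range filter works the same way.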

With Polars, you may experiment with the `row_count_name` and `row_count_offset` options of its Parquet readers. Again, with a row-index column in place, reading rows in chunks is doable.
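
A rough Polars equivalent, assuming a local `records.parquet` and using `row_count_name` with `scan_parquet` to add the index column; the column name `row_idx` and the chunk size are arbitrary:

```python
# Sketch only: read a Parquet file in row chunks with Polars.
import polars as pl

CHUNK = 1000
lazy = pl.scan_parquet("records.parquet", row_count_name="row_idx")
total = lazy.select(pl.count()).collect()[0, 0]

for start in range(0, total, CHUNK):
    chunk_df = lazy.filter(
        (pl.col("row_idx") >= start) & (pl.col("row_idx") < start + CHUNK)
    ).collect()
    for record in chunk_df.to_dicts():
        ...  # restructure into partition_key: {json_record} and send to Kafka
```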

darked89