I'm writing a Lambda function to read records stored in Parquet files, restructure them into a `partition_key: {json_record}` format, and submit the records to a Kafka topic. I'm wondering if there's any way to do this without reading the entire table into memory at once.
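For reference, the produce side looks roughly like this (a minimal sketch using `kafka-python`; the broker address, topic name, and `partition_key` field are placeholders):

```python
import json

from kafka import KafkaProducer  # kafka-python; placeholder client choice

# Broker address is a placeholder.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def submit_record(record: dict) -> None:
    # One field of the record becomes the Kafka message key; the whole
    # record, serialized as JSON, becomes the message value.
    producer.send("my-topic", key=record["partition_key"], value=record)
```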
I've tried using the `iter_row_groups` method from the `fastparquet` library, but my files only contain a single row group, so I'm still loading the entire table into memory. I also noticed that the `BufferReader` from `pyarrow` has a `readlines` method, but it isn't implemented. Is true line-by-line reading of Parquet just not possible?
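Here's roughly what my `fastparquet` attempt looks like (a minimal sketch; the file path is a placeholder):

```python
from fastparquet import ParquetFile

pf = ParquetFile("records.parquet")  # placeholder path

# iter_row_groups yields one pandas DataFrame per row group, but my
# files contain only a single row group, so this loads everything anyway.
for df in pf.iter_row_groups():
    for record in df.to_dict(orient="records"):
        ...  # restructure and submit to Kafka
```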
Might be worth pointing out that I'm working with Parquet files stored in S3, so ideally a solution would be able to read from a `StreamingBody`.
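For context, this is roughly how I'm fetching the file today (bucket and key are placeholders); the `read()` call is exactly the all-at-once load I'd like to avoid:

```python
import io

import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="records.parquet")  # placeholders

# obj["Body"] is a botocore StreamingBody; read() pulls the entire
# object into memory before pyarrow ever sees it.
table = pq.read_table(io.BytesIO(obj["Body"].read()))
```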