
I have two streams of data that have limited duration (typically 1-60 seconds) and I want to store them in a compressed data file for later retrieval. Right now I am using HDF5, but I've heard about Parquet and want to give it a try.

Stream 1:

The data is arriving as a series of records, approximately 2500 records per second. Each record is a tuple (timestamp, tag, data) with the following sizes:

  • timestamp: 64-bit value
  • tag: 8-bit value
  • data: variable-length octets (typically about 100 bytes per record, sometimes more, sometimes less)

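In pyarrow terms, I imagine Stream 1 would map onto a schema something like this (just a sketch; the field names are my own and I'm guessing at the types, e.g. that binary() is the right choice for variable-length octets):

    import pyarrow as pa

    # Rough guess at a Stream 1 schema; field names are my own invention
    stream1_schema = pa.schema([
        ("timestamp", pa.uint64()),  # 64-bit timestamp
        ("tag", pa.uint8()),         # 8-bit tag
        ("data", pa.binary()),       # variable-length octets, ~100 bytes typical
    ])
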
Stream 2:

The data is arriving as a series of records, approximately 100000 records per second. Each record is a tuple (timestamp, index, value) with the following sizes:

  • timestamp: 64-bit value
  • index: 16-bit value
  • value: 32-bit value

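And similarly for Stream 2 (again, just my guess at the types):

    import pyarrow as pa

    # Rough guess at a Stream 2 schema; field names are my own invention
    stream2_schema = pa.schema([
        ("timestamp", pa.uint64()),  # 64-bit timestamp
        ("index", pa.uint16()),      # 16-bit index
        ("value", pa.uint32()),      # 32-bit value
    ])
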
Can I do this with Apache Parquet? I am totally new to this and can't seem to find the right documentation; I found documentation about reading and writing entire tables, but in my case I need to write to the tables incrementally, in batches of some number of rows (depending on how large a buffer I want to use).
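
To show what I mean by writing incrementally, this is roughly the usage I'm hoping is supported, based on my reading of the ParquetWriter docs. The incoming_batches() generator here is just a stand-in for my real capture loop, and I haven't verified that each write_table() call behaves the way I expect:

    import time
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Placeholder: the Stream 2 schema from above
    schema = pa.schema([
        ("timestamp", pa.uint64()),
        ("index", pa.uint16()),
        ("value", pa.uint32()),
    ])

    def incoming_batches(n_batches=5, rows_per_batch=1000):
        """Stand-in for my real capture loop: yields lists of (timestamp, index, value)."""
        for _ in range(n_batches):
            yield [(time.time_ns(), i % 65536, i) for i in range(rows_per_batch)]

    # Open the file once, then append one buffered batch at a time.
    with pq.ParquetWriter("stream2.parquet", schema, compression="zstd") as writer:
        for batch in incoming_batches():
            table = pa.table(
                {
                    "timestamp": [r[0] for r in batch],
                    "index": [r[1] for r in batch],
                    "value": [r[2] for r in batch],
                },
                schema=schema,
            )
            writer.write_table(table)  # appends this batch to the open file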

I am interested in both Java and Python and can explore in either, but I'm more fluent in Python.

I found this page for pyarrow: https://arrow.apache.org/docs/python/parquet.html. It talks about row groups, ParquetWriter, and read_row_group(), but I can't tell whether it supports my use case.
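
For reading the data back later, I'm guessing the usage is something like this, but I'm not sure this is the intended pattern:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("stream2.parquet")
    print(pf.num_row_groups)      # how many row groups ended up in the file
    first = pf.read_row_group(0)  # reads just that row group as a pyarrow.Table
    print(first.num_rows)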

Any suggestions?

Jason S
