1

I am using the Apache Arrow Go library to read Parquet files. Non-repeated columns seem straightforward, but how can I read a repeated field?

flexwang
  • 625
  • 6
  • 16

1 Answer

0

For reading repeated fields in Parquet, there are really two answers: a complex way and an easy way.

The easy way is to use the pqarrow package and just read directly into an Arrow list array of some kind and let the complexity be handled for you. (https://pkg.go.dev/github.com/apache/arrow/go/v10@v10.0.1/parquet/pqarrow)

To read them the complex way, you have to understand repetition and definition levels and how Parquet uses them. Instead of trying to explain them here, I'm going to point you to the excellent write-up on the Apache Arrow blog: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/. It explains how to decode definition and repetition levels. (Yes, it's written in the context of the Rust implementation of Parquet, but the basic concepts are the same for the Go implementation.)

All of the ColumnChunkReader types allow you to retrieve the definition and repetition levels through their ReadBatch methods. For an example, have a look at https://pkg.go.dev/github.com/apache/arrow/go/v10@v10.0.1/parquet/file#Float32ColumnChunkReader.ReadBatch

When you call ReadBatch, you can pass one []int16 slice for the definition levels and another for the repetition levels. Both are filled in alongside the data, and you can then use them to decode the repeated field accordingly. Personally, I prefer to use the pqarrow package, which does it for you, but sometimes you do need the granular access.
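To give a rough idea of what "decode accordingly" means, here is a minimal sketch that uses only the standard library. It assumes the common three-level list layout (an optional list of optional int32 elements), so the maximum repetition level is 1 and the definition levels mean: 0 = null list, 1 = empty list, 2 = null element, 3 = value present. The function name `assembleLists` is hypothetical, and the three slices stand in for what ReadBatch would have filled in, with `values` holding only the non-null values. Other schemas have different level meanings, so treat this as an illustration rather than a general decoder.

```go
package main

import "fmt"

// Definition-level meanings for the ASSUMED schema:
// optional list -> repeated group -> optional int32 element.
const (
	defNullList  = 0 // the list itself is null
	defEmptyList = 1 // the list exists but has no elements
	defNullElem  = 2 // an element slot exists but is null
	defValue     = 3 // an element with a value
)

// assembleLists (hypothetical helper) rebuilds one row per list.
// A nil row is a null list; a nil pointer is a null element.
// values holds only the non-null values, in order.
func assembleLists(defLevels, repLevels []int16, values []int32) ([][]*int32, error) {
	if len(defLevels) != len(repLevels) {
		return nil, fmt.Errorf("level length mismatch: %d vs %d", len(defLevels), len(repLevels))
	}
	var rows [][]*int32
	next := 0 // index of the next unused entry in values
	for i, def := range defLevels {
		if repLevels[i] == 0 {
			// Repetition level 0 starts a new row.
			if def == defNullList {
				rows = append(rows, nil)
				continue
			}
			rows = append(rows, []*int32{})
			if def == defEmptyList {
				continue
			}
		} else if len(rows) == 0 || def < defNullElem {
			// A continuation must follow a row and carry an element.
			return nil, fmt.Errorf("malformed levels at index %d", i)
		}
		last := len(rows) - 1
		switch def {
		case defNullElem:
			rows[last] = append(rows[last], nil)
		case defValue:
			if next >= len(values) {
				return nil, fmt.Errorf("ran out of values at index %d", i)
			}
			v := values[next]
			next++
			rows[last] = append(rows[last], &v)
		default:
			return nil, fmt.Errorf("unexpected definition level %d at index %d", def, i)
		}
	}
	return rows, nil
}

func main() {
	// Levels encoding the rows [1, 2], null, [], [null, 3].
	def := []int16{3, 3, 0, 1, 2, 3}
	rep := []int16{0, 1, 0, 0, 0, 1}
	rows, err := assembleLists(def, rep, []int32{1, 2, 3})
	if err != nil {
		panic(err)
	}
	for i, row := range rows {
		if row == nil {
			fmt.Printf("row %d: null\n", i)
			continue
		}
		fmt.Printf("row %d: %d element(s)\n", i, len(row))
	}
}
```

With a real file you would run this over the slices returned by each ReadBatch call, taking care that a row can span two batches if a batch ends in the middle of a list.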

Zeroshade
  • 463
  • 2
  • 8