
I have a structure like this (I used JSON to represent data here, but this can be an object in any form):

[ 
  {
    "DocID": ["A", "B"]
  },
  {}
]

Based on the Dremel spec, the repetition levels for the only data field here, "DocID" (which is repeated), are {0,1,0} and the definition levels are {1,1,0}, since the last item is null.
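To make the levels above concrete, here is a minimal Go sketch of how they could be derived. It assumes a schema with a single top-level repeated string field whose elements cannot themselves be null, so the maximum definition level is 1. The function name `shred` is hypothetical and is not taken from fraugster/parquet-go.

```go
package main

import "fmt"

// shred computes repetition levels, definition levels and stored values for
// a single top-level repeated string field. An empty slice stands for a
// record in which the field is absent. (Hypothetical helper for illustration.)
func shred(records [][]string) (rLevels, dLevels []int, values []string) {
	for _, rec := range records {
		if len(rec) == 0 {
			// Absent field: one entry with both levels 0 and no stored value.
			rLevels = append(rLevels, 0)
			dLevels = append(dLevels, 0)
			continue
		}
		for i, v := range rec {
			r := 1 // a further element of the same record
			if i == 0 {
				r = 0 // the first element starts a new record
			}
			rLevels = append(rLevels, r)
			dLevels = append(dLevels, 1)
			values = append(values, v)
		}
	}
	return rLevels, dLevels, values
}

func main() {
	r, d, v := shred([][]string{{"A", "B"}, {}})
	fmt.Println(r, d, v) // prints: [0 1 0] [1 1 0] [A B]
}
```

Under this assumption only "A" and "B" end up in the value stream, while the empty record contributes a level pair but no value.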

Now if I have something like this:

[ 
  {
    "DocID": ["A", "B"]
  },
  { "DocID": [null] }
]

Then again, the repetition levels are {0,1,0}, and the definition levels are {1,1,1}.

When storing Dremel data in Parquet, we never store null fields (Here).

So we store two values, "A" and "B", in this case (the encoding doesn't matter). To reconstruct the structure, we walk through the levels. The first RLevel is 0, so this is the start of a new object, and the first DLevel is 1, so the value is not null: we read the first value, which is "A" (correct). The second RLevel is 1, which means we are still in the same object and this is another entry of the repeated field. Its DLevel is 1, so it is not null: we read the second value, which is "B" (correct). The third RLevel is 0, which means a new object.

In the first example, the third DLevel is 0, so the field is null and we don't need to read anything (there is nothing left anyway), and it works. But in the second example, the third DLevel is 1, so we need to read a value, and there is nothing left to read.
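The walk-through above can be sketched in Go as follows. This is only an illustration of the procedure as described in this question, not the actual fraugster/parquet-go code: the function `assemble`, the error `errNoValue`, and the `maxDef` parameter are hypothetical names, and the sketch assumes a value is stored only when the definition level equals `maxDef` (1 here).

```go
package main

import (
	"errors"
	"fmt"
)

// errNoValue is returned when the levels call for a value but none is left.
var errNoValue = errors.New("definition level expects a value, but the value stream is exhausted")

// assemble rebuilds the records of a single repeated string column from its
// repetition levels, definition levels and stored (non-null) values.
// maxDef is the definition level at which a value is actually stored.
func assemble(rLevels, dLevels []int, values []string, maxDef int) ([][]string, error) {
	if len(rLevels) != len(dLevels) {
		return nil, errors.New("level slices differ in length")
	}
	var records [][]string
	next := 0 // index of the next unread value
	for i, r := range rLevels {
		if r == 0 {
			// Repetition level 0 starts a new record.
			records = append(records, []string{})
		} else if len(records) == 0 {
			return nil, errors.New("first repetition level must be 0")
		}
		if dLevels[i] < maxDef {
			// Not fully defined: nothing was stored, so nothing is read.
			continue
		}
		if next >= len(values) {
			return records, fmt.Errorf("entry %d: %w", i, errNoValue)
		}
		last := len(records) - 1
		records[last] = append(records[last], values[next])
		next++
	}
	return records, nil
}

func main() {
	values := []string{"A", "B"}
	rLevels := []int{0, 1, 0}

	// First example: the last definition level is 0, so no value is read.
	recs, err := assemble(rLevels, []int{1, 1, 0}, values, 1)
	fmt.Println(recs, err)

	// Second example as read in the walk-through: the last definition level
	// is 1, so a third value is requested and the decoder runs out of data.
	recs, err = assemble(rLevels, []int{1, 1, 1}, values, 1)
	fmt.Println(recs, err)
}
```

The first call decodes both records cleanly, while the second returns an error at the third entry, which is exactly the situation described above.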

What should we do in this case?

Just for context, I am a co-author of the fraugster/parquet-go library, and this is an issue we faced recently.

fzerorubigd