4

OK, so after getting exceptions about not being able to write keys into a Parquet file via Spark, I looked into the API and found only this:

public class ParquetOutputFormat<T> extends FileOutputFormat<Void, T> { ... }

(My assumption could be wrong =D, and there might be another API somewhere.)

OK, this makes some warped sense; after all, you can project/restrict the data as it materialises out of the container file. However, just to be on the safe side: a Parquet file does not have the notion of a sequence file's "key" value, right?

I find this a bit odd: the Hadoop infrastructure is built around the fact that a sequence file may have a key, and I assume this key is used liberally to partition data into blocks for locality (not at the HDFS level, of course). Spark has a lot of API calls that work with keys to do reductions, joins, etc. Now I have to do an extra step to map the keys out of the body of the materialised object (see the sketch below). Weird.
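For what it's worth, this is roughly the extra step I mean. A minimal sketch, assuming the Spark SQL DataFrame API (SparkSession) rather than the raw Hadoop output format; the Event case class, column names, and path are hypothetical. The key has to travel as an ordinary column on the way out, and gets mapped back into a pair RDD after reading:

import org.apache.spark.sql.SparkSession

case class Event(userId: String, payload: String)   // hypothetical schema

val spark = SparkSession.builder().appName("parquet-key-sketch").getOrCreate()
import spark.implicits._

// Writing: the pair RDD has to be flattened so the key becomes a plain column.
val pairs = spark.sparkContext.parallelize(Seq(("u1", "a"), ("u2", "b")))
pairs.map { case (k, v) => Event(k, v) }
  .toDF()
  .write.parquet("/tmp/events.parquet")              // hypothetical path

// Reading: the key is just another column, so it has to be mapped back out
// before key-based operations like reduceByKey or join.
val keyed = spark.read.parquet("/tmp/events.parquet")
  .as[Event]
  .rdd
  .keyBy(_.userId)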

So, are there any good reasons why a key is not a first-class citizen in the Parquet world?

Hassan Syed

1 Answer

4

You are correct. A Parquet file is not a key/value file format; it's a columnar format. Your "key" can be a specific column from your table, but it's not like HBase, where you have a real key concept. Parquet is not a sequence file.
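To make that concrete, here is a minimal sketch (column and path names are hypothetical, using the Spark SQL DataFrame API): because Parquet is columnar, the "key" is simply whichever column you choose to group or join on, rather than a separate key slot as in a SequenceFile.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-column-as-key").getOrCreate()

// The "key" is just a column; read the files and operate on that column directly.
val events = spark.read.parquet("/tmp/events.parquet")   // hypothetical paths
val users  = spark.read.parquet("/tmp/users.parquet")

// Reductions and joins work against the chosen column, no key/value pairing needed.
val counts = events.groupBy("userId").count()
val joined = events.join(users, Seq("userId"))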

jmspaggi