4

OK, so after getting exceptions about not being able to write keys into a Parquet file via Spark, I looked into the API and found only this:

public class ParquetOutputFormat<T> extends FileOutputFormat<Void, T> { ... }

(My assumption could be wrong =D, and there might be another API somewhere.)

OK, this makes some warped sense; after all, you can project/restrict the data as it materialises out of the container file. However, just to be on the safe side: a Parquet file does not have the notion of a sequence file's "key" value, right?

I find this a bit odd: the Hadoop infrastructure is built around the fact that a sequence file may have a key, and I assume this key is used liberally to partition data into blocks for locality (not at the HDFS level, of course). Spark has a lot of API calls that work with keys to do reductions, joins, etc. Now I have to do an extra step to map the keys out of the body of the materialised object (see the sketch below). Weird.
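For what it's worth, this is roughly the extra step I mean. A minimal sketch, assuming the Spark SQL DataFrame API (SparkSession) rather than the raw Hadoop output format; the Event case class, column names, and path are hypothetical. The key has to travel as an ordinary column on the way out, and gets mapped back into a pair RDD after reading:

import org.apache.spark.sql.SparkSession

case class Event(userId: String, payload: String)   // hypothetical schema

val spark = SparkSession.builder().appName("parquet-key-sketch").getOrCreate()
import spark.implicits._

// Writing: the pair RDD has to be flattened so the key becomes a plain column.
val pairs = spark.sparkContext.parallelize(Seq(("u1", "a"), ("u2", "b")))
pairs.map { case (k, v) => Event(k, v) }
  .toDF()
  .write.parquet("/tmp/events.parquet")              // hypothetical path

// Reading: the key is just another column, so it has to be mapped back out
// before key-based operations like reduceByKey or join.
val keyed = spark.read.parquet("/tmp/events.parquet")
  .as[Event]
  .rdd
  .keyBy(_.userId)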

So, are there any good reasons why a key is not a first-class citizen in the Parquet world?

Hassan Syed

1 Answer

4

You are correct. A Parquet file is not a key/value file format; it's a columnar format. Your "key" can be a specific column from your table, but it's not like HBase, where you have a real key concept. Parquet is not a sequence file.
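To make that concrete, here is a minimal sketch (column and path names are hypothetical, using the Spark SQL DataFrame API): because Parquet is columnar, the "key" is simply whichever column you choose to group or join on, rather than a separate key slot as in a SequenceFile.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-column-as-key").getOrCreate()

// The "key" is just a column; read the files and operate on that column directly.
val events = spark.read.parquet("/tmp/events.parquet")   // hypothetical paths
val users  = spark.read.parquet("/tmp/users.parquet")

// Reductions and joins work against the chosen column, no key/value pairing needed.
val counts = events.groupBy("userId").count()
val joined = events.join(users, Seq("userId"))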

jmspaggi