OK, so after getting exceptions about not being able to write keys into a Parquet file via Spark, I looked into the API and found only this:
public class ParquetOutputFormat<T> extends FileOutputFormat<Void, T> {....
(My assumption could be wrong =D, and there might be another API somewhere.)
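For what it's worth, the only way I can see to drive that output format from Spark is something along these lines. This is an untested sketch: `records` is an RDD[MyRecord], `sc` is the SparkContext, MyRecord and MyWriteSupport are placeholders I made up, and the import path / setter names differ between parquet-mr versions. The point is just that, because the key type is Void, all you can do is stuff a null into the key slot:

import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.hadoop.ParquetOutputFormat

val job = Job.getInstance(sc.hadoopConfiguration)
// WriteSupport tells Parquet how to turn MyRecord instances into columns.
ParquetOutputFormat.setWriteSupportClass(job, classOf[MyWriteSupport])

records                                  // RDD[MyRecord]
  .map(r => (null: Void, r))             // the key slot has to be Void, i.e. always null
  .saveAsNewAPIHadoopFile(
    "/tmp/records.parquet",
    classOf[Void],
    classOf[MyRecord],
    classOf[ParquetOutputFormat[MyRecord]],
    job.getConfiguration)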
OK, this makes some warped sense; after all, you can project/restrict the data as it materialises out of the container file. However, just to be on the safe side: a Parquet file does not have the notion of a SequenceFile's "key" value, right?
I find this a bit odd. The Hadoop infrastructure is built around the fact that a SequenceFile may have a key, and I assume this key is used liberally to partition data into blocks for locality (not at the HDFS level, of course)? Spark has a lot of API calls that work with key/value pairs to do reductions, joins, etc. Now I have to do an extra step to map the keys out of the body of the materialised object, something like the sketch below. Weird.
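To make that extra step concrete, here is a rough sketch using the Spark SQL / Dataset API (the Event schema, paths and app name are all made up): the "key" is just another column in the row, and after reading you have to rebuild the pair RDD yourself.

import org.apache.spark.sql.SparkSession

// Hypothetical schema: the "key" (userId) is just an ordinary column of the row.
case class Event(userId: String, payload: String)

val spark = SparkSession.builder().appName("parquet-keys").master("local[*]").getOrCreate()
import spark.implicits._

// Writing: there is no key anywhere, only rows.
Seq(Event("u1", "a"), Event("u2", "b")).toDS().write.parquet("/tmp/events.parquet")

// Reading: the extra hop -- pull the key back out of the body to get a pair RDD again.
val events = spark.read.parquet("/tmp/events.parquet").as[Event].rdd
val byUser = events.keyBy(_.userId)                       // RDD[(String, Event)]
val counts = byUser.mapValues(_ => 1L).reduceByKey(_ + _) // reductions/joins work again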
So, are there any good reasons why a key is not a first-class citizen in the Parquet world?