The question "How to find the COMPRESSION_CODEC used on a Parquet file at the time of its generation?" asks about identifying compression methods for the columns in the file.

Is there a method to determine (either by embedded metadata or by analysis of the file itself) the software which generated the parquet file?

Additionally, since the answers in the original question refer to a package which has been removed from the web, a pointer to an answer for the original question would be appreciated.

Context: I am analyzing a large data set to estimate potential recompression savings.

Mark Harrison
  • Does this answer your question? [How to find the COMPRESSION\_CODEC used on a Parquet file at the time of its generation?](https://stackoverflow.com/questions/57573478/how-to-find-the-compression-codec-used-on-a-parquet-file-at-the-time-of-its-gene) – Robert Harvey Jun 10 '23 at 01:06
  • @RobertHarvey sadly no, it answers part, but the answers to that question refer to software which has been removed from distribution. clarified that I'm looking for the second half. – Mark Harrison Jun 10 '23 at 01:52

1 Answer

The footer of a Parquet file contains metadata, typically including a created_by string that records which software, and which version of it, wrote the file. If the footer is not encrypted (more info on Parquet encryption here), you can simply look at a hex dump of the footer.

On Linux you can do something like:

hd myParquetFile

and have a look at the last part of the output. It will contain a string such as parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94), which tells you which software and version wrote the file.

If your file is really big, you might want to limit the output to roughly the last 110 bytes of the file and hex-dump only those:

tail --bytes=110 myParquetFile | hd
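If guessing a byte count feels fragile, the same idea can be scripted. The following is a minimal sketch using only the Python standard library, assuming the standard Parquet layout in which the file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1 (PARE indicates an encrypted footer). The function names are made up for illustration, and picking out strings that contain "version" is only a heuristic, not a real Thrift parse of the metadata.

```python
import re
import struct
import sys


def read_footer(path):
    """Return the raw (Thrift-encoded) footer bytes of a Parquet file."""
    with open(path, "rb") as f:
        # The last 8 bytes are: 4-byte little-endian footer length + magic.
        f.seek(-8, 2)
        footer_len, magic = struct.unpack("<I4s", f.read(8))
        if magic == b"PARE":
            raise ValueError("footer is encrypted; cannot inspect it directly")
        if magic != b"PAR1":
            raise ValueError("not a Parquet file (bad trailing magic)")
        # The footer sits immediately before those 8 trailing bytes.
        f.seek(-(8 + footer_len), 2)
        return f.read(footer_len)


def printable_strings(data, min_len=6):
    """Return runs of printable ASCII, similar to the `strings` utility."""
    pattern = re.compile(rb"[ -~]{%d,}" % min_len)
    return [m.decode("ascii") for m in pattern.findall(data)]


def find_created_by(path):
    """Heuristic: footer strings that look like a 'created_by' value."""
    return [s for s in printable_strings(read_footer(path)) if "version" in s]


if __name__ == "__main__":
    for candidate in find_created_by(sys.argv[1]):
        print(candidate)
```

Saved under a name of your choosing, it would be run with the Parquet file as its only argument. Because the match is based on printable runs, a candidate may carry a stray leading character from the surrounding binary encoding, so treat the output as a hint to confirm against the hex dump.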
Koedlt