
We have generated a Parquet file with Dask (Python) and with Drill (from R, using the sergeant package). We have noticed a few issues:

  1. The Dask (i.e. fastparquet) output has a _metadata and a _common_metadata file, while the Parquet output from R / Drill does not have these files and has parquet.crc files instead (which can be deleted). What is the difference between these Parquet implementations?
skibee
  • I understand that there are various [parquet versions](http://matthewrocklin.com/blog/work/2017/06/28/use-parquet) but it is difficult to understand the differences – skibee Jul 31 '17 at 12:32
  • You should post these three questions as separate ones on Stack Overflow. Posting multiple ones as a single instance is quite hard to answer and integrate into the SO UI. – Uwe L. Korn Jul 31 '17 at 12:54
  • Thx for the input - Will do so – skibee Jul 31 '17 at 13:16

1 Answer


(Only answering question 1; please post separate questions to make them easier to answer.)

_metadata and _common_metadata are helper files that are not required for a Parquet dataset. They are used by Spark/Dask/Hive/... to infer the metadata of all Parquet files in a dataset without needing to read the footer of every file. In contrast, Apache Drill generates a similar file in each folder (on demand) that contains the footers of all Parquet files. All files are read only on the first query on a dataset; further queries read only the file that caches the footers.

Tools using _metadata and _common_metadata should be able to leverage them for faster execution, but should not depend on them for operation. If the files do not exist, the query engine simply needs to read all footers.
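To make the fallback concrete, here is a minimal sketch using only the Python standard library. It is not how fastparquet or Drill implement this; the function names and the `*.parquet` file pattern are my own assumptions for illustration. It relies only on the fact that a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`.

```python
import struct
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def footer_length(path):
    """Return the footer size in bytes, read from the last 8 bytes of the file."""
    with open(path, "rb") as f:
        f.seek(-8, 2)  # 2 = seek relative to the end of the file
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    # The first 4 of those 8 bytes hold the footer length, little-endian.
    return struct.unpack("<I", tail[:4])[0]


def files_to_open_for_metadata(dataset_dir):
    """Hypothetical helper: which files must be opened to learn the metadata?

    If a _metadata summary file exists, that one file is enough. Otherwise
    every data file's footer has to be read (assumes a *.parquet suffix).
    """
    root = Path(dataset_dir)
    summary = root / "_metadata"
    if summary.is_file():
        return [summary]
    return sorted(root.glob("*.parquet"))
```

On a directory written by Dask this would return just the `_metadata` path, while on a directory without that file it returns every data file, which is why the first metadata load is slower there.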

Uwe L. Korn
  • Quite correct. In addition, fastparquet (the library that Dask will have used to create the files) can also read a list of Parquet data files without a `_metadata` file, but initially loading the metadata will be slower. Spark used to make these files but no longer does; I believe Hive still does. – mdurant Jul 31 '17 at 13:46