
I have been testing various compression algorithms with Parquet files and have settled on Zstd.

Now, as far as I understand, Zstd uses an adaptive dictionary unless one is explicitly specified, so it starts from an empty dictionary. However, with the dictionary enabled, both the compressed size and the execution time are quite unsatisfactory.


The file size without a dictionary is considerably smaller than with the adaptive one (the number at the end of each name is the compression level):

  • Name: C:\ParquetFiles\Zstd1 Execution time: 279 ms Size: 13738134
  • Name: C:\ParquetFiles\Zstd2 Execution time: 140 ms Size: 13207017
  • Name: C:\ParquetFiles\Zstd9 Execution time: 511 ms Size: 12701030

And for comparison the log from using the adaptive dictionary:

  • Name: C:\ParquetFiles\ZstdDictZstd1 Execution time: 487 ms Size: 19462825
  • Name: C:\ParquetFiles\ZstdDictZstd2 Execution time: 402 ms Size: 19292513
  • Name: C:\ParquetFiles\ZstdDictZstd9 Execution time: 614 ms Size: 19072779

Can you help me understand what is going on here? Shouldn't the output with an empty dictionary perform at least as well as Zstd compression with the dictionary disabled?
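
Roughly, this is how I configure the writer for the two variants. This is a simplified sketch assuming ParquetSharp's WriterPropertiesBuilder API rather than my exact code; the columns, data and paths here are placeholders:

```csharp
using ParquetSharp;

// Placeholder schema and data; the real columns come from the DataTable.
var columns = new Column[] { new Column<uint?>("Id"), new Column<string>("Name") };
var ids = new uint?[] { 1, 2, null };
var names = new[] { "a", "b", null };

void Write(string path, WriterProperties properties)
{
    using var writer = new ParquetFileWriter(path, columns, properties);
    using var rowGroup = writer.AppendRowGroup();
    using (var idWriter = rowGroup.NextColumn().LogicalWriter<uint?>())
        idWriter.WriteBatch(ids);
    using (var nameWriter = rowGroup.NextColumn().LogicalWriter<string>())
        nameWriter.WriteBatch(names);
    writer.Close();
}

// "Zstd9" variant: Zstd level 9 with the dictionary turned off.
Write(@"C:\ParquetFiles\Zstd9", new WriterPropertiesBuilder()
    .Compression(Compression.Zstd)
    .CompressionLevel(9)
    .DisableDictionary()
    .Build());

// "ZstdDictZstd9" variant: same codec and level, dictionary left enabled.
Write(@"C:\ParquetFiles\ZstdDictZstd9", new WriterPropertiesBuilder()
    .Compression(Compression.Zstd)
    .CompressionLevel(9)
    .EnableDictionary()
    .Build());
```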

  • What type of data are you dealing with? In particular what is the schema and what is in each column? Could you also please add the code you used to generate the files? – 0x26res May 13 '22 at 08:06
  • My data source is SQL Server, and I use it to fill a DataTable in C#. I also use a custom column schema so that each System data type maps to a compliant column type, for example: `case "System.UInt32": clmscheme.DataType = typeof(uint).AsNullableType(); var listuint = new uint?[rowcount]; clmscheme.Values = listuint; binaryWriterDelegates.Add((n, val) => listuint[n] = GetValueOrNull(val)); break;` (a fuller sketch of this pattern follows the comments). To generate the compressed Parquet files I am using ParquetSharp, since it wraps the native C++ Parquet library and exposes the compressions offered by Apache Parquet. – SomewhatInterested May 17 '22 at 06:57
  • Regardless of whether I use the System.Data schema or my custom schema, the generated Parquet files with and without a dictionary for Zstd compression differ by the values mentioned above. I am trying to understand whether I can improve the compression by training a dictionary, since I am dealing with large tables (~90 MB) and I've read that dictionary training is meant for small files. But it seems illogical for the compression to vary to such a degree, considering that an empty dictionary should yield results at least similar to using no dictionary at all. – SomewhatInterested May 17 '22 at 07:06
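
A fuller sketch of the column-buffering pattern mentioned in the comment above. The helper here is a hypothetical stand-in for the `GetValueOrNull` referenced in the snippet, not the actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Data;

var table = new DataTable();
table.Columns.Add("Count", typeof(uint));
table.Rows.Add(3u);
table.Rows.Add(DBNull.Value);

int rowCount = table.Rows.Count;
var fillers = new List<Action<int, object>>();   // one filler delegate per column, in column order
var buffers = new Dictionary<string, Array>();   // typed buffer per column, later handed to the Parquet writer

foreach (DataColumn column in table.Columns)
{
    switch (column.DataType.FullName)
    {
        case "System.UInt32":
            var listuint = new uint?[rowCount];
            buffers[column.ColumnName] = listuint;
            fillers.Add((n, val) => listuint[n] = GetValueOrNull<uint>(val));
            break;
        // ... one case per supported System type, so every column gets a filler ...
    }
}

// Copy the DataTable values into the typed buffers row by row.
for (int row = 0; row < rowCount; row++)
    for (int col = 0; col < table.Columns.Count; col++)
        fillers[col](row, table.Rows[row][col]);

// Hypothetical stand-in for the GetValueOrNull helper used in the comment above.
static T? GetValueOrNull<T>(object value) where T : struct
    => value is null || value == DBNull.Value ? null : (T?)Convert.ChangeType(value, typeof(T));
```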

0 Answers