When should I use dictionary encoding in parquet?

Question

I see that Parquet supports dictionary encoding on a per-column basis, and that dictionary encoding is described in the GitHub documentation:

Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)

The dictionary encoding builds a dictionary of values encountered in a given column. The dictionary will be stored in a dictionary page per column chunk. The values are stored as integers using the RLE/Bit-Packing Hybrid encoding. If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk.

Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.

Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width).

Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.
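To make the quoted page formats concrete, here is a rough back-of-the-envelope sketch of the sizes involved for a fixed-width column. It is only an estimate under stated assumptions: the function names are made up for illustration, the index size is a pure bit-packing upper bound that ignores RLE runs, page headers and compression, and it does not model the fallback to plain encoding.

```python
import math


def dictionary_bit_width(num_distinct: int) -> int:
    """Bits needed per entry id when the dictionary has `num_distinct` entries."""
    if num_distinct <= 1:
        return 0
    return (num_distinct - 1).bit_length()


def estimate_sizes(values, value_bytes=4):
    """Return (plain_bytes, dictionary_bytes) estimates for a fixed-width column.

    Assumption: every value occupies `value_bytes` bytes in plain encoding
    (4 for a 32-bit integer).
    """
    distinct = len(set(values))
    width = dictionary_bit_width(distinct)
    plain_bytes = len(values) * value_bytes
    dictionary_page = distinct * value_bytes             # entries stored with plain encoding
    index_bytes = math.ceil(len(values) * width / 8)     # bit-packed entry ids, no RLE
    return plain_bytes, dictionary_page + index_bytes


few_distinct = estimate_sizes([1, 2, 3, 4] * 250)   # 1000 values, 4 distinct
all_distinct = estimate_sizes(list(range(1000)))    # 1000 values, all distinct
print(few_distinct, all_distinct)
```

With few distinct values the dictionary estimate is far below the plain one; with all-distinct values it is larger, because every value is stored once in the dictionary and then referenced by an index as well.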

Okay, so how do I know when to use dictionary encoding and when not to?

Is there a rule of thumb? For example, if 90% of the values in a column are expected to fall within some particular set, should I use it?

I have a use case where I expect three different scenarios for different columns:

integer column where all values lie within a very small set → seems perfect for dictionary encoding
integer column where 99% of values lie within a very small set but 1% are unlikely to form any clustering → not sure
string column where no value is likely to be the same → seems like dictionary encoding is a bad idea

Is there any documentation explaining which strategy is appropriate under various conditions?

It probably makes sense to keep the dictionary feature enabled, since it gets disabled automatically for columns that don't have repeated values: "If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding". - collimarco, Jun 01 '23 at 11:34

score 0 · Answer 1 · answered Oct 30 '20 at 17:51

I'm not aware of any documentation (on the Arrow side at least) that recommends when to use dictionary encoding and when not to. It's a good question and your instincts are reasonable. You could try writing those kinds of data both ways and comparing file size and read/write speed. I'd be interested to see what you find.
