I see that Parquet supports dictionary encoding on a per-column basis, and that the encoding is described in the format documentation on GitHub:
Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
The dictionary encoding builds a dictionary of values encountered in a given column. The dictionary will be stored in a dictionary page per column chunk. The values are stored as integers using the RLE/Bit-Packing Hybrid encoding. If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk.
Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.
Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width).
Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.
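To check my reading of the description above, here is a toy sketch of the idea in Python. It is not Parquet's actual implementation: the function name `dictionary_encode` is my own, and it skips the RLE/bit-packing step entirely, only reporting the bit width that step would use for the entry ids.

```python
def dictionary_encode(values):
    """Toy model of dictionary encoding (not the real Parquet writer).

    Returns (dictionary, indices, bit_width):
      dictionary - distinct values in first-seen order (the "dictionary page")
      indices    - one entry id per input value (the "data page" payload)
      bit_width  - bits needed to represent the largest entry id
    """
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    # The largest id is len(dictionary) - 1; a one-entry dictionary needs 0 bits.
    bit_width = max(len(dictionary) - 1, 0).bit_length()
    return dictionary, indices, bit_width


dictionary, indices, bit_width = dictionary_encode([10, 20, 10, 30, 20, 10])
print(dictionary, indices, bit_width)
```

So each value is stored once in the dictionary, and the column itself becomes a sequence of small integers whose width depends on the number of distinct values.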
Okay, so how do I know when to use dictionary encoding and when not to?
Is there a rule of thumb? For example, if 90% of the values in a column are expected to fall within some particular set, should I use it?
I have a use case where I expect three different scenarios for different columns:
- integer column where all values lie within a very small set → seems perfect for dictionary encoding
- integer column where 99% of values lie within a very small set but the remaining 1% are unlikely to repeat or cluster → not sure
- string column where no two values are likely to be the same → seems like dictionary encoding is a bad idea
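To make the three cases concrete, here is a back-of-envelope estimate I put together. It rests on several assumptions of mine: 4-byte integers, 16-character strings with a 4-byte length prefix, indices bit-packed at a fixed width with no RLE savings, and no page headers or compression. The helper `estimate_sizes` and the sample data are hypothetical, so the numbers are only a rough guide.

```python
import math


def estimate_sizes(values, value_bytes):
    """Return (plain_bytes, dictionary_bytes) under a simplified size model."""
    n = len(values)
    distinct = len(set(values))
    # Bits needed for the largest entry id in the dictionary.
    bit_width = max(distinct - 1, 0).bit_length()
    plain = n * value_bytes
    # Dictionary page (each distinct value once) plus bit-packed entry ids.
    dictionary = distinct * value_bytes + math.ceil(n * bit_width / 8)
    return plain, dictionary


N = 100_000
# Case 1: every value comes from a set of 8 integers.
small_set = [i % 8 for i in range(N)]
# Case 2: 99% from the same set of 8, 1% unique outliers.
mostly_small = [1_000_000 + i if i % 100 == 0 else i % 8 for i in range(N)]
# Case 3: 16-character strings, all distinct.
unique_strings = [f"id-{i:013d}" for i in range(N)]

for name, values, value_bytes in [
    ("small set", small_set, 4),
    ("99% small set", mostly_small, 4),
    ("unique strings", unique_strings, 4 + 16),
]:
    plain, dictionary = estimate_sizes(values, value_bytes)
    print(f"{name}: plain={plain} bytes, dictionary={dictionary} bytes")
```

Under this model the first two cases come out smaller with a dictionary (the outliers only widen the entry ids), while the all-unique strings come out larger than plain. I assume that last case is where the fallback to plain encoding mentioned in the documentation would kick in, but I don't know how real writers decide.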
Is there any documentation explaining which strategy is appropriate under various conditions?