Choice of a data type for a Paquet join columns (i.e., keys)

Asked Mar 16 '20 at 08:53

Active Mar 16 '20 at 08:53

Viewed 53 times

With RDBMS we usually use a numeric columns for keys (both foreign and primary), as it allows for a better joined query performance and smaller resource usage, in most cases, than other data types (like strings).

The question is, what should be the data type of choice for the key columns in a Parquet tables? Can we go like this:

SELECT * FROM parquet_table1 JOIN parquet_table2 ON t1_string_pk = t2_string_fk

What is the best practice here?

The reason for this question is that when loading data into a data warehouse, any numeric key column (for a target table) requires a key table lookup ([source system, source key] -> surrogate key), and the string key column does not; we may use source key concatenation to get us a string surrogate key value.

asked Mar 16 '20 at 08:53

Matvey Zhuravel

Choice of a data type for a Paquet join columns (i.e., keys)

0 Answers0