With RDBMS we usually use a numeric columns for keys (both foreign and primary), as it allows for a better joined query performance and smaller resource usage, in most cases, than other data types (like strings).
The question is, what should be the data type of choice for the key columns in a Parquet tables? Can we go like this:
SELECT * FROM parquet_table1 JOIN parquet_table2 ON t1_string_pk = t2_string_fk
What is the best practice here?
The reason for this question is that when loading data into a data warehouse, any numeric key column (for a target table) requires a key table lookup ([source system, source key] -> surrogate key), and the string key column does not; we may use source key concatenation to get us a string surrogate key value.