Single vs. multiple ID columns in data warehouse/lake

Question

I have setup a time-series / events database using the AWS Firehose -> S3/Glue -> Athena stack. It is being used to track various user actions - session started, action performed etc. across a number of our products. My question is about how best to store different types of IDs in this system.

The existing schema is one big 'fact table' with a bunch of different columns. Two of the most important columns are event_type_id and object_id. To use StackOverflow as an example, two events might be:

question_asked - in this case I would be storing the question id in the object_id column.
tag_created - in this case I would be storing the tag id in the object_id column.

My question is - is storing multiple different types of IDs in the same column bad practice? It's working OK for us at the moment, but it does require the person/system performing queries to know what type of object the object_id column refers to, based on the event they are querying.

If bad practice, what other approaches might be better? Multiple columns where they are NULL if not relevant for the event in that row? Or is this where dimension tables would be a better fit?

score 2 · Accepted Answer · answered Sep 03 '19 at 04:03

This isn't necessarily bad practice, depending on how you use it.

It sounds like you're aware of the potential pitfalls of such an approach (i.e. users of the data have to be aware of the context - in this case "event type" - to use the values correctly), so as you're using Athena you could mitigate that by creating views over source table for different event types, inserting a WHERE clause filter on event type and possibly renaming object_id to something more context specific e.g. question_id.

This makes it easier for users to work with the data and understand exactly what the values are they're working with.

In a big data environment I wouldn't recommend creating dimension tables if it can be avoided as JOINs between tables start to get expensive. Having multiple columns for different ids is possible but then you create new problems for users such as having to account for NULL values in an Id column, and this also potentially makes it harder to add new event types and ids as you have to change the schema to accommodate them.

Single vs. multiple ID columns in data warehouse/lake

1 Answers1