I have a data set that includes some information about the status of product feature flags for a given account. It looks approximately like this
+-----------+------------------------------------------------------------------------------------+
| Account | Flags |
+-----------+------------------------------------------------------------------------------------+
| Account A | {"feature_flag_a":"enabled","feature_flag_b":"enabled"} |
| Account B | {"feature_flag_a":"enabled","feature_flag_x":"enabled","feature_flag_y":"enabled"} |
+-----------+------------------------------------------------------------------------------------+
Some information about feature flags
- They are not finite - new flags are added often
- There's many flags (up to hundreds)
From my reading, the "correct" way to model this data would be to add it to the account
dimension table like so:
+----+--------------+------------------------+------------------------+------------------------+------------------------+
| id | account_name | feature_flag_a_enabled | feature_flag_b_enabled | feature_flag_x_enabled | feature_flag_y_enabled |
+----+--------------+------------------------+------------------------+------------------------+------------------------+
| 1 | Account A | 1 | 1 | 0 | 0 |
| 2 | Account B | 1 | 0 | 1 | 1 |
+----+--------------+------------------------+------------------------+------------------------+------------------------+
However this will result in a dimension table that could be hundreds of columns wide. It'll also make it sub-optimal to answer questions like 'how many accounts have any flag enabled?' or 'how many accounts have all enabled'. Not to mention the difficulty of creating the transform in SQL with a dynamic schema.
I have tried creating a dense, long table that looks like the following:
+-------------+-------+--------+
| account_id | flag | status |
+-------------+-------+--------+
| 1 | a | 1 |
| 1 | b | 1 |
| 1 | x | 0 |
| 1 | y | 0 |
| 2 | a | 1 |
| 2 | b | 0 |
| 2 | x | 1 |
| 2 | y | 1 |
+-------------+-------+--------+
While this is useful for analysis, it doesn't really fit into a star schema.
So, my question is, how do I correctly model this data? Is columns appended to the dimension table the correct answer or is there an option I haven't considered?