My source is JSON with nested arrays & structures. (examples at end of post)
Large volume of new data streaming in real time (20m/day).
I have to decide how to store this data, considering:
-- End users want to use 'traditional' SQL
-- Performance (ingestion & query)
-- Load on Cluster
As far as I can see, my options are to make use of the SUPER data type, or to just convert everything to traditional relational tables & types.
(Even if I store the full JSON as a SUPER, I still have to serialize critical attributes into regular columns for the purposes of distribution/sort -- table sketch below.)
Regardless, I've been trying to weigh up the pros & cons of SUPER vs. 'traditional'.
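Roughly what I have in mind for the SUPER approach -- the id/timestamp columns and the dist/sort keys are just placeholders for whatever the real critical attributes turn out to be:

    CREATE TABLE raw_events (
        event_id      VARCHAR(64),   -- extracted from the JSON, placeholder distribution key
        event_ts      TIMESTAMP,     -- extracted from the JSON, placeholder sort key
        event_payload SUPER          -- the full JSON document
    )
    DISTKEY (event_id)
    SORTKEY (event_ts);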
(1) Store full JSON as SUPER type
-- Very easy to ingest data, with low load on the cluster
-- Possibly additional load on the cluster & a performance impact when end users execute queries?
-- End users would have to learn PartiQL and deal with unnesting & serialization etc. (example query below)
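For example, a typical end-user query against the SUPER column would have to navigate/unnest the arrays along these lines (raw_events is the table sketched above; channel_id/tag field names are made up, since the actual struct attributes aren't shown in the schema at the end of the post):

    SELECT c.channel_id::VARCHAR AS channel_id,   -- cast SUPER value to a scalar type
           COUNT(*)              AS tag_count
    FROM   raw_events AS r,
           r.event_payload.events AS e,           -- unnest events array
           e.channels AS c,                       -- unnest channels array
           c.tags AS t                            -- unnest channel-level tags array
    WHERE  r.event_ts >= '2024-01-01'
    GROUP BY 1;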
(2) 'Traditional' Relational Tables & Types
(a) Load as SUPER, but then use PartiQL to unnest, serialize and store in relational tables (sketch below)
-- Additional continuous load on the cluster
-- Easy to implement (INSERT INTO ... SELECT)
-- Would result in some massive tables for the 'tag' nodes
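For (a), the continuous transform would basically be a scheduled PartiQL statement along these lines (channel_tags and its columns are placeholder names):

    INSERT INTO channel_tags (event_id, channel_id, tag_key, tag_value)
    SELECT r.event_id,
           c.channel_id::VARCHAR,
           t.tag_key::VARCHAR,
           t.tag_value::VARCHAR
    FROM   raw_events AS r,
           r.event_payload.events AS e,
           e.channels AS c,
           c.tags AS t;

In practice it would also need some kind of watermark/filter so the same rows aren't re-processed on every run.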
(b) Use a Lambda to pre-unnest & serialize the JSON, then insert/COPY directly into relational tables (COPY sketch below)
-- The Lambda would be invoked continuously
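For (b), assuming the Lambda writes the flattened rows out to S3 (e.g. as Parquet) rather than inserting directly, the Redshift side would just be a plain COPY -- bucket, prefix and IAM role below are placeholders:

    COPY channel_tags
    FROM 's3://<my-bucket>/flattened/channel_tags/'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
    FORMAT AS PARQUET;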
(3) Redshift Spectrum
If I am converting to a relational structure anyway (2b), I could simply store the flattened data in S3 and utilize Redshift Spectrum to query it (DDL sketch below)
-- No load on the cluster to ingest/process incoming data
-- Cost/maintenance of the Lambda (or other process) to transform the JSON
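i.e. instead of loading at all, register the flattened S3 data as an external table, something like this (schema, database, role and location are placeholders):

    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'events_db'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<spectrum-role>'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    CREATE EXTERNAL TABLE spectrum.channel_tags (
        event_id   VARCHAR(64),
        channel_id VARCHAR(64),
        tag_key    VARCHAR(128),
        tag_value  VARCHAR(256)
    )
    STORED AS PARQUET
    LOCATION 's3://<my-bucket>/flattened/channel_tags/';

End users would then query spectrum.channel_tags with ordinary SQL.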
Questions:
Are the above understandings correct?
Any other considerations not listed?
Is there a 'standard' approach for this scenario?
Any other guidance/wisdom welcome!
Background info
Schema:
events: array[
    struct{
        channels: array[
            struct{
                tags: array[ struct{} ]
            }
        ]
        tags: array[ struct{} ]
    }
]