
In Redshift we have a table (let's call it entity) which, among other columns, has two important ones: hierarchy_id & entity_timestamp. The hierarchy_id is a combination of the ids of three hierarchical dimensions (A, B, C; each having a one-to-many relationship with the next one).
Thus: hierarchy_id == A.a_id || '-' || B.b_id || '-' || C.c_id
Additionally, the table is distributed according to DISTKEY(hierarchy_id) and sorted using COMPOUND SORTKEY(hierarchy_id, entity_timestamp).

Over this table we need to generate multiple reports; some of them are fixed to the deepest level of the hierarchy, while others will be filtered by the higher parts and group the results by the lower ones. However, the first layer of the hierarchy (the A dimension) is what defines our security model: users will never have access to A dimensions other than the one they belong to (this is our tenant information).
The current design proved useful for that matter when we were prototyping the reports in plain SQL, as we could do things like this for the deepest-level queries:

WHERE
  entity.hierarchy_id = 'fixed_a_id-fixed_b_id-fixed_c_id' AND
  entity.entity_timestamp BETWEEN 'start_date' AND 'end_date'

Or like this for filtering by other points of the hierarchy:

WHERE
  entity.hierarchy_id LIKE 'fixed_a_id-%' AND
  entity.entity_timestamp BETWEEN 'start_date' AND 'end_date'

Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.

Now we want to use QuickSight to create and share those reports using its embedding capabilities, but we haven't found a way to filter the data of the analysis the way we want.
We tried to use RLS by tags for anonymous users, but we found two problems:

  1. How to inject the A.a_id part of the filter into the API call that generates the embedding URL in a secure way (i.e., so that users can't change it), while still allowing them to configure the other parts of the hierarchy, and then combining those independent pieces in the filter without needing to generate a new URL each time users change the other parts.
    (We may be able to live with this limitation, however.)
  2. How to do partial filters, i.e., the ones that looked like LIKE 'fixed_a_id-fixed_b_id-%', since it seems RLS always applies an equality condition.

Is there any way to make QuickSight work as we want with our current table design, or would we need to change the design?
For the latter, we have thought of keeping the three dimension ids as separate columns; that way we could add RLS for the A.a_id column and use parameters for the other ones. The problem would be the reports that group by the lower parts of the hierarchy: it is not clear how we could define the DISTKEY and SORTKEY so that those queries are properly optimized.

1 Answer


COMPOUND SORTKEY(hierarchy_id, entity_timestamp)

Are you aware that you are sorting on only the first eight bytes of hierarchy_id, and that the zone map's ability to differentiate between blocks is based purely on those first eight bytes of the string?
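If so, a quick way to see how much those eight bytes can actually distinguish in your data is to count the distinct 8-byte prefixes present (a diagnostic sketch of mine, not something from the docs):

```sql
-- If this returns very few values (e.g. one per tenant), the 8-byte
-- min/max stored in the zone map can barely tell blocks apart.
SELECT SUBSTRING(hierarchy_id, 1, 8) AS prefix,
       COUNT(*)                      AS row_count
FROM entity
GROUP BY 1
ORDER BY 2 DESC;
```

In your layout, every row for a given tenant starts with the same `fixed_a_id-` prefix, so within a tenant the prefixes may be identical or nearly so.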

I suspect you would have done a lot better to have had three separate columns.
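A sketch of what that could look like (column names and lengths are assumptions on my part, taken from your description):

```sql
CREATE TABLE entity (
    a_id             VARCHAR(32) NOT NULL,  -- tenant / security dimension
    b_id             VARCHAR(32) NOT NULL,
    c_id             VARCHAR(32) NOT NULL,
    entity_timestamp TIMESTAMP   NOT NULL
    -- ...other columns...
)
DISTKEY (a_id)
COMPOUND SORTKEY (a_id, b_id, c_id, entity_timestamp);
```

Each sort-key column then gets its own zone map, so an equality filter on the leading columns can prune blocks without any string-prefix tricks, and a filter on `(a_id, b_id)` alone plays the role your `LIKE 'fixed_a_id-fixed_b_id-%'` did.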

Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.

I may be wrong - I would need to check - but I think that if you use operators of any kind (such as functions, or LIKE, or even addition or subtraction) on a sort key, the zone map does not operate and you read all blocks.

Also, in your case it may be (I've not tried using it yet) that because you're using LIKE, if you have AQUA enabled, your entire query is being processed by AQUA. The performance consequences of this, positive and/or negative, are completely unknown to me.

Have you been using the system tables to verify your expectations of what is going on with your queries when it comes to zone map use?
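For example (a sketch; `stl_scan` is the relevant system table here, and you'd get the query id from `stl_query`):

```sql
-- Rows read from disk vs. rows surviving the scan filter: if
-- rows_pre_filter is close to the full table size, the zone map
-- pruned no blocks for that scan step.
SELECT query, segment, step,
       SUM(rows_pre_filter) AS rows_read,
       SUM(rows)            AS rows_after_filter
FROM stl_scan
WHERE perm_table_name = 'entity'
GROUP BY 1, 2, 3
ORDER BY query DESC;
```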

the problem would be for the reports that group by lower parts of the hierarchy, it is not clear how we could define the DISTKEY and SORTKEY so that the queries are properly optimized.

You are now facing the fundamental nature of sorted column-store; the sorting you choose defines the queries you can issue and so also defines the queries you cannot issue.

You either alter your data design, in some way, so what you want becomes possible, or you can duplicate the table in question where each duplicate has different sorting orders.

The first is an art, the second has obvious costs.

As an aside, although I've never used QuickSight, my experience with all SQL generators has been that they are completely oblivious to sorting, and so the SQL they issue cannot be used on Big Data (as sorting is the method by which Big Data can be handled in a timely manner).

If you do not have Big Data, you'll be fine, but the question then is why are you using Redshift?

If you do have Big Data, the only solution I know of is to create a single aggregate table per dashboard, about 100k rows, and have the given dashboard use and only use that one table. The dashboard should normally simply read the entire table, which is fine, and then you avoid the nightmare SQL it normally will produce.
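A sketch of that pattern, with the grouping columns assumed from your hierarchy:

```sql
-- One small, pre-aggregated table per dashboard, refreshed on a schedule.
CREATE TABLE dashboard_entity_daily
DISTSTYLE ALL  -- small enough to be copied to every node
AS
SELECT a_id, b_id, c_id,
       DATE_TRUNC('day', entity_timestamp) AS day,
       COUNT(*) AS entity_count
FROM entity
GROUP BY 1, 2, 3, 4;
```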

  • _"You are aware you are sorting on only the first eight bytes of hierarchy_id?"_ Really? Do you have a source or something? I didn't find anything, but then I also didn't find any examples of sorting on a `VARCHAR` column. If you are right, then we need to rethink the design, because we are not winning anything at all; I was just hoping it would do an alphanumeric sort and that the query engine would be smart enough to see that the `LIKE` had the wildcard at the end, so it could still use the sorting. About **QuickSight**, the AWS line is that it should work as expected, but we will test. – Luis Miguel Mejía Suárez Aug 22 '21 at 14:43
  • Yes, really. This is even in the AWS docs, which is saying something: I have an *extremely* dim view of the docs. A whole bunch of the other data types are not sorted in the way you'd expect either. I'm about to release a white paper which describes it all. –  Aug 22 '21 at 14:44
  • If you ask AWS about Redshift, they will *always* say "yes". The only way you can ever get a "no" is to *already know* something can't work and explain it to them. You are on your own with Redshift - you need to get expert advice and if you do not, if your system has a reasonable volume, performance will be poor and erratic. –  Aug 22 '21 at 14:45
  • _"Yes, really. This is even in the AWS docs"_ can you please provide the link? I have been reading the docs day and night during the last week and I don't recall reading something about that. Although being honest the **AWS** docs are one of the worse I have ever seen. – Luis Miguel Mejía Suárez Aug 22 '21 at 14:47
  • I've just been looking and I'm having trouble finding it. There was something in there about how if you have identical leading prefixes on strings, you'll get no sorting, because only the first eight bytes are being used. –  Aug 22 '21 at 14:49
  • It's often hard to find stuff in the docs. Nearest I can find now is this https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html - go down to interleaved sorting and you'll find a mention of a "limited number of characters". It's awful. The problem with the docs is that they obfuscate everything, which is not a strength. –  Aug 22 '21 at 14:53
  • BTW, I'm a Redshift specialist. I started using it the day it went live and I've been working on a book about Redshift internals for the last three years. Redshift really is using the first eight bytes - you can trust me on this one. –  Aug 22 '21 at 14:54
  • Amazing. I didn't think the docs could get worse, and then they do. Here, I found a blog post. BTW don't put too much faith in the rest of the blog post. https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-compound-and-interleaved-sort-keys/ –  Aug 22 '21 at 15:00
  • Quote : "The values within the block are prefixed with a string 8 bytes or longer. The minimum and maximum values in the block header allow for 8 bytes of data to track each value. For long strings, we take the first 8 characters as a prefix. Example: I filter on a column that stores URLs that are always prefixed with: ‘https://’ so the minimum and maximum values are always the same prefix, regardless of the rest of the URL." –  Aug 22 '21 at 15:00
  • But that's the tip of the iceberg. You need to understand how each data type sorts. About half sort as you'd intuitively expect; the other half do not. –  Aug 22 '21 at 15:05
  • Hey, thank you very much for the detailed answer and comments. I will communicate this to the team and redesign our tables. We will probably go with multiple columns. Quick question: if we have a compound key of `(A.a_id, B.b_id, C.c_id, entity_timestamp)`, most queries will ask for the four keys, but some of them will only ask for `A.a_id` & `entity_timestamp`. Is there a way to tell Redshift to pick all values in the intermediate sort columns so that it will still sort on the last one? Or would we rather need to move `entity_timestamp` higher in the order? – Luis Miguel Mejía Suárez Aug 22 '21 at 15:44
  • You will need to move `entity_timestamp`. However, there is one other way: maintain a second copy of the table, with the sort key of `(A.a_id, entity_timestamp)`, and direct queries which only specify those two columns to that copy of the table. –  Aug 22 '21 at 16:38
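  • That second copy could be as simple as (a sketch; the table name is mine):

```sql
-- Duplicate of entity, sorted for tenant + time-range queries only.
CREATE TABLE entity_by_tenant_time
DISTKEY (a_id)
COMPOUND SORTKEY (a_id, entity_timestamp)
AS
SELECT * FROM entity;
```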