0

I've got a similar table which I'm trying to pivot in Redshift:

UUID Key Value
a123 Key1 Val1
b123 Key2 Val2
c123 Key3 Val3

Currently I'm using following code to pivot it and it works fine. However, when I replace the IN part with subquery it throws an error.

select * 
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
'Key1',
'Key2',
'Key3
))

Question: What's the best way to replace the IN part with sub query which takes distinct values from Key column?

What I am trying to achieve;

select * 
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
select distinct "keys" from tbl
))
AIViz
  • 82
  • 9

2 Answers2

0

From the Redshift documentation - "The PIVOT IN list values cannot be column references or sub-queries. Each value must be type compatible with the FOR column reference." See: https://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause-pivot-unpivot-examples.html

So I think this will need to be done as a sequence of 2 queries. You likely can do this in a stored procedure if you need it as a single command.

Updated with requested stored procedure with results to a cursor example:

In order to make this supportable by you I'll add some background info and description of how this works. First off a stored procedure cannot produce results strait to your bench. It can either store the results in a (temp) table or to a named cursor. A cursor is just storing the results of a query on the leader node where they wait to be fetched. The lifespan of the cursor is the current transaction so a commit or rollback will delete the cursor.

Here's what you want to happen as individual SQL statements but first lets set up the test data:

create table test (UUID varchar(16), Key varchar(16), Value varchar(16));

insert into test values
('a123', 'Key1', 'Val1'),
('b123', 'Key2', 'Val2'),
('c123', 'Key3', 'Val3');

The actions you want to perform are first to create a string for the PIVOT clause IN list like so:

select '\'' || listagg(distinct "key",'\',\'') || '\'' from test;

Then you want to take this string and insert it into your PIVOT query which should look like this:

select * 
from (select UUID, "Key", value from test) 
PIVOT (max(value) for "key" in ( 'Key1', 'Key2', 'Key3')
);

But doing this in the bench will mean taking the result of one query and copy/paste-ing into a second query and you want this to happen automatically. Unfortunately Redshift does allow sub-queries in PIVOT statement for the reason given above.

We can take the result of one query and use it to construct and run another query in a stored procedure. Here's such a store procedure:

CREATE OR REPLACE procedure pivot_on_all_keys(curs1 INOUT refcursor)
AS
$$
DECLARE 
  row record;
BEGIN
  select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
  OPEN curs1 for EXECUTE 'select *
    from (select UUID, "Key", value from test) 
      PIVOT (max(value) for "key" in ( ' || row.keys || ' )
  );';
END;
$$ LANGUAGE plpgsql;

What this procedure does is define and populate a "record" (1 row of data) called "row" with the result of the query that produces the IN list. Next it opens a cursor, whose name is provided by the calling command, with the contents of the PIVOT query which uses the IN list from the record "row". Done.

When executed (by running call) this function will produce a cursor on the leader node that contains the result of the PIVOT query. In this stored procedure the name of the cursor to create is passed to the function as a string.

call pivot_on_all_keys('mycursor');

All that needs to be done at this point is to "fetch" the data from the named cursor. This is done with the FETCH command.

fetch all from mycursor;

I prototyped this on a single node Redshift cluster and "FETCH ALL" is not supported at this configuration so I had to use "FETCH 1000". So if you are also on a single node cluster you will need to use:

fetch 1000 from mycursor;

The last point to note is that the cursor "mycursor" now exists and if you tried to rerun the stored procedure it will fail. You could pass a different name to the procedure (making another cursor) or you could end the transaction (END, COMMIT, or ROLLBACK) or you could close the cursor using CLOSE. Once the cursor is destroyed you can use the same name for a new cursor. If you wanted this to be repeatable you could run this batch of commands:

call pivot_on_all_keys('mycursor'); fetch all from mycursor; close mycursor;

Remember that the cursor has a lifespan of the current transaction so any action that ends the transaction will destroy the cursor. If you have AUTOCOMMIT enable in your bench this will insert COMMITs destroying the cursor (you can run the CALL and FETCH in a batch to prevent this in many benches). Also some commands perform an implicit COMMIT and will also destroy the cursor (like TRUNCATE).

For these reasons, and depending on what else you need to do around the PIVOT query, you may want to have the stored procedure write to a temp table instead of a cursor. Then the temp table can be queried for the results. A temp table has a lifespan of the session so is a little stickier but is a little less efficient as a table needs to be created, the result of the PIVOT query needs to be written to the compute nodes, and then the results have to be sent to the leader node to produce the desired output. Just need to pick the right tool for the job.

===================================

To populate a table within a stored procedure you can just execute the commands. The whole thing will look like:

CREATE OR REPLACE procedure pivot_on_all_keys()
AS
$$
DECLARE 
  row record;
BEGIN
  select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
  EXECUTE 'drop table if exists test_stage;';
  EXECUTE 'create table test_stage AS select *
    from (select UUID, "Key", value from test) 
      PIVOT (max(value) for "key" in ( ' || row.keys || ' )
  );';
END;
$$ LANGUAGE plpgsql;

call pivot_on_all_keys();
select * from test_stage;

If you want this new table to have keys for optimizing downstream queries you will want to create the table in one statement then insert into it but this is quickie path.

Bill Weiner
  • 8,835
  • 2
  • 7
  • 18
  • Can you expand on "sequence of 2 queries" please. – AIViz Mar 04 '22 at 17:35
  • The issue is that the sub-query in the IN clause needs to return its results back to the leader node for recompile. This is not an allowed path and you are getting an error. So the IN clause list needs to be passed to Redshift as text in the SQL. If you are issuing from code then the "select distinct "key" from tbl" can be issued first, the results saved to a variable, and then added to the PIVOT query as plain text. Any programming language that connects to Redshift can do this, this can be done in a shell script, and I think even some benches can do this. – Bill Weiner Mar 04 '22 at 19:08
  • Or this process can be done inside of Redshift with a stored procedure. Read the list of distinct keys from a query and then issue a new query with this result as part of the SQL. – Bill Weiner Mar 04 '22 at 19:10
  • This is trickier than I thought. How can I store the output of a query to a variable and then use it in pivot query? I converted the distinct keys into a list using listagg function but not sure how to use the result as a TEXT in second part. Struggling to get it working. Do you mind posting a sample script please? Thanks in advance. – AIViz Mar 05 '22 at 20:08
  • Sure. It'll be a day or 2 b/c there's a lot going on this weekend. Are you looking to do this in Redshift with a stored procedure? I'll run with this assumption unless I hear otherwise. – Bill Weiner Mar 05 '22 at 22:58
  • If you need it sooner here's an answer that is somewhat similar to doing a stored procedure like we're talking about. Will take some modification but has most of the steps (and few unneeded ones). https://stackoverflow.com/questions/65492621/how-to-join-system-tables-or-information-schema-tables-with-user-defined-tables/65500498#65500498 – Bill Weiner Mar 05 '22 at 23:03
  • Yes, i will do it using stored procedure. Thank you very much! Having a look now on the link you shared. – AIViz Mar 06 '22 at 12:23
  • This is magic. Thank you!! From what I'm reading online it's best to use temp tables rather than cursors. As I want to update another table based on the results how can I insert the results into a table? I tried doing it but struggling a bit... THANK YOU FOR ALL YOUR HELP – AIViz Mar 10 '22 at 14:17
  • test_tbl -> procedure to pivot and generate results -> drop staging table -> create table as (select * from result_pivot_query) -> update main_tbl from staging table. This is the process I'm trying to follow. Staging table doesn't need to be temporary – AIViz Mar 10 '22 at 14:28
  • If you are reading the data out (like a select statement) then cursors is the way to go. If you want to populate a table (temp or otherwise) then yes, writing a table is far better. This is because there is no optimized path from the cursor on the leader node back to the compute nodes. I think I still have a buffer around with this code and can pound out a table example quickly. – Bill Weiner Mar 10 '22 at 16:16
  • Populating a table is what I want. Can't thank you enough for your help – AIViz Mar 10 '22 at 16:19
  • You're welcome - just paying it forward for all those that helped me – Bill Weiner Mar 10 '22 at 17:54
0

A little off-topic, but I wonder why Amazon couldn't introduce a simpler syntax for pivot. IMO, if GROUP BY is replaced by PIVOT BY, it can give enough hint to the interpreter to transform rows into columns. For example:

SELECT partname, avg(price) as avg_price FROM Part GROUP BY partname;

can be written as:

SELECT partname, avg(price) as avg_price FROM Part PIVOT BY partname;

Even multi-level pivoting can also be handled in the same syntax.

SELECT year, partname, avg(price) as avg_price FROM Part PIVOT BY year, partname;