0

Starting Point:

I have a large table with many columns, of which the first is an identifier (ID) and the second a reversed rank over this ID (Rank). Each ID can have an arbitrary number of rows displaying either a NULL or a float value in each column with up to 50+ columns. There is no limiting range for the float values. Below is a small example table already grouped by the ID column.

ID Rank Column A Column B Column C Column D Column E
1 1 NULL 0.7 2.7 NULL 0.6
1 2 0.9 1.6 0.6 NULL NULL
1 3 NULL 0.4 2.3 NULL 0.3
1 4 0.6 NULL 1.4 NULL NULL

Goal:

I want to aggregate this table grouped by the ID, taking the non-NULL values with the highest rank. If only NULL values are found, a NULL value should be used. So the output for the table above should look like this:

ID Rank Column A Column B Column C Column D Column E
1 4 0.6 0.4 1.4 NULL 0.3

Standard aggregation functions such as MAX or MIN did not work for me because, exemplarily, in column B they would take 2.7 or 0.6, respectively, and not the correct value (1.4).

My current solution is outside of SQL and iterates for each ID over all columns returning the numeric value with the highest rank or NULL if there is no numeric value for a certain ID.

Any ideas on how to solve this in SQL?

HappyHippo
  • 13
  • 4

2 Answers2

1

You can create a string based on your rank column and the column value and sort it using min or max.

For example, if the highest rank value is 999, you can create from the ranks the following strings:

001
002
...
999 

Then to this new value concatenate - and the column value. Then we can sort by it. After having the max/min we can remove the not needed part and cast to the original type.

For example, using T-SQL/SQL Server this can be done like the following:

SELECT [ID]
      --,MAX([rank])
      ,SUBSTRING(MAX(RIGHT([rank] + 1000, 3) + '-' +CAST(ColumnA AS VARCHAR(32))), 5, 32)
      ,SUBSTRING(MAX(RIGHT([rank] + 1000, 3) + '-' +CAST(ColumnB AS VARCHAR(32))), 5, 32)
      ,SUBSTRING(MAX(RIGHT([rank] + 1000, 3) + '-' +CAST(ColumnC AS VARCHAR(32))), 5, 32)
      ,SUBSTRING(MAX(RIGHT([rank] + 1000, 3) + '-' +CAST(ColumnD AS VARCHAR(32))), 5, 32)
      ,SUBSTRING(MAX(RIGHT([rank] + 1000, 3) + '-' +CAST(ColumnE AS VARCHAR(32))), 5, 32)
FROM #TempTable
GROUP BY [ID]

enter image description here

Also, please note that I am concatenating the value using + in order to ensure that if a column has no value, the concatenation will yield NULL which will be ignore by the aggregation function.

gotqn
  • 42,737
  • 46
  • 157
  • 243
0

This can be retrieved using window functions. Since you did not specify a database I used Postgresql. Here is a running example: http://sqlfiddle.com/#!17/68520/1.

select 
distinct
"ID", 
max("Rank") over (partition by "ID") "Rank",
first_value("Column_A") over (partition by "ID" order by "Column_A" is null, "Rank" desc) "Column_A",
first_value("Column_B") over (partition by "ID" order by "Column_B" is null, "Rank" desc) "Column_B",
first_value("Column_C") over (partition by "ID" order by "Column_C" is null, "Rank" desc) "Column_C",
first_value("Column_D") over (partition by "ID" order by "Column_D" is null, "Rank" desc) "Column_D",
first_value("Column_E") over (partition by "ID" order by "Column_E" is null, "Rank" desc) "Column_E"
from Table1 

To create the data I used this DDL / SQL:

CREATE TABLE Table1
("ID" int, "Rank" int, "Column_A" float, "Column_B" float, "Column_C" float, "Column_D" float, "Column_E" float);

INSERT INTO Table1
("ID", "Rank", "Column_A", "Column_B", "Column_C", "Column_D", "Column_E")
VALUES
(1, 1, NULL, 0.7, 2.7, NULL, 0.6),
(1, 2, 0.9, 1.6, 0.6, NULL, NULL),
(1, 3, NULL, 0.4, 2.3, NULL, 0.3),
(1, 4, 0.6, NULL, 1.4, NULL, NULL);

The tricky part here is the use of a window function first_value and the needed ordering of values. All values for one ID value are sorted descending by Rank. Using "Column_A" is null for ordering moves all null values to the end (at least for Postgresql).

So this SQL results in

ID Rank Column_A Column_B Column_C Column_D Column_E
1 4 0.6 0.4 1.4 (null) 0.3

Update for SQLServer

Using SQLServer this "Column_A" is null construct is not usable within order by. It has to be replaced by case when "Column_A" is null then 1 else 0 end. So the SQLServer SQL looks like:

select 
distinct
"ID", 
max("Rank") over (partition by "ID") "Rank",
first_value("Column_A") over (partition by "ID" order by case when "Column_A" is null then 1 else 0 end, "Rank" desc) "Column_A",
first_value("Column_B") over (partition by "ID" order by case when "Column_B" is null then 1 else 0 end, "Rank" desc) "Column_B",
first_value("Column_C") over (partition by "ID" order by case when "Column_C" is null then 1 else 0 end, "Rank" desc) "Column_C",
first_value("Column_D") over (partition by "ID" order by case when "Column_D" is null then 1 else 0 end, "Rank" desc) "Column_D",
first_value("Column_E") over (partition by "ID" order by case when "Column_E" is null then 1 else 0 end, "Rank" desc) "Column_E"
from Table1
wumpz
  • 8,257
  • 3
  • 30
  • 25
  • Thanks for pointing out the missing database specification: I'm referencing SQL Server and edited the post. Your suggestion is quite close, but the is null statement seems to be not working in Server SQL. Any hints on how to adjust your solution to Server SQL? – HappyHippo Apr 18 '23 at 10:32
  • @HappyHippo I updated my answer and added the SQLServer specific SQL. Hope that helps. I used SQLServer 2017. – wumpz Apr 18 '23 at 14:17