-1

My goal is to find the top 1000 values for any column in any table, and how frequently the values appear across the dataset in a given timeframe.

I'm expecting the output:

<column_name>, PercentOfDataSet
Value1, X.XX
Value2, X.XX
Value3, X.XX

Currently I'm doing this in 2 steps, to get the total number of rows in one query, and then divide by that number on the following one:

sql_row_count := 'select count(*) from '|| table_name ||' where '|| date_column ||' > DATEADD(day, -365, getdate());';
total_row_count := (execute immediate :sql_row_count);
sql_query := 'select '|| column_name ||', iff('|| total_row_count ||' = 0, 0.00, cast(count(*) as numeric(18,2))/'|| total_row_count ||'*100) PercentOfDataSet' ||
                 ' from '|| table_name ||' where '|| date_column ||'> DATEADD(day, -365, getdate())' ||
                 ' group by 1 order by 2 desc limit 1000;';

Is there a better way to calculate this with SQL?

(this is based on the code found on Snowflake dynamic SQL: Storing query results as variables between executions)

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325

1 Answers1

1

Instead of doing this in 2 queries, you can use ratio_to_report() over() to add all the counts of each value appearance:

with data as
    (select $1 a1 from values(1), (1), (3), (3), (2), (1), (3), (4), (5), (1), (1))

select * exclude(c)
from (
    select a1, count(*) c, ratio_to_report(c) over() PercentOfDataSet
    from data
    group by a1
    order by PercentOfDataSet desc
    limit 3
)

enter image description here

With this query you can avoid the 2 steps in Snowflake dynamic SQL: Storing query results as variables between executions (and you can incorporate the logic for the date window).

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325