How do I get the 1000 most common values in a column, and report the percentage that each represents?

Question

My goal is to find the top 1000 values for any column in any table, and how frequently the values appear across the dataset in a given timeframe.

I'm expecting the output:

<column_name>, PercentOfDataSet
Value1, X.XX
Value2, X.XX
Value3, X.XX

Currently I'm doing this in 2 steps, to get the total number of rows in one query, and then divide by that number on the following one:

sql_row_count := 'select count(*) from '|| table_name ||' where '|| date_column ||' > DATEADD(day, -365, getdate());';
total_row_count := (execute immediate :sql_row_count);
sql_query := 'select '|| column_name ||', iff('|| total_row_count ||' = 0, 0.00, cast(count(*) as numeric(18,2))/'|| total_row_count ||'*100) PercentOfDataSet' ||
                 ' from '|| table_name ||' where '|| date_column ||'> DATEADD(day, -365, getdate())' ||
                 ' group by 1 order by 2 desc limit 1000;';

Is there a better way to calculate this with SQL?

(this is based on the code found on Snowflake dynamic SQL: Storing query results as variables between executions)

score 1 · Answer 1 · answered May 26 '23 at 23:08

Instead of doing this in 2 queries, you can use ratio_to_report() over() to add all the counts of each value appearance:

with data as
    (select $1 a1 from values(1), (1), (3), (3), (2), (1), (3), (4), (5), (1), (1))

select * exclude(c)
from (
    select a1, count(*) c, ratio_to_report(c) over() PercentOfDataSet
    from data
    group by a1
    order by PercentOfDataSet desc
    limit 3
)

With this query you can avoid the 2 steps in Snowflake dynamic SQL: Storing query results as variables between executions (and you can incorporate the logic for the date window).

https://docs.snowflake.com/en/sql-reference/functions/ratio_to_report

How do I get the 1000 most common values in a column, and report the percentage that each represents?

1 Answers1

Linked