How to select only columns with more than one distinct value?

Question

SELECT DISTINCT … removes duplicate rows. Is there any way to remove duplicate columns, that is, columns where every value is the same? I'm looking through a legacy schema for the "distinguishing features" of some rows, so this would be super helpful. If it makes it any easier all of the columns are simple types like INT, TEXT and TIMESTAMP. Basically, given a table like this:

CREATE TEMPORARY TABLE column_test
(
    foo TEXT,
    bar INT,
    baz BOOLEAN,
    bat TIMESTAMP WITH TIME ZONE
);

INSERT INTO column_test (
    foo, bar, baz, bat
) VALUES
    ('lorem ipsum', 1, TRUE, '2000-01-01 UTC'),
    ('lorem ipsum', 2, TRUE, '2000-01-01 UTC'),
    ('other', 3, TRUE, '2000-01-01 UTC');

is it possible to write a query which will select only the foo and bar columns, since those are the only columns with more than one value? Something like this:

SELECT columns_with_more_than_one_value(*) FROM column_test;
foo           bar
'lorem ipsum' 1
'lorem ipsum' 2
'other'       3

I guess one way would be to SELECT *, transpose the result, remove any rows where all the values are the same, then transpose again, but that would be very complex and I don't know how I would keep the original column names in the result.

is this maybe a job for whatever is processing the data - select all the candidate columns then write a non-sql script to process the results to filter out those columns? — micsthepick, May 04 '20 at 00:14
@micsthepick That would be a last resort, since it would not integrate nicely with my IDE and would make casual exploration 10x slower. — l0b0, May 04 '20 at 00:19
A standard SQL statement must return a *fixed* number of columns. So what you ask for basically requires dynamic SQL, which makes the task much more oomplex. — GMB, May 04 '20 at 00:19
@GMB Well, it is technically a fixed number of columns, except the set of columns isn't known until some of the data has been inspected. And the complexity is why I asked :) — l0b0, May 04 '20 at 00:21
@l0b0: that's the point. Standard SQL requires the number of columns to be known before the statement is executed. — GMB, May 04 '20 at 00:22
Oh, it doesn't have to be at all portable, if that's any help. — l0b0, May 04 '20 at 00:32

score 1 · Answer 1 · answered May 04 '20 at 14:31

Basically you cannot select unknown columns from a table. A query result must have a structure defined before it is executed. What you can do is to create a (temporary) view that contains expected columns. The function below does the job, extensively using dynamic SQL. The first argument of the function is a table name, the second - the name of the temporary view to be created.

create or replace function create_view_with_distinct_columns(text, text)
returns void language plpgsql as $$
declare
    col text;
    ct int;
    list text = '';
begin
    for col in 
        execute format('
            select attname
            from pg_attribute
            where attrelid = %s
            and attnum > 0',
            $1::regclass::oid)
    loop
        execute format('
            select count(distinct %I)
            from %I',
            col, $1)
        into ct;
        if ct > 1 then
            list:= format('%s%s,', list, col);
        end if;
    end loop;
    execute format('
        create temp view %I as
        select %s
        from %I',
        $2, left(list, -1), $1);
end $$;

Use:

select create_view_with_distinct_columns('column_test', 'column_view');

select * from column_view;

     foo     | bar
-------------+-----
 lorem ipsum |   1
 lorem ipsum |   2
 other       |   3
(3 rows)

Db<>fiddle.

How to select only *columns* with more than one distinct value?

1 Answers1

How to select only columns with more than one distinct value?