17

Why we cannot use boolean values in aggregate functions without casting to some integer type first? In many cases it makes perfect sense to calculate sum, average or correlation from columns of boolean data type.

Consider the following example where boolean input has to be always casted to int in order to make it work:

select
   sum(boolinput::int),
   avg(boolinput::int),
   max(boolinput::int),
   min(boolinput::int),
   stddev(boolinput::int),
   corr(boolinput::int,boolinputb::int)   
from
   (select 
      (random() > .5)::boolean as boolinput,
      (random() > .5)::boolean as boolinputB 
    from 
      generate_series(1,100)
   ) a

From PostgreSQL documentation:

Valid literal values for the "true" state are: TRUE 't' 'true' 'y' 'yes' 'on' '1'

For the "false" state, the following values can be used: FALSE 'f' 'false' 'n' 'no' 'off' '0'

Because by definition TRUE equals 1 and FALSE equals 0 I do not understand why casting is necessary.

Allowing boolean in aggregation would have also interesting side effects - we can for example simplify many case statements:

Current version (clean and easy to understand):

select sum(case when gs > 50 then 1 else 0 end) from generate_series(1,100) gs;

Using old fashioned casting operator :::

select sum((gs > 50)::int) from generate_series(1,100) gs;

Direct aggregation of boolean values (not working currently):

select sum(gs > 50) from generate_series(1,100) gs;

Is direct aggregation of boolean values possible in other DBMSs? Why this is not possible in PostgreSQL?

Tomas Greif
  • 21,685
  • 23
  • 106
  • 155
  • 2
    "*Because by definition TRUE equals 1 and FALSE equals 0*" that might be true for a database like MySQL which doesn't actually have a `boolean` data type (and a very "relaxed" datatype checking as well) but is certainly ***not*** true for Postgres (or any DBMS or programming language that has a *real* `boolean` datatype) –  Nov 29 '13 at 08:58
  • 1
    @a_horse_with_no_name I can understand that technically this is not the same, but still there is a fixed build-in link between `boolean` and `int` in PostgreSQL: `select true::int = 1::int and false::int = 0::int;` – Tomas Greif Nov 29 '13 at 09:06
  • 2
    Those are pre-defined rules on how to *convert* one value into another. It's the same thing why you can cast the string `'1'` to the number `1`. Are you also surprised that you can not `sum()` on a `varchar` column that only contains strings that are valid numbers? –  Nov 29 '13 at 09:09
  • @a_horse_with_no_name Good point. I am surprised that it does not work for boolean specifically. In some statistical packages it is easy to aggregate boolean values (in R: `mean(c(TRUE,FALSE,TRUE,FALSE,TRUE));`). – Tomas Greif Nov 29 '13 at 09:22
  • You can easily writer your own boolean aggregate that does not need casting. I guess another reason it's not defined is that some people might want a logical `OR` as the "aggregation" function, others would expect an `AND` and maybe someone even expects an `XOR` –  Nov 29 '13 at 09:25
  • 1
    A boolean type is _not_ a numeric type. For integers `1+1:=2`. For booleans: `True + True := **Nonsense**`. Instead of the `+` operator, there are `and` and `or` for booleans. And you can _count_ the number of tuples for which a boolean column is `True` or `False`. And that will yield a numeric type. – wildplasser Dec 01 '13 at 16:32
  • `'1'` is a valid literal for `TRUE`, but if you just write `'1'`, how PostgreSQL must determine type of this? `'1'` is a valid `text` also, may be you mean text? That's why you need to cast - to declare type explicitly. – Evgeny Nozdrev Oct 10 '18 at 13:14

5 Answers5

13

Because by definition TRUE equals 1 and FALSE equals 0 I do not understand why casting is necessary.

Per the docs you have quoted in your question, a boolean is not, by definition, 1 for TRUE and 0 for FALSE. It's not true in C either, where TRUE is anything non-zero.

For that matter, nor is it for languages that mimic C in this respect, of which there are many. Nor is it for languages such as Ruby, where anything non-Nil/non-False evaluates to True, including zero and empty strings. Nor is it for POSIX shell and variations thereof, where testing a return code yields TRUE if it is zero, and FALSE for anything non-zero.

Point is, a boolean is a boolean, with all sorts of colorful implementation details from a platform to the next; not an integer.

It's unclear how you were expecting Postgres to average true/false values. I'm suspicious that many if any platform will yield a result for that.

Even summing booleans is awkward: would expecting Postgres to OR the input values, or to count TRUE values?

At any rate, there are some boolean aggregate functions, namely bool_or() and bool_and(). These replace the more standard any() and some(). The reason Postgres deviates from the standard here is due to potential ambiguity. Per the docs:

SELECT b1 = ANY((SELECT b2 FROM t2 ...)) FROM t1 ...;

Here ANY can be considered either as introducing a subquery, or as being an aggregate function, if the subquery returns one row with a Boolean value.

http://www.postgresql.org/docs/current/static/functions-aggregate.html

Denis de Bernardy
  • 75,850
  • 13
  • 131
  • 154
  • Once there is a rule how to convert `boolean` to `int` (and there is such one in PostgreSQL) then there is no issue with aggregate functions definition, even with average, correlation and any other function. Some tools (not DBMSs) can work with boolean type this way (e.g. R). – Tomas Greif Nov 29 '13 at 09:26
3

Here are some possibilities

select max(c::int)::boolean, min(c::int)::boolean, bool_or(c) as max_b,bool_and(c) as min_b from
(
        select false as c
  union select true
  union select null
) t
Sergey Mirvoda
  • 3,209
  • 2
  • 26
  • 30
3

Here is how one can achieve max(boolean)

CREATE AGGREGATE max(boolean) (
  SFUNC=boolor_statefunc,
  STYPE=bool,
  SORTOP=">"
);  

where "boolor_statefunc" is built in function

big
  • 1,888
  • 7
  • 28
  • 48
2

Another option is ARRAY_AGG:

SELECT 
  id,
  true = ANY(ARRAY_AGG(flag)) AS flag
FROM my_table
GROUP BY id
sdgfsdh
  • 33,689
  • 26
  • 132
  • 245
  • 1
    This is genius. I'm curious why the `true =` is necessary since you would not see it as required in any other boolean context I'm aware of. It certainly does not work without it, but I don't understand why. – Hambone Nov 03 '22 at 14:50
0

To sum boolean values, I have created the following custom aggregate function:

create or replace function badd (bigint, boolean)
  returns bigint as
$body$
select $1 + case when true then 1 else 0 end;
$body$ language sql;

create aggregate sum(boolean) (
  sfunc=badd,
  stype=int8,
  initcond='0'
);

Now I can easily sum boolean values or count rows meeting specific condition:

with test (a, b, c) as (
   values
      ('true'::boolean,'a'::varchar, 'd'::text),
      ('true'::boolean,'a'::varchar, 'e'::text),      
      ('false'::boolean,'a'::varchar, 'f'::text),
      ('true'::boolean,'b'::varchar, 'd'::text),
      ('false'::boolean,'b'::varchar, 'd'::text),
      ('true'::boolean,'c'::varchar, 'f'::text),                
      (NULL,'c'::varchar,'d')      
    ) 
select 
   b,
   bsum(a) as sum, -- sum boolean value (TRUE=1, FALSE=0)
   bsum(c = 'd') as dsum -- counts all rows where column c equals to value 'd'
from 
   test
group by
   b
Tomas Greif
  • 21,685
  • 23
  • 106
  • 155