I would like to write an aggregate function using C. The aggregate function will be used as:
select f(rc, col1, col2,..., coln) from table
The sfunc would then have signature (ignoring types for the moment)
f(_state, rc, col1, col2,..., coln)
The interpretation is as follows: I want the function to either return rc or _state.rc depending on the current _state. In pseudocode the function would appear as follows:
f(_state, rc, col1, col2,..., coln)
{
nargs = extract_variadic_args(fcinfo, 0, false, &args, &types, &nulls);
for (i = 0; i < nargs; i++) {
if (_state.coli > coli) return (_state.rc, _state.col1, ..., _state.colnargs)
else if (_state.coli == coli) continue;
else return (rc, col1, ..., colnargs);
}
return (rc, col1, ..., colnargs);
}
Essentially, the aggregate is a fold that returns the value of rc associated to the maximum value of some order defined by col1, ..., coln.
For example, given some table
T(rc text, col1 int, col2 real, col3 timestamp)
T('Bob', 1, 2.0, 'May 30 2020')
T('Tom', 2, 1.0, 'May 10 2020')
T('Rich', 3, 1.0, 'Apr 01 2020')
T('Bill', 3, 2.0, 'Apr 02 2020')
T('Andy', 3, 2.0, 'Apr 03 2020')
select f(rc, col1, col2,..., coln) res from T
res
------
'Andy'
The reason Andy is returned is because it has the highest value in the total ordering imposed by (col1, ..., coln).
This function cannot be written in PGSQL (as far as I can tell) since anyelement requires all usages to refer to the same type.
However, I believe this can be done in C as witnessed by the implementation of json_build_object. See https://doxygen.postgresql.org/json_8c.html#aeb269f2659c7bdfc734428c2cd211a4e
The json_build_object function takes any number of arguments of any type.
For the function, f, that I am proposing, however, there is an additional complication. The first argument to an sfunc in an aggregate is a type used to maintain the state of the fold (an stype). The stype will have to be able to hold an arbitrary number of arbitrarily typed values to be compared to the incoming arguments (col1, ..., coln).
Basically, stype would be something like
CREATE TYPE priority_type AS (
rc anyelement,
col1 any
...
coln any
);
With f defined to match as
f(_state priority_type, rc anyelement, col1 any, ..., coln any) returns anyelement
Based on json_build_object, I believe this can be done, but I am very unclear about how to handle the priority_type, and also unclear about how to use PG_RETURN_DATUM to return a value of a type that I don't know until I inspect the rc argument's type. In other words, in all examples I have seen, the PG_RETURN_* seems to know the type it needs to return. In some cases, there may be a need to construct a value based on some other type (e.g. the type of rc) and then return that constructed value. Since we don't know the type we are returning until runtime, I am unclear how to construct and return that type without including a case for every possible type of rc. For the function I am proposing here we may be able to get around the need to construct a new value since we are either returning rc or _state.rc, but that depends on how the priority_type is ultimately constructed.
I am new to postgres C functions, having only written them once or twice, and this seems to be a very advanced use case. If anyone can put together a working implementation that would be amazing.
EDIT (Based on answer by Laurenz Albe): Question Still Open
Laurenz, thank you very much for the answer you provided. I found it very interesting and I took some time to fully understand it and to expand it to see if it could be used in my system. I am including the full code here for other readers that may find it useful. Unfortunately, I am not able to use it for myself. I believe I still need a C function, and I will explain exactly why after the included code. As an aside, I do not think the C function needs the column names passed in as strings. Appealing again to json_build_object, we can see that the variable number of variably typed columns are not passed as strings; and, that is what I believe is required in my case.
create table test_t(rc text, col1 int, col2 real, col3 timestamp);
insert into test_t values
('Bob', 1, 2.0, 'May 30 2020'),
('Tom', 2, 1.0, 'May 10 2020'),
('Rich', 3, 1.0, 'Apr 01 2020'),
('Bill', 3, 2.0, 'Apr 02 2020'),
('Andy', 3, 2.0, 'Apr 03 2020');
-- See: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=c179acbdbc2f0a52f0d5f81a9a9266c7
create or replace function util.map_array(text, anyarray, out anyarray) returns anyarray language plpgsql as $$
begin
-- select
-- util.map_array('abs(#)', array[-1,-3,-2]),
-- 'reverse(#)' -< '{abc,def}'::text[],
-- '''t1.''||#' -< '{abc,def}'::text[],
-- 'replace(#, ''a'', ''x'')' -< '{aba,bab}'::text[];
execute format('select array(select %s)', replace($1, '#', 'unnest($1)')) into $3 using $2;
return;
end $$;
create or replace function util.map_string(text, text, out text) returns text language plpgsql as $$
begin
-- select
-- 'reverse(#)' -< 'abc,def',
-- '''t1.''||#' -< 'abc,def',
-- 'replace(#, ''a'', ''x'')' -< 'abc,def';
execute format('select array_to_string(array(select %s), '','')',
replace($1, '#', 'unnest(string_to_array($1,'',''))')) into $3 using $2;
return;
end $$;
create operator -< (procedure = util.map_array, leftarg = text, rightarg = anyarray);
create operator -< (procedure = util.map_string, leftarg = text, rightarg = text);
CREATE or replace FUNCTION util.max_by_cols_withproc(_state anyelement, cr anyelement, proc regprocedure, cols text) RETURNS anyelement
LANGUAGE plpgsql AS
$$
DECLARE
r boolean;
BEGIN
EXECUTE format('SELECT %s($1, $2, $3)', proc::regproc) INTO r
USING _state, cr, cols;
IF NOT r THEN
RETURN _state;
ELSE
RETURN cr;
END IF;
END;
$$;
CREATE or replace FUNCTION util.max_by_cols(_state anyelement, cr anyelement, cols text) RETURNS anyelement
LANGUAGE plpgsql AS
$$
DECLARE
r boolean;
BEGIN
EXECUTE format('SELECT %s($1, $2, $3)', 'util.compare_by_cols'::regproc) INTO r
USING _state, cr, cols;
IF NOT r THEN
RETURN _state;
ELSE
RETURN cr;
END IF;
END;
$$;
CREATE AGGREGATE util.max_by_cols(anyelement, regprocedure, text) (
SFUNC = util.max_by_cols_withproc,
STYPE = anyelement
);
CREATE AGGREGATE util.max_by_cols(anyelement, text) (
SFUNC = util.max_by_cols,
STYPE = anyelement
);
CREATE or replace FUNCTION util.compare_by_cols(t1 anyelement, t2 anyelement, cols text) RETURNS boolean
LANGUAGE plpgsql IMMUTABLE STRICT AS
$$
DECLARE
lhs text;
rhs text;
r boolean;
BEGIN
SELECT '''$1.''||#' -< cols INTO lhs;
SELECT '''$2.''||#' -< cols INTO rhs;
EXECUTE format('SELECT (%1$s) < (%2$s)', lhs, rhs) INTO r USING t1, t2;
RETURN r;
END;
$$;
select (util.max_by_cols(x, 'util.compare_by_cols'::regproc, 'col1,col2,col3')).rc FROM test_t x;
select (util.max_by_cols(test_t, 'col1,col2,col3')).rc FROM test_t;
select (util.max_by_cols(x, 'col1,col2,col3')).rc FROM test_t x join test_t y on x.rc=y.rc;
While the code above may work for some cases, it has some very specific limitations.
The shape of the stype is forced to be the same shape as the row in the table. If we were folding in such a way that additional state properties were required this would not suffice. For example, if I wanted to pass a constant along (say the id of some prioritization list in a priority table), I would be forced to store this id in the table row to get the required shape, even though I don't want it to be stored in the underlying table.
The first argument to the sfunc, util.max_by_cols(), is the stype, but we are using the table name to populate that argument. Looking at the third example above where I join test_t x to test_t y, it is clear that I must use the alias of one of the tables (I used x in the example above). However, what I require is to be able to pass columns from both tables, x and y, in my use case. To be specific, table x is my main table, and table y holds the prioritization order for the attribute I am using to prioritize in table x. In other words I need:
select (util.max_by_cols(x join y, 'x.col1,x.col2,y.col3')).rc FROM test_t x join test_t y on x.rc=y.rc
That is, the shape of my stype is some subset of the columns of all of the tables being joined, plus any arbitrary items required for maintaining state during the fold.
In general, the function should work in the following context:
select
t1.a, -- the group by key(s)
sum(t1.b), -- aggregating something in t1 or t2
f(t1.c, t2.d, t1.e, t1.f) -- our new function, f, picking t1.c
-- using ordering of (t2.d, t1.e, t1.f)
from t1 join t2 on t1.da = t2.da -- exactly one record in t2 matches
group by t1.a
In summary, the sql solution has two primary problems, it forces the shape of the stype (which limits the flexibility of the fold), and it limits the shape of the stype to columns from only a single table. It is true that the second limitation can be worked around with subqueries, but I am in an environment where the structure of the query is already programmatically generated and I would like to avoid changing that structure for this use case. Regardless, the limitation on the shape of the stype being equal to the shape of the row, is still problematic. I will fiddle more, but I would really love to see the C solution. Ultimately, seeing how this could be done in C would likely open up a world of more interesting possibilities in the future.