Postgres C Function - Aggregate, Variadic, Any number of any type of argument

Question

I would like to write an aggregate function using C. The aggregate function will be used as:

select f(rc, col1, col2,..., coln) from table

The sfunc would then have signature (ignoring types for the moment)

f(_state, rc, col1, col2,..., coln)

The interpretation is as follows: I want the function to either return rc or _state.rc depending on the current _state. In pseudocode the function would appear as follows:

f(_state, rc, col1, col2,..., coln)
{
  nargs = extract_variadic_args(fcinfo, 0, false, &args, &types, &nulls);

  for (i = 0; i < nargs; i++) {
    if (_state.coli > coli) return (_state.rc, _state.col1, ..., _state.colnargs)
    else if (_state.coli == coli) continue;
    else return (rc, col1, ..., colnargs);
  }

  return (rc, col1, ..., colnargs);
}

Essentially, the aggregate is a fold that returns the value of rc associated to the maximum value of some order defined by col1, ..., coln.

For example, given some table

T(rc text, col1 int, col2 real, col3 timestamp)

  T('Bob', 1, 2.0, 'May 30 2020')
  T('Tom', 2, 1.0, 'May 10 2020')
  T('Rich', 3, 1.0, 'Apr 01 2020')
  T('Bill', 3, 2.0, 'Apr 02 2020')
  T('Andy', 3, 2.0, 'Apr 03 2020')

  select f(rc, col1, col2,..., coln) res from T

  res
  ------
  'Andy'

The reason Andy is returned is because it has the highest value in the total ordering imposed by (col1, ..., coln).

This function cannot be written in PGSQL (as far as I can tell) since anyelement requires all usages to refer to the same type.

However, I believe this can be done in C as witnessed by the implementation of json_build_object. See https://doxygen.postgresql.org/json_8c.html#aeb269f2659c7bdfc734428c2cd211a4e

The json_build_object function takes any number of arguments of any type.

For the function, f, that I am proposing, however, there is an additional complication. The first argument to an sfunc in an aggregate is a type used to maintain the state of the fold (an stype). The stype will have to be able to hold an arbitrary number of arbitrarily typed values to be compared to the incoming arguments (col1, ..., coln).

Basically, stype would be something like

  CREATE TYPE priority_type AS (
    rc  anyelement, 
    col1 any 
    ... 
    coln any
  );

With f defined to match as

f(_state priority_type, rc anyelement, col1 any, ..., coln any) returns anyelement

Based on json_build_object, I believe this can be done, but I am very unclear about how to handle the priority_type, and also unclear about how to use PG_RETURN_DATUM to return a value of a type that I don't know until I inspect the rc argument's type. In other words, in all examples I have seen, the PG_RETURN_* seems to know the type it needs to return. In some cases, there may be a need to construct a value based on some other type (e.g. the type of rc) and then return that constructed value. Since we don't know the type we are returning until runtime, I am unclear how to construct and return that type without including a case for every possible type of rc. For the function I am proposing here we may be able to get around the need to construct a new value since we are either returning rc or _state.rc, but that depends on how the priority_type is ultimately constructed.

I am new to postgres C functions, having only written them once or twice, and this seems to be a very advanced use case. If anyone can put together a working implementation that would be amazing.

EDIT (Based on answer by Laurenz Albe): Question Still Open

Laurenz, thank you very much for the answer you provided. I found it very interesting and I took some time to fully understand it and to expand it to see if it could be used in my system. I am including the full code here for other readers that may find it useful. Unfortunately, I am not able to use it for myself. I believe I still need a C function, and I will explain exactly why after the included code. As an aside, I do not think the C function needs the column names passed in as strings. Appealing again to json_build_object, we can see that the variable number of variably typed columns are not passed as strings; and, that is what I believe is required in my case.

create table test_t(rc text, col1 int, col2 real, col3 timestamp);
insert into test_t values
  ('Bob', 1, 2.0, 'May 30 2020'),
  ('Tom', 2, 1.0, 'May 10 2020'),
  ('Rich', 3, 1.0, 'Apr 01 2020'),
  ('Bill', 3, 2.0, 'Apr 02 2020'),
  ('Andy', 3, 2.0, 'Apr 03 2020');

-- See: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=c179acbdbc2f0a52f0d5f81a9a9266c7
create or replace function util.map_array(text, anyarray, out anyarray) returns anyarray language plpgsql as $$
begin
--   select
--   util.map_array('abs(#)', array[-1,-3,-2]),
--   'reverse(#)' -< '{abc,def}'::text[],
--   '''t1.''||#' -< '{abc,def}'::text[],
--   'replace(#, ''a'', ''x'')' -< '{aba,bab}'::text[];
  execute format('select array(select %s)', replace($1, '#', 'unnest($1)')) into $3 using $2;
  return;
end $$;

create or replace function util.map_string(text, text, out text) returns text language plpgsql as $$
begin
--   select
--   'reverse(#)' -< 'abc,def',
--   '''t1.''||#' -< 'abc,def',
--   'replace(#, ''a'', ''x'')' -< 'abc,def';
  execute format('select array_to_string(array(select %s), '','')',
    replace($1, '#', 'unnest(string_to_array($1,'',''))')) into $3 using $2;
  return;
end $$;

create operator -< (procedure = util.map_array, leftarg = text, rightarg = anyarray);
create operator -< (procedure = util.map_string, leftarg = text, rightarg = text);

CREATE or replace FUNCTION util.max_by_cols_withproc(_state anyelement, cr anyelement, proc regprocedure, cols text) RETURNS anyelement
   LANGUAGE plpgsql AS
$$
DECLARE
   r boolean;
BEGIN
   EXECUTE format('SELECT %s($1, $2, $3)', proc::regproc) INTO r
      USING _state, cr, cols;

   IF NOT r THEN
      RETURN _state;
   ELSE
      RETURN cr;
   END IF;
END;
$$;

CREATE or replace FUNCTION util.max_by_cols(_state anyelement, cr anyelement, cols text) RETURNS anyelement
   LANGUAGE plpgsql AS
$$
DECLARE
   r boolean;
BEGIN
   EXECUTE format('SELECT %s($1, $2, $3)', 'util.compare_by_cols'::regproc) INTO r
      USING _state, cr, cols;

   IF NOT r THEN
      RETURN _state;
   ELSE
      RETURN cr;
   END IF;
END;
$$;

CREATE AGGREGATE util.max_by_cols(anyelement, regprocedure, text) (
   SFUNC = util.max_by_cols_withproc,
   STYPE = anyelement
);

CREATE AGGREGATE util.max_by_cols(anyelement, text) (
   SFUNC = util.max_by_cols,
   STYPE = anyelement
);

CREATE or replace FUNCTION util.compare_by_cols(t1 anyelement, t2 anyelement, cols text) RETURNS boolean
   LANGUAGE plpgsql IMMUTABLE STRICT AS
$$
DECLARE
   lhs text;
   rhs text;
   r boolean;
BEGIN
    SELECT '''$1.''||#' -< cols INTO lhs;
    SELECT '''$2.''||#' -< cols INTO rhs;
    EXECUTE format('SELECT (%1$s) < (%2$s)', lhs, rhs) INTO r USING t1, t2;
    RETURN r;
END;
$$;

select (util.max_by_cols(x, 'util.compare_by_cols'::regproc, 'col1,col2,col3')).rc FROM test_t x;
select (util.max_by_cols(test_t, 'col1,col2,col3')).rc FROM test_t;
select (util.max_by_cols(x, 'col1,col2,col3')).rc FROM test_t x join test_t y on x.rc=y.rc;

While the code above may work for some cases, it has some very specific limitations.

The shape of the stype is forced to be the same shape as the row in the table. If we were folding in such a way that additional state properties were required this would not suffice. For example, if I wanted to pass a constant along (say the id of some prioritization list in a priority table), I would be forced to store this id in the table row to get the required shape, even though I don't want it to be stored in the underlying table.
The first argument to the sfunc, util.max_by_cols(), is the stype, but we are using the table name to populate that argument. Looking at the third example above where I join test_t x to test_t y, it is clear that I must use the alias of one of the tables (I used x in the example above). However, what I require is to be able to pass columns from both tables, x and y, in my use case. To be specific, table x is my main table, and table y holds the prioritization order for the attribute I am using to prioritize in table x. In other words I need:

select (util.max_by_cols(x join y, 'x.col1,x.col2,y.col3')).rc FROM test_t x join test_t y on x.rc=y.rc

That is, the shape of my stype is some subset of the columns of all of the tables being joined, plus any arbitrary items required for maintaining state during the fold.

In general, the function should work in the following context:

select
  t1.a,  -- the group by key(s)
  sum(t1.b),  -- aggregating something in t1 or t2
  f(t1.c, t2.d, t1.e, t1.f)  -- our new function, f, picking t1.c
                             -- using ordering of (t2.d, t1.e, t1.f)
from t1 join t2 on t1.da = t2.da   -- exactly one record in t2 matches
group by t1.a

In summary, the sql solution has two primary problems, it forces the shape of the stype (which limits the flexibility of the fold), and it limits the shape of the stype to columns from only a single table. It is true that the second limitation can be worked around with subqueries, but I am in an environment where the structure of the query is already programmatically generated and I would like to avoid changing that structure for this use case. Regardless, the limitation on the shape of the stype being equal to the shape of the row, is still problematic. I will fiddle more, but I would really love to see the C solution. Ultimately, seeing how this could be done in C would likely open up a world of more interesting possibilities in the future.

You'd have to pass column names as strings. I'd approach this differently: have one argument of your aggregate be a `regprocedure` that is a comparison function for rows of the table. Use that in your aggregate to find the maximum row. — Laurenz Albe, Sep 16 '21 at 02:30

score 0 · Answer 1 · answered Sep 16 '21 at 06:10

0

There is no need for a C function, just create your own aggregate using a custom comparison function:

CREATE FUNCTION sfunc(anyelement, anyelement, regprocedure) RETURNS anyelement
   LANGUAGE plpgsql AS
$$DECLARE
   r boolean;
BEGIN
   EXECUTE format('SELECT %s($1, $2)', $3::regproc) INTO r
      USING $1, $2;

   IF NOT r THEN
      RETURN $1;
   ELSE
      RETURN $2;
   END IF;
END;$$;

CREATE AGGREGATE mymax(anyelement, regprocedure) (
   SFUNC = sfunc,
   STYPE = anyelement
);

Then define a custom comparison function for t:

CREATE FUNCTION t_lt(t1 t, t2 t) RETURNS boolean
   LANGUAGE sql IMMUTABLE STRICT AS
'SELECT ROW(t1.col1, t1.col2, t1.col3) < ROW(t2.col1, t2.col2, t2.col3)';

Now you can use the aggregate:

SELECT (mymax(t, 't_lt'::regproc)).rc FROM t;
  rc  
══════
 Andy
(1 row)

answered Sep 16 '21 at 06:10

Laurenz Albe

209,280
17
206
263

Laurenz, thank you for your excellent answer. I have edited the original question to include your advice, but the original question still remains. As I have described above, the sql solution has limitations that prevent its general use, while I think the C solution would bypass these issues. Thank you again. – docjosh Sep 16 '21 at 17:35
For your first point: you can use a different `STYPE`. I just didn't because it was not necessary for the problem in your question. I freely admit that your second point is too complicated for me to understand. I won't keep you from writing C functions, I just wanted to point out that you can do more than you think without that. Your original question asks "If anyone can put together a working implementation that would be amazing", and that clearly is out of scope for a Stackoverflow answer. – Laurenz Albe Sep 16 '21 at 18:26
Thank you again for the comments. If I were to use a different stype, I would have to create that stype per table. The idea is to be able to use an stype that has a fixed structure at the beginning and variadic at the end, as in CREATE TYPE priority_type AS ( rc anyelement, col1 any ... coln any). I could add fixed columns to act as parameters to the beginning, but the end is still variadic in the columns of whichever table I choose. The second point is simply that your solution does not seem to allow me to use the columns of a joined table without an additional subquery. – docjosh Sep 16 '21 at 18:55

Postgres C Function - Aggregate, Variadic, Any number of any type of argument

1 Answers1

Linked