Condensing arrays in Presto

Question

I have a query that produces strings of arrays using they array_agg() function

SELECT 
array_agg(message) as sequence
from mytable
group by id

which produces a table that looks like this:

                 sequence
1 foo foo bar baz bar baz
2     foo bar bar bar baz
3 foo foo foo bar bar baz

but I aim to condense the array of strings so that none can repeat more than once in a row, for example, the desired output would look like this:

    sequence
1 foo bar baz bar baz
2 foo bar baz
3 foo bar baz

How would one go about doing this with Presto SQL ?

Martin Traverso · Accepted Answer · 2021-10-12T18:21:09.513

You can do this in one of two ways:

Remove duplicates from the resulting arrays using the array_distinct function:

WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
  (3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_distinct(array_agg(message)) AS sequence
FROM mytable
GROUP BY id

Use the DISTINCT qualifier in the aggregation to remove the duplicate values before they are passed into array_agg.

WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'bar'), (1, 'baz'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'), (3, 'foo'),
  (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(DISTINCT message) AS sequence
FROM mytable
GROUP BY id

Both alternatives produce the same result:

    sequence
-----------------
 [foo, bar, baz]
 [foo, bar, baz]
 [foo, bar, baz]
(3 rows)

UPDATE: You can remove repeated sequences of elements with the recently introduced MATCH_RECOGNIZE feature:

WITH mytable(id, message) AS (VALUES
  (1, 'foo'), (1, 'foo'), (1, 'bar'), (1, 'baz'), (1, 'bar'), (1, 'baz'),
  (2, 'foo'), (2, 'bar'), (2, 'bar'), (2, 'bar'), (2, 'baz'),
  (3, 'foo'), (3, 'foo'), (3, 'foo'), (3, 'bar'), (3, 'bar'), (3, 'baz')
)
SELECT array_agg(value) AS sequence
FROM mytable
 MATCH_RECOGNIZE(
    PARTITION BY id
    MEASURES A.message AS value
    PATTERN (A B*)
    DEFINE B AS message = PREV(message)
)
GROUP BY id

Thanks for the answer, but I guess I didn't mention an important point (that I do not need distinct values, but rather non-repeating consecutive values). I modified the first row of my data to illustrate this point — iskandarblue, May 28 '19 at 23:44
A third way for the first part of the answer: you can use `set_agg()` to get the same results. I.e: `SELECT set_agg(message) AS sequence` — fredibeni, Oct 25 '22 at 17:45

Condensing arrays in Presto

1 Answers1