
I'm working with event-sourced data where all of the important fields are combined into a single JSONB column, and most of the keys are missing from most of the rows.

I want to get:

  1. The aggregated (concatenated) values of all arrays in the JSONB field (see ingredients in the example)
  2. The latest non-null value of each scalar key, according to the timestamp
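To make the desired merge semantics concrete, here is a small Python sketch (a hypothetical helper, not part of the actual system) of what the query should compute per identifier: array values are concatenated across events, and every other key takes its latest non-null value.

```python
# Events for identifier '12345', as (timestamp, event_data) pairs.
events = [
    ("2019-01-01T00:00:00Z", {"target": "99999999"}),
    ("2019-01-01T12:00:00Z", {"ingredients": ["Banana", "Strawberry"]}),
    ("2019-01-03T00:00:00Z", {"target": "12345678", "user": "peterpan"}),
    ("2019-01-04T00:00:00Z", {"ingredients": ["Melon"], "user": "robinhood"}),
]

def merge_events(events):
    """Concatenate array values; keep the latest non-null value per scalar key."""
    merged = {"ingredients": []}
    for ts, data in sorted(events):  # ISO-8601 strings sort chronologically
        for key, value in data.items():
            if isinstance(value, list):
                merged["ingredients"].extend(value)
            elif value is not None:
                merged[key] = value  # later events overwrite earlier ones
    merged["latest_update"] = max(ts for ts, _ in events)
    return merged

print(merge_events(events))
```

This is only a reference model of the semantics; the question is how to express the same thing efficiently in one PostgreSQL query.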

I tried this on my own and was able to produce an example which generates exactly what I want to achieve here, but it looks rather ugly, and I'm wondering how to improve the following query.

Schema (PostgreSQL v15)

CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    identifier VARCHAR(255),
    timestamp TIMESTAMP WITH TIME ZONE,
    event_data JSONB
);
INSERT INTO events (identifier, timestamp, event_data)
VALUES
    ('12345', '2019-01-01T00:00:00.000Z', '{"target": "99999999"}'),
    ('12345', '2019-01-01T12:00:00.000Z', '{"ingredients": ["Banana", "Strawberry"]}'),
    ('12345', '2019-01-03T00:00:00.000Z', '{"target": "12345678", "user": "peterpan"}'),
    ('12345', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Melon"], "user": "robinhood"}'),
    ('67890', '2019-01-03T00:00:00.000Z', '{"target": "0000", "user": "mickeymouse"}'),
    ('67890', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Potato"]}');

Query #1

WITH events_flattened AS (
    SELECT
        identifier,
        timestamp,
        event_data->>'target' AS target,
        event_data->>'user' AS "user", -- quoted: USER is a reserved word in PostgreSQL
        elem.part
    FROM events
    LEFT JOIN LATERAL jsonb_array_elements(event_data->'ingredients') elem(part) ON true
    ORDER BY timestamp DESC
)

SELECT
    identifier,
    (ARRAY_REMOVE(ARRAY_AGG(e.target), NULL))[1] AS target,
    (ARRAY_REMOVE(ARRAY_AGG(e."user"), NULL))[1] AS "user",
    ARRAY_REMOVE(ARRAY_AGG(part),NULL) as ingredients,
    MAX(timestamp) as latest_update
FROM events_flattened e
GROUP BY identifier;

For an answer to be helpful it should produce exactly the same result as shown in this table:

identifier  target    user         ingredients              latest_update
12345       12345678  robinhood    Melon,Strawberry,Banana  2019-01-04T00:00:00.000Z
67890       0000      mickeymouse  Potato                   2019-01-04T00:00:00.000Z


I'm trying to determine which kind of query and which indexes would be beneficial for getting exactly this kind of data out of this table.


2 Answers

One approach is to split the latest-value lookups into separate `DISTINCT ON` CTEs (the sample data below uses the key `us` instead of `user`, which also sidesteps PostgreSQL's reserved word):
CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    identifier VARCHAR(255),
    timestamp TIMESTAMP WITH TIME ZONE,
    event_data JSONB
);
INSERT INTO events (identifier, timestamp, event_data)
VALUES
    ('12345', '2019-01-01T00:00:00.000Z', '{"target": "99999999"}'),
    ('12345', '2019-01-01T12:00:00.000Z', '{"ingredients": ["Banana", "Strawberry"]}'),
    ('12345', '2019-01-03T00:00:00.000Z', '{"target": "12345678", "us": "peterpan"}'),
    ('12345', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Melon"], "us": "robinhood"}'),
    ('67890', '2019-01-03T00:00:00.000Z', '{"target": "0000", "us": "mickeymouse"}'),
    ('67890', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Potato"]}');
WITH events_flattened AS (
    SELECT
        identifier,
        timestamp,
        event_data->>'target' AS target,
        event_data->>'us' AS us,
        elem.part
    FROM events
    LEFT JOIN LATERAL jsonb_array_elements(event_data->'ingredients') elem(part) ON true
),

latest_target AS (
    SELECT DISTINCT ON (identifier) 
        identifier, 
        target
    FROM events_flattened
    WHERE target IS NOT NULL
    ORDER BY identifier, timestamp DESC
),

latest_user AS (
    SELECT DISTINCT ON (identifier) 
        identifier, 
        us
    FROM events_flattened
    WHERE us IS NOT NULL
    ORDER BY identifier, timestamp DESC
)

SELECT 
    e.identifier,
    t.target,
    u.us,
    ARRAY_REMOVE(ARRAY_AGG(e.part), NULL) AS ingredients,
    MAX(e.timestamp) AS latest_update
FROM events_flattened e
LEFT JOIN latest_target t ON e.identifier = t.identifier
LEFT JOIN latest_user u ON e.identifier = u.identifier
GROUP BY e.identifier, t.target, u.us;

identifier  target    us           ingredients                      latest_update
12345       12345678  robinhood    {"Banana","Strawberry","Melon"}  2019-01-04 00:00:00+00
67890       0000      mickeymouse  {"Potato"}                       2019-01-04 00:00:00+00


  • Nice! Now your query works properly! I decided not to use the `LEFT JOIN LATERAL` and you can see the final query I came up with from the other answer and it uses subquery and `jsonb_array_elements` – Onni Hakala Jul 26 '23 at 15:41
  • I compared the 2 approaches with EXPLAIN ANALYZE, and your approach is actually a bit slower. My answer: Planning Time: 0.404 ms, Execution Time: 0.198 ms. Your answer: Planning Time: 0.456 ms, Execution Time: 0.215 ms. I know it's a negligible difference, but please provide me a dataset where it would be faster to use your approach instead. – Onni Hakala Jul 27 '23 at 07:58

Here's the query I ended up using. The approach of using subqueries and jsonb_array_elements was something my acquaintance Tommi Vainikainen came up with:


SELECT
  e.identifier,
  (
    SELECT e2.event_data->>'target'
    FROM events e2
    WHERE e2.identifier=e.identifier
    AND e2.event_data->>'target' IS NOT NULL
    ORDER by timestamp DESC LIMIT 1
  ) AS target,
  (
    SELECT e2.event_data->>'user'
    FROM events e2
    WHERE e2.identifier=e.identifier
    AND e2.event_data->>'user' IS NOT NULL
    ORDER by timestamp DESC LIMIT 1
  ) AS "user", -- quoted: USER is a reserved word in PostgreSQL
  ARRAY(
    SELECT jsonb_array_elements(e2.event_data->'ingredients')
    FROM events e2
    WHERE e2.identifier=e.identifier
    AND e2.event_data->>'ingredients' IS NOT NULL
  ) AS ingredients,
  MAX(timestamp) AS latest_update
FROM events e
GROUP BY identifier;
  • Avoid using subqueries in the SELECT clause if possible. Each subquery runs once for every row returned by the main query, which can potentially lead to performance issues. An alternative is to use JOINs or, in your case, window functions like FIRST_VALUE to get the latest target and user. Use jsonb_array_elements carefully, because it expands the JSON array into a set of JSON values and can have performance issues on large datasets. – Amira Bedhiafi Jul 26 '23 at 19:57
  • And FIRST_VALUE is hard here because of the NULLs. I would be happy to see an example of how you would solve this problem with that. – Onni Hakala Jul 27 '23 at 08:31