How do i use STRING_AGG and avoid duplicate values?

Question

I face the following problem: in a SQL table i have stored Information about persons, date and diseases.

CREATE TABLE Diseases (person varchar(80), time date, disease varchar(80))

INSERT INTO Diseases (person, time, disease) VALUES ('Harry', '2018-03-06', 'A30')
INSERT INTO Diseases (person, time, disease) VALUES ('Harry', '2017-04-06', 'C27')
INSERT INTO Diseases (person, time, disease) VALUES ('Harry', '2016-08-12', 'Z07')
INSERT INTO Diseases (person, time, disease) VALUES ('Harry', '2016-08-12', 'C27')
INSERT INTO Diseases (person, time, disease) VALUES ('Ruth', '2018-08-12', 'B02')
INSERT INTO Diseases (person, time, disease) VALUES ('Ruth', '2018-08-12', 'C02')

Now i want to aggregate the disease codes of each person into one variable:

SELECT Diseases.person, 
STRING_AGG(Diseases.disease, ',') AS Diags, 
Diseases.time
FROM Diseases
GROUP BY Diseases.person, Diseases.time

How do I manage to aggregate only those diseases which were recognized before and at the same date of the current case (e.g. Harry should have Z07 and C27 in the row of the 2017 case)?

I want to get a table with name, date and a variable that contains a list of aggregated diseases. How do i manage that only distinct values of the diseases are included?

I tried the following but it doesnt work:

SELECT 
    today.person, 
    today.time, 
    today.disease, 
    STRING_AGG(DISTINCT past.DISEASE, ',') AS disease_history
FROM diseases today
LEFT JOIN diseases past 
    ON past.person = today.person AND past.time <= today.time
GROUP BY today.person, today.time, today.disease

To specify which records should be returned SQL has the `WHERE` clause. That should work for this case. — Lars Br., Jan 20 '22 at 07:46
I tried this but have no idea how to express that every record until the date of the case should be captured. — Philipp Schulz, Jan 20 '22 at 07:52
Doesn't this work? SELECT Diseases.person, STRING_AGG(Diseases.disease, ',') AS Diags FROM Diseases WHERE Diseases.time < '2018-03-06' GROUP BY Diseases.person — stersym, Jan 20 '22 at 08:05

Mathias Kemeter · Answer 1 · 2022-01-24T12:56:51.040

1

The requirement is not fully clear and leaves room for interpretation. Still let me take a stab to answer, what I think you've asked.

You can easily enhance your select statement with a where clause to aggregate only the disease prior to a specific date.

SELECT Diseases.person, 
STRING_AGG(Diseases.disease, ',') AS Diags, 
Diseases.time
FROM Diseases
WHERE Diseases.time < '2018-08-12'
GROUP BY Diseases.person, Diseases.time

However, I suspect, this is not what you want. Note, that you use the group by clause, which means that you would usually retrieve less records from your query than you have in the base table.

My guess is that you want to retrieve the same number of records, but for each individual case see the aggregated diseases that you have seen before and including (as you've expressed in a comment below) the respective date. If that's the case, you can solve this via a self join with diseases:

SELECT 
    today.person, 
    today.time, 
    today.disease, 
    STRING_AGG(past.DISEASE, ',') AS disease_history
FROM diseases today
LEFT JOIN diseases past 
    ON past.person = today.person AND past.time <= today.time
GROUP BY today.person, today.time, today.disease

Note that left join and join would yield the same result in this case as each record is now joined with its own.

In another comment, you've expressed the requirement to eliminate duplicate entries of historical diseases. This has to be done before feeding the join into function STRING_AGG. The easiest way is probably to use a sub-select to eliminate those duplicates via aggregation or distinct clause:

SELECT 
    person, 
    time, 
    disease, 
    STRING_AGG(past_disease, ',') as disease_history
FROM
(
    SELECT DISTINCT
        today.person, 
        today.time, 
        today.disease, 
        past.disease as past_disease 
    FROM diseases today
    LEFT JOIN diseases past 
        ON past.person = today.person AND past.time <= today.time
)
GROUP BY person, time, disease
ORDER BY person, time

As a result, you will find the historical diseases including the current occurrence with duplicates removed:

edited Jan 24 '22 at 12:56

answered Jan 20 '22 at 08:13

Mathias Kemeter

933
2
11

Thank you. Thats almost the right solution but i want to have the current diseases included instead of the NULL values. – Philipp Schulz Jan 20 '22 at 09:50
I updated the answer with your requirement. – Mathias Kemeter Jan 20 '22 at 10:12
Good solution- after the OPs comments I thought that this might be what was requested. I’m wondering if the order of prior diseases matter for this. – Lars Br. Jan 20 '22 at 20:00
Good point! If this is a requirement you can use the ORDER BY clause of the STRING_AGG function. Should be something like STRING_AGG(past.DISEASE, ',' ORDER BY past.time). See here: https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/a924ee1e98ab435a874efa32e6f0ae14.html – Mathias Kemeter Jan 21 '22 at 07:22
Actually the order doesnt matter but i face the problem that i have duplicates in disease history. – Philipp Schulz Jan 21 '22 at 11:54
Meanwhile, we are quite far from your original question. Maybe it's worth editing it to properly formalise your requirements? – Mathias Kemeter Jan 21 '22 at 12:06
Allright, i did so and used this description: https://popsql.com/learn-sql/postgresql/how-to-use-stringagg-in-postgresql Unfortunately it doesnt work. – Philipp Schulz Jan 24 '22 at 12:03
Removing the duplicates is just data wrangling - not so much a feature of `STRING_AGG`. Thanks for providing new sample data. I added the requirement to the answer. – Mathias Kemeter Jan 24 '22 at 12:52
Thank you very much Mathias! I also tried the subquery, but I mistakenly positioned the left join outside of the subquery. – Philipp Schulz Jan 24 '22 at 14:28

How do i use STRING_AGG and avoid duplicate values?

1 Answers1

Linked