0

I have a large sourceTable (300 Million rows) with several columns:

id sequence orientation gc_content – etc

most sequences appear only once in the table, but others might be in there much more often

I need to know how often each sequence appears in the sourceTable, so I created a table targetTable with 3 columns:

id sequence frequency

The column sequence has been set to be unique.

I filled this targetTable with all distinct sequence values and set the frequency to 0 by default in each.

Now I want to select all the sequences from the sourceTable and "insert" them into the targetTable (which can't happen) and ON CONFLICT increase the frequency for that sequence by 1 (not sure that is the most intelligent way of going about it).

The code works fine for the first part

INSERT INTO targetTable (sequence, frequency)
(SELECT sequence, 1
FROM sourceTable)
ON CONFLICT (sequence)
DO NOTHING

but when I try

INSERT INTO targetTable (sequence, frequency)
(SELECT sequence, 1
FROM sourceTable)
ON CONFLICT (sequence)
--DO NOTHING
DO UPDATE SET
frequency = targetTable.frequency + 1
WHERE
sequence = targetTable.sequence -- = sourceTable.sequence
;

I get the message that sequence is ambiguous.

To make sequence not ambiguous I tried

WHERE
targetTable.sequence = sourceTable.sequence

and

WHERE
sourceTable.sequence = targetTable.sequence 

but then I get "ERROR: missing FROM-clause entry for table "sourceTable" "

What am I missing here?

Thanks for any advice - I would have liked to figure that one out on my own as it seems like a common problem that should be easy to research on the internet, but I banged my head against that since yesterday morning and am running out of time …

1 Answers1

1

The virtual table associated with failed INSERT rows is named EXCLUDED, so the SQL would be as follows, except that each row can only be affected once:

INSERT INTO targetTable (sequence, frequency)
SELECT sequence, 1
  FROM sourceTable
ON CONFLICT (sequence)
  DO UPDATE SET frequency = targetTable.frequency + 1
  WHERE EXCLUDED.sequence = targetTable.sequence;

This is documented in PostgreSQL 15 Documention - INSERT SQL Command. Note, the SELECT clause didn't need to be enclosed in parentheses. Avoid using sequence as a column identifier since it is a SQL key word.

A proper approach to get the counts would be to use the COUNT aggregate function grouped by sequence:

INSERT INTO targetTable(sequence, frequency)
SELECT sequence, COUNT(*) AS frequency
  FROM sourceTable
  GROUP BY sequence;
JohnH
  • 2,001
  • 1
  • 2
  • 13
  • Thanks, but that results in another error: "ERROR: ON CONFLICT DO UPDATE command cannot affect row a second time" – Markus Winter May 16 '23 at 05:01
  • @MarkusWinter, after I posted, I realized that I neglected to include that detail. I've revised my answer with a working solution as well as my original response addressing the ambiguous column exception. – JohnH May 16 '23 at 05:14
  • Thanks John, the second solution works beautifully and very quickly too! 40 sec for a table with 33 Million rows on my 13 year old Mac (from 2010!). Perfect! – Markus Winter May 16 '23 at 05:21