SQL to merge rows

Question

Update

How I can select the rows from a table that when ordered

the first element matches some row
the second element matches the next row
the third element the next row after the second row
the forth element the next row after the third row
and so on until the end of values in array?

The Logic

Supposing I have these rows as result of a query (table token holds id and word and table positioning holds id and position):

 id | word | textblockid |sentence |position 
 5  | Fear |      5      |    1    |    1
 8  | of   |      5      |    1    |    2
 6  | the  |      5      |    1    |    3
 7  | Dark |      5      |    1    |    4
 9  | is   |      5      |    1    |    5

I can have this spread in the table with different textblockids, sentences and positions.

I want to transform in this:

 id  | word             | textblockid | sentence |position 
 10  | Fear of the Dark |      5      |     1    |    1
  9  | is               |      5      |     1    |    2

I'm doing a function that receives an array with the ids to merge, something like merge_tokens('{5,8,6,7}').

I insert the new word Fear of the Dark in the table token and get the generated id (as in example, id is 10). This is easy.

The Question

I need to update the id of the first word (in this case, Fear) to 10 and delete the next words(of,the,Dark).

My doubt is how I can perform these operations. I think I need to SELECT from an ordered table where the first row id matches the first element in the id array, the second row id matches the second element id array and so on, and after this, update the first element and remove the next ones.

I can't delete just delete the other rows by id because they are used in other words. I only will delete the of where the previous is Fear, the next is of and the next Dark. Following this rule, I only can delete the where the previous is of, the other previous Fear and the next is Dark.

As example, I can have in the same table something like that can't be affected:

 id  | word      | textblockid |sentence |position 
 6   | the       |      8      |    3    |    10
 11  | sound     |      8      |    3    |    21
 8   | of        |      8      |    3    |    12
 6   | the       |      8      |    3    |    13
 7   | mountain  |      8      |    3    |    14

Ah, Renanto is back for his daily fix. :) Your first paragraph is unclear: one or two tables? — Erwin Brandstetter, Nov 23 '11 at 19:34
@ErwinBrandstetter Two tables, one for text which each entrance is a unique word, and the other for positioning the words in a text which contains the textblock identification, the sentence and the position inside the sentence. — Renato Dinhani, Nov 23 '11 at 19:50
@ErwinBrandstetter And yes, daily questions because I need to hurry up with my final grade work until December (or January or February). :) After that, I will still asking too. — Renato Dinhani, Nov 23 '11 at 19:52
Renato (good luck with your grading, btw!), your question is unclear. First you write: `delete the next words(of,the,Dark)`, later you write something else. I went with the first version in my answer ... — Erwin Brandstetter, Nov 23 '11 at 20:03
@ErwinBrandstetter Let's clarify: I need to update and delete where the ids rows sequence are exactly `5,8,6,7`, but not `5,6,7,8` or `8,6,7,5` or any another possibility, but strictly `5,8,6,7`. — Renato Dinhani, Nov 23 '11 at 20:11
Please edit your question to clarify (including my previous gripes). The sequence of ids is dictated by the column `position`, right? My answer works with that. — Erwin Brandstetter, Nov 23 '11 at 20:16
Except you did not clarify the rest. Table definition is still unclear. I think my answer already addresses your update, btw. — Erwin Brandstetter, Nov 23 '11 at 20:26
Which PostgreSQL version do you use? I think this could be fairly easy with the new writeable CTEs introduced in 9.1 — , Nov 24 '11 at 17:12
@a_horse_with_no_name CTEs? I don't know what are these. And I'm using the last version, 9.1. Can you send me a link or explain? — Renato Dinhani, Nov 24 '11 at 17:26
Here are some: http://www.depesz.com/index.php/2011/03/16/waiting-for-9-1-writable-cte/ and http://johtopg.blogspot.com/2010/06/writeable-ctes.html and http://xzilla.net/blog/2011/Mar/Upserting-via-Writeable-CTE.html — , Nov 24 '11 at 18:39

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

Best do this in one transaction:

UPDATE token
SET    word = (
    SELECT string_agg(word, ' '  ORDER BY position)
    FROM   token
    WHERE  id = ANY('{5,8,6,7}'::int[])
    )
      ,id = nextval('token_id_seq')
WHERE  id = ('{5,8,6,7}'::int[])[1];

DELETE FROM token
WHERE  id = ANY('{5,8,6,7}'::int[])
AND    id <> ('{5,8,6,7}'::int[])[1];

Replace '{5,8,6,7}'::int[] with your integer array parameter.
I get the new id from the sequence I assume exists.
I further assume that the ordering in array concurs with the ordering by position. Alternative version follows below.
id to be updated is the first element of the array.

Ordering of the words can be done inside the aggregate function (since PostgreSQL 9.0). Read about that in the manual.

Answer to additional question

Order selected rows according to sequence of array elements:

SELECT rn, t.*
FROM   (
    SELECT id
          ,row_number() OVER () AS rn
    FROM (SELECT unnest('{5,8,6,7}'::int[]) id) x
    )  x
JOIN   token t USING (id)
ORDER  BY rn;

Or ... does the same with different techniques, works in older versions of Postgres, too:

SELECT rn, t.*
FROM   (
    SELECT rn
          ,a[rn] AS id
    FROM (SELECT '{5,8,6,7}'::int[] AS a
                ,generate_series(1, array_upper('{5,8,6,7}'::int[], 1)) rn) x
    )  x
JOIN   token t USING (id)
ORDER  BY rn;

Combination

Use that in the UPDATE statement:

UPDATE token
SET    word = (
    SELECT string_agg(word, ' '  ORDER BY rn)
    FROM   (
    SELECT rn
          ,a[rn] AS id
    FROM  (
           SELECT '{5,8,6,7}'::int[] AS a
                 ,generate_series(1, array_upper('{5,8,6,7}'::int[], 1)) rn) x
          ) x
    JOIN   token t USING (id)
    )
      ,id = nextval('token_id_seq')
WHERE  id = ('{5,8,6,7}'::int[])[1];

score 1 · Answer 2 · answered Nov 23 '11 at 20:41

This fragment does not use arrays. (I don't like arrays)

set search_path='tmp';

DROP TABLE wordlist;
CREATE TABLE wordlist
    ( id INTEGER NOT NULL PRIMARY KEY
    , word varchar
    , textblockid INTEGER NOT NULL
    , sentence INTEGER NOT NULL
    , postion INTEGER NOT NULL
    , UNIQUE (textblockid,sentence,postion)
    );

INSERT INTO wordlist(id,word,textblockid,sentence,postion) VALUES
 (5 , 'Fear', 5 , 1 , 1 )
,(8 , 'of', 5 , 1 , 2 )
,(6 , 'the', 5 , 1 , 3 )
,(7 , 'Dark', 5 , 1 , 4 )
,(9 , 'is', 5 , 1 , 5 )
    ;

WITH RECURSIVE meuk AS (
    SELECT 0 AS lev
        , id,word AS words
        , textblockid,sentence,postion AS lastpos
    FROM wordlist
    UNION
    SELECT 1+ mk.lev AS lev
        , wl.id
        , mk.words || ' '::text || wl.word AS words
        , wl.textblockid,wl.sentence
        , wl.postion AS lastpos
    FROM meuk mk
    JOIN wordlist wl ON (wl.textblockid = mk.textblockid
        AND wl.sentence = mk.sentence
        AND wl.postion = mk.lastpos+1)
    )
SELECT * FROM meuk
WHERE lev = 3
    ;

results:

SET
DROP TABLE
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "wordlist_pkey" for table "wordlist"
NOTICE:  CREATE TABLE / UNIQUE will create implicit index "wordlist_textblockid_sentence_postion_key" for table "wordlist"
CREATE TABLE
INSERT 0 5
 lev | id |      words       | textblockid | sentence | lastpos 
-----+----+------------------+-------------+----------+---------
   3 |  7 | Fear of the Dark |           5 |        1 |       4
   3 |  9 | of the Dark is   |           5 |        1 |       5
(2 rows)

BTW I don't understand your question at all. Why/what do you want to delete? But I think the recursive query can serve you to at least find the rows you want to select/delete/update. — wildplasser, Nov 23 '11 at 21:14

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

After answering most of your recent questions I have a vague idea of what you are doing. So I had a closer look at your solution and optimized quite a bit. Mostly I simplified the code, but there are some substantial improvements, too.

Some points:

Don't use the undocumented assignment operator = in plpgsql. Use := instead. See this related question for more info.
Why LOOP BEGIN? A separate code block only slows down if you don't need it. Removed it.
Many more, I added a few comments

Please have a look at the code side-by-side for some hints.
Test the two versions to see which performs faster.

For your consideration:

CREATE OR REPLACE FUNCTION merge_tokens(words varchar[], separator varchar)
  RETURNS VOID AS
$body$
DECLARE         
    r              record;
    current_id     integer;
    ids            integer[];
    generated_word varchar :=  '';  -- you can initialize variables at declaration time. Saves additional assignment.

BEGIN
    -- get the ids and generate the word
    RAISE NOTICE 'Getting ids and generating words';
    generated_word := array_to_string(words, separator);  -- 1 assignment is much cheaper. Also: no trim() needed.
    ids := ARRAY
    (  SELECT t.id
       FROM  (
          SELECT row_number() OVER () AS rn, text
          FROM  (SELECT unnest(words) AS text) x) y
          JOIN   token t USING (text)
       ORDER  BY rn);
    RAISE NOTICE 'Generated word: %', generated_word;

    -- check if the don't exists to insert it
    SELECT INTO current_id  t.id FROM token t WHERE t.text = generated_word; 
    IF NOT FOUND THEN
        RAISE NOTICE 'Word don''t exists';
        INSERT INTO token(text) VALUES(generated_word)
        RETURNING id
        INTO current_id;  --get the last value without additional query.
    END IF;
    RAISE NOTICE 'Word id: %', current_id;

    -- select the records that will be updated
    RAISE NOTICE 'Getting words to be updated.';
    FOR r IN
        SELECT textblockid, sentence, position, tokenid, rn
        FROM
        ( -- select the rows that are complete
          SELECT textblockid, sentence, position, tokenid, rn, count(*) OVER (PARTITION BY grp) AS counting
          FROM
          ( -- match source with lookup table
                SELECT source.textblockid, source.sentence, source.position, source.tokenid, source.rn, source.grp
                FROM
                (   -- select textblocks where words appears with row number to matching
                     SELECT tb.textblockid, tb.sentence, tb.position, tb.tokenid, grp
                                           ,CASE WHEN grp > 0 THEN
                                            row_number() OVER (PARTITION BY grp ORDER BY tb.textblockid, tb.sentence, tb.position)
                                            END AS rn               
                     FROM
                     (   -- create the groups to be used in partition by to generate the row numbers
                          SELECT tb.textblockid, tb.sentence, tb.position, tb.tokenid
                                ,SUM(CASE WHEN tb.tokenid = ids[1] THEN 1 ELSE 0 END) OVER (ORDER BY tb.textblockid, tb.sentence, tb.position) AS grp
                          FROM  textblockhastoken tb
                          JOIN
                          (   --select the textblocks where the word appears
                                SELECT textblockid, sentence
                                FROM   textblockhastoken tb
                                WHERE  tb.tokenid = ids[1]
                          ) res USING (textblockid, sentence)
                     ) tb
                ) source
                -- create the lookup table to match positions
                JOIN (SELECT row_number() OVER () as rn, id FROM unnest(ids) AS id) lookup USING (rn)
                WHERE source.tokenid = lookup.id
          ) merged
        ) g  
        WHERE g.counting = array_length(ids,1)
        ORDER BY g.rn --order by row number to update first, delete and change positions after
    LOOP
        --check if update or delete
        IF (r.rn = 1) THEN
            RAISE NOTICE 'Updating word in T:% S:% P:%', r.textblockid, r.sentence, r.position;
            UPDATE textblockhastoken tb SET tokenid = current_id
            WHERE (tb.textblockid, tb.sentence, tb.position)
                = ( r.textblockid,  r.sentence,  r.position);
        ELSE
            RAISE NOTICE 'Deleting word in T:% S:% P:%', r.textblockid, r.sentence, r.position;
            DELETE FROM textblockhastoken tb
            WHERE (tb.textblockid, tb.sentence, tb.position)
                = ( r.textblockid,  r.sentence,  r.position);
        END IF;
        --check if is the last word to update the positions
        IF (r.rn = array_length(ids,1)) THEN
            RAISE NOTICE 'Changing positions in T:% S:%', r.textblockid, r.sentence;
            UPDATE textblockhastoken tb SET position = new_position
            FROM
            (   SELECT textblockid, sentence, position
                      ,row_number() OVER (PARTITION BY tb.textblockid, tb.sentence ORDER BY tb.position) as new_position
                FROM   textblockhastoken tb
                WHERE  tb.textblockid = r.textblockid AND tb.sentence = r.sentence
            ) np
            WHERE (tb.textblockid, tb.sentence, tb.position)
                = (np.textblockid, np.sentence, np.position)
            AND    tb.position <> np.new_position;
        END IF;
    END LOOP;
END;
$body$ LANGUAGE plpgsql;

Thank you. I will test this code and I think it will work. I'm learning SQL and PG/PLSQL more deeper since 2 weeks ago, but I'm still a beginner in this area, so my code have a lot of faults, but thank you for your corrections. And the concept of my project is simple: a webcrawler and a data analyser that classifies the downloaded text to some use (I cutted off the usage because will not have time to develop this part in my project). — Renato Dinhani, Nov 26 '11 at 03:37
@RenatoDinhaniConceição: actually, your code has become *much* better over the last few weeks. You have picked up quite a few of the hints you got here. — Erwin Brandstetter, Nov 26 '11 at 03:41
Can I have your opinion about one thing? A friend said me that having much sub-queries is not recommended and can be signal of a design fault. In this procedure, there is a lot of subqueries because the logic a bit complex, but I think my design is correct and I don't see a way to improve (the only way is creating a materialized view with some results). Do you think that subqueries are evil and their existence is a signal of a design fault? — Renato Dinhani, Nov 26 '11 at 03:56
@RenatoDinhaniConceição: Not at all. The same rule applies for subqueries that applies to any computation: use as few a possible but as many as necessary. Sure, there may be improvements possible to your design, but you cannot judge that from a couple of subqueries in a query. The resulting performance will be the judge of that. — Erwin Brandstetter, Nov 26 '11 at 04:04
The `ids` is being interpreted as `VARCHAR` in the `WHERE` clauses. It's need a conversion in the array when it created with the `ARRAY()` function? — Renato Dinhani, Nov 26 '11 at 14:00
Ok, let's explaing better: in places where have code similar as `WHERE tb.tokenid = ids[1]`, I'm getting an error message saying that is impossible compare a INTEGER to a VARCHAR. To solve this, I putted conversions and the working code is like this: `WHERE tb.tokenid = ids[1]::INTEGER`. I don't understand why I'm getting this error because ids is declared as `INTEGER[]`, but the error is occurring. I think that is something related to the `ARRAY`function that is creating the `ids` array. — Renato Dinhani, Nov 26 '11 at 15:11
@RenatoDinhaniConceição: That does not make sense at all. There must be a misunderstanding. `ids[1]` is integer, casting it to integer is a pointless operation. Is `tokenid` defined as `varchar`? Or the error applies to a different part of the code? — Erwin Brandstetter, Nov 26 '11 at 15:18
My fault. I was doing some maintenance in the code and changed the ids to `VARCHAR[]` ¬¬. It's working fine. Later I will post a simplified version of the query with less rows and subqueries. — Renato Dinhani, Nov 26 '11 at 15:32

score 0 · Answer 4 · answered Nov 23 '11 at 19:33

0

Is this something you could do as part of your merge_tokens function? Seems like you could just have that function keep track of which records need to be updated/deleted, simply based on the provided array (first element updated, the rest deleted).

answered Nov 23 '11 at 19:33

Jake Feasel

16,785
5
53
66

score 0 · Answer 5 · edited May 23 '17 at 10:08

This answer is for my particular case. I don't know if is the best way, but works for me.

I build this procedure with answer from these questions: Is possible have different conditions for each row in a query? and How create a WINDOW in PostgreSQL until the same value appears again?

The FOREARCH is only working in PostgreSQL 9.1.

CREATE OR REPLACE FUNCTION merge_tokens(words VARCHAR[], separator VARCHAR)
RETURNS VOID
AS $$
DECLARE         
    r RECORD;
    current_id INTEGER;
    current_word VARCHAR;       
    ids INTEGER[];
    generated_word VARCHAR;

BEGIN       
    -- get the ids and generate the word
    RAISE NOTICE 'Getting ids and generating words';
    generated_word = '';
    FOREACH current_word IN ARRAY words
    LOOP BEGIN                      
        generated_word = generated_word || current_word;
        generated_word = generated_word || separator;
        SELECT t.id INTO current_id FROM token t WHERE t.text = current_word;
        ids = ids || current_id;
    END;
    END LOOP;

    -- remove lead and ending spacing in word
    RAISE NOTICE 'Generated word: %', generated_word;
    generated_word = TRIM(generated_word);

    -- check if the don't exists to insert it
    SELECT t.id INTO current_id FROM token t WHERE t.text = generated_word; 
    IF (current_id IS NULL) THEN
        RAISE NOTICE 'Word don''t exists';
        INSERT INTO token(id,text) VALUES(nextval('tokenidsqc'),generated_word);
        current_id = lastval(); --get the last value from the sequence      
    END IF;
    RAISE NOTICE 'Word id: %', current_id;

    -- select the records that will be updated
    RAISE NOTICE 'Getting words to be updated.';
    FOR r IN SELECT grouping.textblockid, grouping.sentence, grouping.position, grouping.tokenid, grouping.row_number
    FROM
    (
        -- select the rows that are complete
        SELECT merged.textblockid, merged.sentence, merged.position, merged.tokenid,merged.row_number,count(*) OVER w as counting           
        FROM
        (
            -- match source with lookup table
            SELECT source.textblockid, source.sentence, source.position, source.tokenid,source.row_number, source.grp
            FROM
            (   -- select textblocks where words appears with row number to matching
                SELECT tb.textblockid, tb.sentence, tb.position, tb.tokenid, grp,
                    CASE WHEN grp > 0 THEN
                        row_number() OVER (PARTITION BY grp ORDER BY tb.textblockid,tb.sentence,tb.position)
                    END AS row_number               
                FROM
                (   -- create the groups to be used in partition by to generate the row numbers
                    SELECT tb.textblockid, tb.sentence, tb.position, tb.tokenid,
                        SUM(CASE WHEN tb.tokenid = ids[1] THEN 1 ELSE 0 END) OVER (ORDER BY tb.textblockid,tb.sentence,tb.position) AS grp
                    FROM textblockhastoken tb,
                    (   --select the textblocks where the word appears
                        SELECT textblockid, sentence
                        FROM textblockhastoken tb
                        WHERE tb.tokenid = ids[1]
                    )res
                    WHERE tb.textblockid = res.textblockid
                    AND tb.sentence = res.sentence                      
                )tb
            )source,
            -- create the lookup table to match positions
            (
                SELECT row_number() OVER () as row_number,id FROM unnest(ids::INTEGER[]) as id
            )lookup
            WHERE source.tokenid = lookup.id
            AND source.row_number = lookup.row_number
        )merged
        WINDOW w AS (PARTITION BY grp)
    ) grouping      
    WHERE grouping.counting = array_length(ids,1)
    ORDER BY grouping.row_number --order by row number to update first, delete and change positions after
    -- end of query and start of iterations actions
    LOOP BEGIN
        --check if update or delete
        IF (r.row_number = 1) THEN
            RAISE NOTICE 'Updating word in T:% S:% P:%', r.textblockid, r.sentence, r.position;
            UPDATE textblockhastoken tb SET tokenid = current_id
            WHERE tb.textblockid = r.textblockid 
            AND tb.sentence = r.sentence
            AND tb.position = r.position;
        ELSE
            RAISE NOTICE 'Deleting word in T:% S:% P:%', r.textblockid, r.sentence, r.position;
            DELETE FROM textblockhastoken tb
            WHERE tb.textblockid = r.textblockid 
            AND tb.sentence = r.sentence
            AND tb.position = r.position;
        END IF;
        --check if is the last word to update the positions
        IF (r.row_number = array_length(ids,1)) THEN
            RAISE NOTICE 'Changing positions in T:% S:%', r.textblockid, r.sentence;
            UPDATE textblockhastoken tb SET position = new_position
            FROM
            (   
                SELECT textblockid, sentence, position, row_number() OVER w as new_position
                FROM textblockhastoken tb
                WHERE tb.textblockid = r.textblockid AND tb.sentence = r.sentence
                WINDOW w AS (PARTITION BY tb.textblockid, tb.sentence ORDER BY tb.position)             
            )new_positioning                
            WHERE tb.textblockid = new_positioning.textblockid 
            AND tb.sentence = new_positioning.sentence
            AND tb.position = new_positioning.position
            AND tb.position <> new_positioning.new_position;
        END IF;
    END;
    END LOOP;
END 
$$
LANGUAGE plpgsql;