
I can't figure out how to protect my table against duplicate combinations of attributes_positions. The best way to show you what I mean is the following image:

(image)

The column id_combination represents the number of the combination. A combination consists of attributes_positions; in other words, a combination is a sequence of attributes_positions.

Now I want to prevent inserting exactly the same sequence of attributes_positions into the table.

Of course, if an already inserted combination contains one attributes_positions more or one fewer than the combination being inserted, that is fine.

In the image I show the difference between a duplicate and a non-duplicate combination.

Is there some way to do that? Maybe something like a 'before update' trigger? But how would I implement it for this example? I'm not very good with advanced SQL. The database where I'm trying to secure the table is PostgreSQL 9.4.

I will be grateful for any help.


3 Answers


My answer assumes that the target table contains no dupes, and that we want to insert a new set which happens to be a duplicate. I chose the group of 4 rows with id_comb = 1.

You would have to put the group of 4 into a staging table. Then, you pivot both staging and target horizontally, so that you get 5 columns named attr_pos1 to attr_pos5 (the biggest group in your example has 5 members). To pivot, you need a sequence number per group, which we get using ROW_NUMBER(), for both tables, staging and target. Then, you pivot both. Finally, you join the pivoted staging and target on all 5 attr_pos# columns and count the rows. If you get 0, there is no duplicate. If you get 1, the new set is a duplicate.

Here's the whole scenario:

    WITH
    -- input section: a) target table, no dupes
    target(id_comb, attr_pos) AS (
              SELECT 2,1
    UNION ALL SELECT 2,2
    UNION ALL SELECT 2,3
    UNION ALL SELECT 2,4
    UNION ALL SELECT 3,1
    UNION ALL SELECT 3,2
    UNION ALL SELECT 3,3
    UNION ALL SELECT 3,4
    UNION ALL SELECT 3,5
    UNION ALL SELECT 4,1
    UNION ALL SELECT 4,2
    UNION ALL SELECT 4,3
    )
    ,
    -- input section: b) staging, input, would be a dupe
    staging(id_comb, attr_pos) AS (
              SELECT 1,1
    UNION ALL SELECT 1,2
    UNION ALL SELECT 1,3
    UNION ALL SELECT 1,4
    )
    ,
    -- query section:
    -- add sequence numbers to staging and target
    target_s AS (
    SELECT
      ROW_NUMBER() OVER(PARTITION BY id_comb ORDER BY attr_pos) AS seq
    , *
    FROM target
    )
    ,
    staging_s AS (
    SELECT
      ROW_NUMBER() OVER(PARTITION BY id_comb ORDER BY attr_pos) AS seq
    , *
    FROM staging
    )
    ,
    -- horizontally pivot target, NULLs as -1 for the later join
    -- (COALESCE, not IFNULL: IFNULL does not exist in PostgreSQL)
    target_h AS (
    SELECT
      id_comb
    , COALESCE(MAX(CASE seq WHEN 1 THEN attr_pos END), -1) AS attr_pos1
    , COALESCE(MAX(CASE seq WHEN 2 THEN attr_pos END), -1) AS attr_pos2
    , COALESCE(MAX(CASE seq WHEN 3 THEN attr_pos END), -1) AS attr_pos3
    , COALESCE(MAX(CASE seq WHEN 4 THEN attr_pos END), -1) AS attr_pos4
    , COALESCE(MAX(CASE seq WHEN 5 THEN attr_pos END), -1) AS attr_pos5
    FROM target_s
    GROUP BY id_comb
    )
    ,
    -- horizontally pivot staging, NULLs as -1 for the later join
    staging_h AS (
    SELECT
      id_comb
    , COALESCE(MAX(CASE seq WHEN 1 THEN attr_pos END), -1) AS attr_pos1
    , COALESCE(MAX(CASE seq WHEN 2 THEN attr_pos END), -1) AS attr_pos2
    , COALESCE(MAX(CASE seq WHEN 3 THEN attr_pos END), -1) AS attr_pos3
    , COALESCE(MAX(CASE seq WHEN 4 THEN attr_pos END), -1) AS attr_pos4
    , COALESCE(MAX(CASE seq WHEN 5 THEN attr_pos END), -1) AS attr_pos5
    FROM staging_s
    GROUP BY id_comb
    )
    SELECT
      COUNT(*)
    FROM target_h
    JOIN staging_h USING (
      attr_pos1
    , attr_pos2
    , attr_pos3
    , attr_pos4
    , attr_pos5
    );

Hope this helps ---- Marco

0
        -- The data
CREATE TABLE theset (
        set_id INTEGER NOT NULL PRIMARY KEY
        , set_name text UNIQUE
        );
INSERT INTO theset(set_id, set_name) VALUES
( 1, 'one'), ( 2, 'two'), ( 3, 'three'), ( 4, 'four');

CREATE TABLE theitem (
        item_id integer NOT NULL PRIMARY KEY
        , item_name text UNIQUE
        );
INSERT INTO theitem(item_id, item_name) VALUES
( 1, 'one'), ( 2, 'two'), ( 3, 'three'), ( 4, 'four'), ( 5, 'five');

CREATE TABLE set_item (
        set_id integer NOT NULL REFERENCES theset (set_id)
        , item_id integer NOT NULL REFERENCES theitem(item_id)
        , PRIMARY KEY (set_id,item_id)
        );
        -- swapped index is indicated for junction tables
CREATE UNIQUE INDEX ON set_item(item_id, set_id);

INSERT INTO set_item(set_id,item_id) VALUES
(1,1), (1,2), (1,3), (1,4),
(2,1), (2,2), (2,3), -- (2,4),
(3,1), (3,2), (3,3), (3,4), (3,5),
(4,1), (4,2), (4,4);

CREATE FUNCTION set_item_unique_set( ) RETURNS TRIGGER AS
$func$
DECLARE
        the_set integer;
BEGIN
-- in a DELETE, NEW is not assigned; use OLD instead
the_set := CASE WHEN TG_OP = 'DELETE' THEN OLD.set_id ELSE NEW.set_id END;

IF EXISTS ( -- other set ...
        SELECT * FROM theset oth
        WHERE oth.set_id <> the_set

        -- ... with exactly the same members:
        -- count (common) members in the two sets;
        -- items not in common will have count=1
        AND NOT EXISTS (
                SELECT item_id FROM set_item x1
                WHERE (x1.set_id = the_set OR x1.set_id = oth.set_id )
                GROUP BY item_id
                HAVING COUNT(*) = 1
                )

        ) THEN
        RAISE EXCEPTION 'Not unique set';
END IF;

RETURN NEW;
END;
$func$ LANGUAGE plpgsql;

CREATE CONSTRAINT TRIGGER check_item_set_unique
        AFTER UPDATE OR INSERT OR DELETE
        -- BEFORE UPDATE OR INSERT
        ON set_item
        FOR EACH ROW
        EXECUTE PROCEDURE set_item_unique_set()
        ;

-- Test it
INSERT INTO set_item(set_id,item_id) VALUES(4,5); -- success
INSERT INTO set_item(set_id,item_id) VALUES(2,4); -- failure
DELETE FROM set_item WHERE set_id=1 AND item_id= 4; -- failure

Note: There should also be a trigger for the DELETE case.


UPDATE: added handling of DELETE

(the handling of deletes is not perfect; imagine the case where the last element from a set is removed)
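To make the caveat concrete, here is a sketch (values taken from the sample data above) of why per-row delete checks are tricky:

```sql
-- Set 1 is {1,2,3,4} and set 2 is {1,2,3}. Removing item 4 from set 1
-- would make it equal to set 2, so the trigger rejects the DELETE:
DELETE FROM set_item WHERE set_id = 1 AND item_id = 4;  -- raises 'Not unique set'

-- Deleting ALL rows of set 2, however, goes through: by the time the
-- AFTER triggers fire, the set is already empty and duplicates no
-- non-empty set. The empty set 2 then lingers in theset with no
-- set_item rows left to ever trigger a comparison for it again.
DELETE FROM set_item WHERE set_id = 2;
```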

    Probably `(TG_OP = 'DELETE' AND oth.set_id <> OLD.set_id OR TG_OP <> 'DELETE' AND oth.set_id <> NEW.set_id)` – Roman Tkachuk Feb 27 '17 at 08:20
  • Thank you! It is exactly what I was looking for. But I am considering one more thing: is it possible to set the trigger on something like ON COMMIT, not on each row update or delete? Because there will probably be a problem inserting a new sequence which contains already existing attributes_positions. **INSERT INTO set_item(set_id,item_id) VALUES(2,4);** // should fail on its own, but this should work: **INSERT INTO set_item(set_id,item_id) VALUES(2,4);** **INSERT INTO set_item(set_id,item_id) VALUES(2,5);** **INSERT INTO set_item(set_id,item_id) VALUES(2,6);** // after these 3 we have a new sequence – Michał Ziembiński Feb 27 '17 at 13:08
    @MichałZiembiński What you mean is `deferrable` . Add it to the constraint definition(s). Plus: `initially deferred`. – joop Feb 27 '17 at 13:16
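joop's `deferrable` suggestion could be sketched like this (an illustration only, reusing the trigger and table names from the answer above; how the batch from the comment plays out at COMMIT is the assumption being shown):

```sql
-- Re-create the constraint trigger as deferrable, so the checks are
-- postponed until COMMIT instead of running after each row:
DROP TRIGGER IF EXISTS check_item_set_unique ON set_item;
CREATE CONSTRAINT TRIGGER check_item_set_unique
        AFTER UPDATE OR INSERT OR DELETE
        ON set_item
        DEFERRABLE INITIALLY DEFERRED
        FOR EACH ROW
        EXECUTE PROCEDURE set_item_unique_set();

-- The three inserts from the comment can now go into one transaction;
-- the intermediate states (e.g. set 2 = {1,2,3,4}, a dupe of set 1)
-- are never checked, only the final state at COMMIT:
BEGIN;
INSERT INTO set_item(set_id, item_id) VALUES (2,4), (2,5), (2,6);
COMMIT;  -- set 2 is now {1,2,3,4,5,6}, which is unique, so this succeeds
```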
0

An interesting but not very practical solution by @wildplasser. I created a script to insert sample data:

WITH param AS (
    SELECT 8 AS max
), maxarray AS (
    SELECT array_agg(i) as ma FROM (SELECT generate_series(1, max) as i FROM param) as i
), pre AS (
    SELECT
        *
    FROM (
    SELECT
         *, CASE WHEN (id >> mbit) & 1 = 1 THEN ma[mbit + 1] END AS item_id
     FROM (
            SELECT *,
                generate_series(0, array_upper(ma, 1) - 1) as mbit
            FROM (
                    SELECT *,
                        generate_series(1,(2^max - 1)::int8) AS id
                    FROM param, maxarray
                ) AS pre1
        ) AS pre2
    ) AS pre3
    WHERE item_id IS NOT NULL
), ins_item AS (
    INSERT INTO theitem (item_id, item_name) SELECT i, i::text FROM generate_series(1, (SELECT max FROM param)) as i RETURNING *
), ins_set AS (
INSERT INTO theset (set_id, set_name)
SELECT id, id::text FROM generate_series(1, (SELECT 2^max - 1 FROM param)::int8) as id
RETURNING *
), ins_set_item AS (
INSERT INTO set_item (set_id, item_id)
SELECT id, item_id FROM pre WHERE (SELECT count(*) FROM ins_item) > 0 AND (SELECT count(*) FROM ins_set) > 0
RETURNING *
)
SELECT
    'sets', count(*)
FROM ins_set
UNION ALL
SELECT
    'items', count(*)
FROM ins_item
UNION ALL
SELECT
    'sets_items', count(*)
FROM ins_set_item
;

When I ran it with max = 8 (which generates 1024 rows in set_item), it took 21 seconds. That is very bad. With the trigger disabled, it took less than 1 millisecond.

My proposal

It is very interesting to use arrays in this case. Unfortunately PostgreSQL does not support foreign keys for array elements, but that can be emulated with triggers. I removed the set_item table and added an items integer[] column to theset:

        -- The data
CREATE TABLE theitem (
        item_id integer NOT NULL PRIMARY KEY
        , item_name text UNIQUE
        );

CREATE TABLE theset (
        set_id INTEGER NOT NULL PRIMARY KEY
        , set_name text UNIQUE
        , items integer[] UNIQUE NOT NULL
        );

CREATE INDEX i1 ON theset USING gin (items);

CREATE OR REPLACE FUNCTION check_item_CU() RETURNS TRIGGER AS $sql$
BEGIN
    IF (SELECT count(*) > 0 FROM unnest(NEW.items) AS u LEFT JOIN theitem ON (item_id = u) WHERE item_id IS NULL) THEN
        RETURN NULL;
    END IF;

    NEW.items = ARRAY(SELECT unnest(NEW.items) ORDER BY 1);

    RETURN NEW;
END;
$sql$ LANGUAGE plpgsql; 

CREATE TRIGGER check_item_CU BEFORE INSERT OR UPDATE ON theset FOR EACH ROW EXECUTE PROCEDURE check_item_CU();

CREATE OR REPLACE FUNCTION check_item_UD() RETURNS TRIGGER AS $sql$
BEGIN
    IF (TG_OP = 'DELETE' OR TG_OP = 'UPDATE' AND NEW.item_id != OLD.item_id) AND (SELECT count(*) > 0 FROM theset WHERE OLD.item_id = ANY(items)) THEN
        RAISE EXCEPTION 'item_id % still used', OLD.item_id;
        RETURN NULL;
    END IF;

    RETURN NEW;
END;
$sql$ LANGUAGE plpgsql; 

CREATE TRIGGER check_item_UD BEFORE DELETE OR UPDATE ON theitem FOR EACH ROW EXECUTE PROCEDURE check_item_UD();
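A quick usage sketch of the array-based schema (the inserted values are illustrative, not from the question):

```sql
INSERT INTO theitem (item_id, item_name) VALUES (1,'one'), (2,'two'), (3,'three');

-- the BEFORE trigger sorts the array, so {3,1,2} is stored as {1,2,3}
INSERT INTO theset (set_id, set_name, items) VALUES (1, 'a', '{3,1,2}');

-- any permutation of the same items normalizes to the same sorted array,
-- so the plain UNIQUE constraint on items rejects it:
INSERT INTO theset (set_id, set_name, items) VALUES (2, 'b', '{2,1,3}');  -- fails

-- an array referencing a non-existent item is silently dropped
-- (check_item_CU returns NULL, so the row is simply not inserted):
INSERT INTO theset (set_id, set_name, items) VALUES (3, 'c', '{1,99}');
```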

WITH param AS (
    SELECT 10 AS max
), maxarray AS (
    SELECT array_agg(i) as ma FROM (SELECT generate_series(1, max) as i FROM param) as i
), pre AS (
    SELECT
        *
    FROM (
    SELECT
         *, CASE WHEN (id >> mbit) & 1 = 1 THEN ma[mbit + 1] END AS item_id
     FROM (
            SELECT *,
                generate_series(0, array_upper(ma, 1) - 1) as mbit
            FROM (
                    SELECT *,
                        generate_series(1,(2^max - 1)::int8) AS id
                    FROM param, maxarray
                ) AS pre1
        ) AS pre2
    ) AS pre3
    WHERE item_id IS NOT NULL
), pre_arr AS (
SELECT id, array_agg(item_id) AS items
FROM pre
GROUP BY 1
), ins_item AS (
    INSERT INTO theitem (item_id, item_name) SELECT i, i::text FROM generate_series(1, (SELECT max FROM param)) as i RETURNING *
), ins_set AS (
INSERT INTO theset (set_id, set_name, items)
SELECT id, id::text, items FROM pre_arr WHERE (SELECT count(*) FROM ins_item) > 0
RETURNING *
)
SELECT
    'sets', count(*)
FROM ins_set
UNION ALL
SELECT
    'items', count(*)
FROM ins_item

;

This variant runs in less than 1 ms.

  • No problem. Storing 2300 set_item rows (max = 9) took 3m29s with your approach, which is unusable in the real world. With arrays, storing 53000 set_item rows (max = 13) still took less than 1 sec. – Roman Tkachuk Feb 28 '17 at 05:53