1

I have following data:

ID  --- GRP_ID  --- REC_VAL
1   --- 1       --- A
2   --- 2       --- A
3   --- 2       --- B
4   --- 3       --- A
5   --- 3       --- B
6   --- 3       --- C
7   --- 4       --- A
8   --- 4       --- B
9   --- 4       --- C
10  --- 5       --- A
11  --- 5       --- B
12  --- 5       --- E

Is there a way how to find id of record groups that have same values ? (in this case only grp_id 3 and 4 have same values)

Second question:

Is there effecient way how to find exact grp_id when i had a set of values ? My solution is not very quick because table with groups has over 6mil. records:

-- Large table - up to 6m records
create table tmp_grp (id number, grp_id number, rec_val varchar2(10));
--
insert into tmp_grp(id, grp_id, rec_val) values (1, 1, 'A');
insert into tmp_grp(id, grp_id, rec_val) values (2, 2, 'A');
insert into tmp_grp(id, grp_id, rec_val) values (3, 2, 'B');
insert into tmp_grp(id, grp_id, rec_val) values (4, 3, 'A');
insert into tmp_grp(id, grp_id, rec_val) values (5, 3, 'B');
insert into tmp_grp(id, grp_id, rec_val) values (6, 3, 'C');
insert into tmp_grp(id, grp_id, rec_val) values (7, 4, 'A');
insert into tmp_grp(id, grp_id, rec_val) values (8, 4, 'B');
insert into tmp_grp(id, grp_id, rec_val) values (9, 4, 'C');
insert into tmp_grp(id, grp_id, rec_val) values (10, 5, 'A');
insert into tmp_grp(id, grp_id, rec_val) values (11, 5, 'B');
insert into tmp_grp(id, grp_id, rec_val) values (12, 5, 'E');
commit;
--
-- CTE representing record group for asking  
WITH datrec AS
 (SELECT 'A' rec FROM dual UNION ALL 
  SELECT 'B' rec FROM dual)
--
SELECT x.grp_id
  FROM (
  -- Count of joined records 
  SELECT COUNT(1) cnt, t.grp_id
          FROM tmp_grp t
          JOIN datrec d
            ON d.rec = t.rec_val
         GROUP BY t.grp_id
  -- 
  ) x
 WHERE 
 -- Count of all data records
 x.cnt = (SELECT COUNT(1) FROM datrec)
 -- Count of all group records
 AND x.cnt = (SELECT COUNT(1) FROM tmp_grp g WHERE x.grp_id = g.grp_id);
 --

This question is similar to Find group of records that match multiple values , but this topic only cover exact set of values (number of values and values in column rec of datrec will be provided by another query) and query return groups which contains this set. I need to return only exact match.

UPDATE - added data in table for better clarification

Also related to How to compare groups of tuples in sql

Community
  • 1
  • 1
  • Can there be (1) `null` and/or (2) duplicate values for `rec_val` for the same `grp_id` in the table? If so, what is the requirement when those happen? –  Mar 06 '17 at 16:01
  • Nulls nor duplicates for same grp_id won't be allowed (in real situation rec_val is sort of a FK) – Otakar Křížek Mar 06 '17 at 16:06
  • Please add the data to your INSERT statements too. –  Mar 06 '17 at 16:37
  • Can there be any number of equal groups? – BobC Mar 06 '17 at 16:46
  • mathguy - insert updated, thank you BobC - there can be multiple numbers of equal groups – Otakar Křížek Mar 06 '17 at 16:51
  • It would help to know a little more about the data. For example, one strategy may be to match `grp_id` by count of `rec_val` first, in a first pass, and then only check the actual values only for the pairs that match on count. This may help a lot if the distribution of counts is very varied, but it will not help (and actually slow things down) if man `grp_id` have the same count of `rec_val`. –  Mar 06 '17 at 16:51
  • Comparison numbers of records probably won't help, but i can try to implement "with" and "without" and benchmark it. Most grp_id have 2 records, following by 3 records and so on. I can choose strategy to "group" many records into small unique set or "group" up to 60 records into large unique set. – Otakar Křížek Mar 06 '17 at 17:08
  • OK, in that case the overhead of comparing by counts will probably be greater than the benefit. I added another suggestion for possible efficiency improvement below my answer. –  Mar 06 '17 at 17:09

1 Answers1

1

Here is a way that avoids joining the base table to itself. It will be more efficient especially if there are several (many?) possible values of rec_val for each grp_id. It can be made faster still if the distinct grp_id already exist somewhere in your data; I create them on the fly.

with gid ( grp_id ) as (
       select distinct grp_id from tmp_grp
     ),
     prep ( grp_id_1, grp_id_2, rec_val ) as (
       select t.grp_id, g.grp_id, t.rec_val
         from tmp_grp t join gid g on t.grp_id < g.grp_id
       union all
       select g.grp_id, t.grp_id, t.rec_val
         from gid g join tmp_grp t on g.grp_id < t.grp_id
     ),
     counts ( grp_id_1, grp_id_2, cnt ) as (
       select   grp_id_1, grp_id_2, count(*)
       from     prep
       group by grp_id_1, grp_id_2, rec_val
     )
select   grp_id_1, grp_id_2
from     counts
group by grp_id_1, grp_id_2
having min(cnt) = 2
;
  • @Otakar: Possible things to try to improve efficiency: (1) In order to read the base table just once, join the CTE `gid` to the base table `tmp_grp` only once, as I did in CTE `prep`, on the only condition that the `grp_id` is *different* between the two tables/CTE; then group by `least` and `greatest` of the `grp_id` in `counts`. This way the base table is read only once in `prep` instead of twice (no need for `union all`). (2) If groups by `grp_id` may have a large variety of *counts* for `rec_val`, add the counts into the picture and add equality of counts in join condition. –  Mar 06 '17 at 17:08
  • Thank you, i'll use the "only once" solution and for the counting, i'll try both variations for benchmark and choose the better one. – Otakar Křížek Mar 06 '17 at 17:11