0

QUESTION: How can I find Materials within Document Number's which match at least by X-Percentage (e.g. >=50%), in other Document Number's?

CREATE TABLE:

CREATE COLUMN TABLE "SCHEMA"."MYTABLE" 
(
     "DOCUMENT" NVARCHAR(10) DEFAULT '' NOT NULL ,
     "POSNR" NVARCHAR(6) DEFAULT '000000' NOT NULL ,
     "MATERIAL" NVARCHAR(40) DEFAULT '' NOT NULL,
PRIMARY KEY (
     "DOCUMENT",
     "POSNR")
     ) UNLOAD PRIORITY 5 AUTO MERGE 
;

INSERT DATA:

INSERT INTO MYTABLE VALUES (100, '10', 'R3');
INSERT INTO MYTABLE VALUES (100, '20', '7000000');
INSERT INTO MYTABLE VALUES (100, '30', '7000010');
INSERT INTO MYTABLE VALUES (100, '40', '7000011');
INSERT INTO MYTABLE VALUES (100, '50', '7000160');

INSERT INTO MYTABLE VALUES (200, '10', 'SW');
INSERT INTO MYTABLE VALUES (200, '20', '7000000');
INSERT INTO MYTABLE VALUES (200, '30', '7000010');
INSERT INTO MYTABLE VALUES (200, '40', '7000011');
INSERT INTO MYTABLE VALUES (200, '50', '7000160');
INSERT INTO MYTABLE VALUES (200, '60', '7000036');
INSERT INTO MYTABLE VALUES (200, '70', '7000040');
INSERT INTO MYTABLE VALUES (200, '80', '7000066');
INSERT INTO MYTABLE VALUES (200, '90', '7000068');

INSERT INTO MYTABLE VALUES (300, '01', '7000160');
INSERT INTO MYTABLE VALUES (300, '11', '7000011');

INSERT INTO MYTABLE VALUES (400, '10', '7000033');
INSERT INTO MYTABLE VALUES (400, '20', '7000034');
INSERT INTO MYTABLE VALUES (400, '50', '7000068');
INSERT INTO MYTABLE VALUES (400, '60', '7000079');
Robin P.
  • 3
  • 3
  • Please provide `CREATE TABLE` and `INSERT` commands don’t that there are no ambiguities about the data structure. – Lars Br. Feb 26 '20 at 10:18
  • @LarsBr. I have updated my initial post with this information. Thanks – Robin P. Feb 26 '20 at 11:44
  • Thanks for the table&data definition. Now the question is not quite clear. Do you want to find _documents_ of which X-percentage of the number of materials is shared with other documents? Please show the desired output as well. Also: is any material unique within one document? e.g. can material 700000 only occur once in document 100? – Lars Br. Feb 27 '20 at 03:06
  • @LarsBr. sorry for the confusion. The materials are not unique, e.g. document 100 might contain for POSNR 60, Material 700000. In the above example I want to find the following: document 100 has material 700000, 700010, 700011, 700160 which all 4x are also found in document 200 with 4x hits ((100/9)*4=44,4% match), in document 300 I find 2x hits out of 2x material ((100/2)*2=100% match), in document 400 I find only 1x hit (7000068), which should give me (100/4)*1=25% match – Robin P. Feb 27 '20 at 13:21
  • The added description clarified the requirement a lot better but also means I have to adjust my solution. Will post it when I get to it (next couple of days). Meanwhile, it would be good if you would explain whether the `R3` and `SW` entries should be considered here or what the rule is for materials to be included. – Lars Br. Feb 28 '20 at 02:38
  • @LarsBr. I was out so could not check this yet. But big thank you! I will try this week :-) – Robin P. Mar 16 '20 at 11:34

1 Answers1

0

This can indeed be solved without the use of cursors.

with doc_elements 
(document, material, material_cnt)  
as  (select distinct
          document
        , material
        , count( *) OVER
            (PARTITION BY document) as MATERIAL_CNT
    from
        mytable
    )  
, matched_materials 
(document_a, material, material_b_cnt, document_b, match_cnt)  
as  (select
         side_a.document as document_a
       , side_a.material
       , side_b.material_cnt as material_a_cnt
       , side_b.document doc_b
       , count(*) OVER
            (PARTITION BY side_a.document, side_b.document) as match_cnt
    from 
                        doc_elements side_a
        left outer join doc_elements side_b
                on   side_a.material = side_b.material
                and side_a.document != side_b.document
     where 
            side_b.document IS NOT NULL
    )      
select distinct
    document_a
  --, material
  , document_b
  , material_b_cnt
  , match_cnt
  , round((100/material_b_cnt)*match_cnt, 2) as match_pct
from 
    matched_materials
order by
    document_a
  , document_b;

This statement returns:

DOCUMENT_A|DOCUMENT_B|MATERIAL_B_CNT|MATCH_CNT|MATCH_PCT|
----------|----------|--------------|---------|---------|
100       |200       |             9|        4|    44.44|
100       |300       |             2|        2|      100|

200       |100       |             6|        4|    66.67|
200       |300       |             2|        2|      100|
200       |400       |             4|        1|       25|

300       |100       |             6|        2|    33.33|
300       |200       |             9|        2|    22.22|

400       |200       |             9|        1|    11.11|

To keep this simple, I treated R3 and SW as regular materials.
The output only contains document mappings that have at least one material matching (see side_b.document IS NOT NULL condition in the matched_material common table expression).

Note, that the result specification in the comments contained an error:
Document 400 does not have a match, as material 7000068 is not part of materials for document 100.


I took this question as a prompt to write about this solution a bit more extensively and also included a review of query performance and tuning options.

See https://lbreddemann.org/matchmaker/ and https://lbreddemann.org/matchmaker-quick-quick/
for that.


Lars Br.
  • 9,949
  • 2
  • 15
  • 29