
For background purposes: I'm using PostgreSQL with SQLAlchemy (Python).

Given a table of unique references like this:

references_table
-----------------------
  id | reference_code
-----------------------
   1 | CODEABCD1
   2 | CODEABCD2
   3 | CODEWXYZ9
   4 | CODEPOIU0
...

In a typical scenario, I would have a separate items table:

items_table
-----------------------
  id | item_descr
-----------------------
   1 | `Some item A`
   2 | `Some item B`
   3 | `Some item C`
   4 | `Some item D`
...

In such a typical scenario, the many-to-many relationship between references and items is captured in a junction table:

references_to_items
-----------------------
  ref_id (FK) | item_id (FK)
-----------------------
   1          | 4
   2          | 1
   3          | 2
   4          | 1
...

In that scenario, it is easy to model and obtain all references associated with the same item; for instance, item 1 has references 2 and 4 as per the table above. A sketch of this typical setup follows below.
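For illustration, a minimal SQLAlchemy sketch of that typical setup could look like the following (declarative style; class and relationship names are just placeholders, not my actual code):

    # Typical many-to-many: references <-> items via a junction table (names illustrative)
    from sqlalchemy import Column, ForeignKey, Integer, String, Table
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import relationship

    Base = declarative_base()

    references_to_items = Table(
        "references_to_items", Base.metadata,
        Column("ref_id", Integer, ForeignKey("references_table.id"), primary_key=True),
        Column("item_id", Integer, ForeignKey("items_table.id"), primary_key=True),
    )

    class Reference(Base):
        __tablename__ = "references_table"
        id = Column(Integer, primary_key=True)
        reference_code = Column(String, unique=True)
        items = relationship("Item", secondary=references_to_items, back_populates="references")

    class Item(Base):
        __tablename__ = "items_table"
        id = Column(Integer, primary_key=True)
        item_descr = Column(String)
        references = relationship("Reference", secondary=references_to_items, back_populates="items")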

However, in my scenario there is no items_table, yet I would still like to model the fact that some references refer to the same (non-represented) item.

I see a possibility of modelling that via a many-to-many junction table like the following (associating FKs of the references table with each other):

reference_similarities
-----------------------
  ref_id (FK) | ref_id_similar (FK)
-----------------------
   2          | 4
   2          | 8
   2          | 9
...

Where references with ID 2, 4, 8 and 9 would be considered 'similar' for the purposes of my data model.
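In SQLAlchemy terms, that self-referencing junction table could be sketched roughly as (column names as in the table above, everything else illustrative):

    # Both columns point back at references_table, so the table relates references to each other
    from sqlalchemy import Column, ForeignKey, Integer, MetaData, Table

    metadata = MetaData()

    reference_similarities = Table(
        "reference_similarities", metadata,
        Column("ref_id", Integer, ForeignKey("references_table.id"), primary_key=True),
        Column("ref_id_similar", Integer, ForeignKey("references_table.id"), primary_key=True),
    )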

However, the inconvenience here is that such a model requires choosing one reference (id=2 above) as a 'pivot', to which multiple others are declared 'similar' in the reference_similarities table: ref 2 is similar to 4, and ref 2 is similar to 8, therefore 4 is similar to 8.

So the question is: is there a better design that doesn't involve having a 'pivot' FK as above?

Ideally, I would store the 'similarity' as an array of FKs, like so:

reference_similarities
------------------------
id  |  ref_ids (Array of FKs)
------------------------
  1 | [2, 4, 8, 9]
  2 | [1, 3, 5]

...but I understand from https://dba.stackexchange.com/questions/60132/foreign-key-constraint-on-array-member that foreign keys on array members are currently not possible in PostgreSQL. So I'm trying to figure out a better design for this model.
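For completeness, the array variant is easy enough to declare with SQLAlchemy's PostgreSQL ARRAY type, but the elements are then plain integers with no referential integrity (sketch only):

    # Array-of-IDs variant: PostgreSQL cannot enforce that each element of ref_ids
    # actually exists in references_table.id
    from sqlalchemy import Column, Integer, MetaData, Table
    from sqlalchemy.dialects.postgresql import ARRAY

    metadata = MetaData()

    reference_similarities = Table(
        "reference_similarities", metadata,
        Column("id", Integer, primary_key=True),
        Column("ref_ids", ARRAY(Integer)),  # no ForeignKey possible on array members
    )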

Jean Monet

1 Answer


I understand that you want to group items into a set and be able to query the set from any item in it.

You can use a hash function to hash the set, then use the hash as the pivot value. For example, if you have a set of values (2, 4, 8, 9), it will be hashed like this:

hash = (((31*1 + 2)*31 + 4)*31 + 8)*31 + 9

You can refer to Arrays.hashCode in Java to see how to hash a list of values:

    // Core loop of java.util.Arrays.hashCode(Object[] a): fold each element into the result
    int result = 1;
    for (Object element : a)
        result = 31 * result + (element == null ? 0 : element.hashCode());
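In Python, a rough equivalent of that loop (sorting the IDs first so the result does not depend on insertion order; the function name is just illustrative) could be:

    # Order-insensitive variant of the Java hash above, applied to integer reference IDs
    def hash_ref_ids(ref_ids):
        result = 1
        for ref_id in sorted(ref_ids):
            result = 31 * result + ref_id
        return result

    # hash_ref_ids([2, 4, 8, 9]) == 987204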

Table reference_similarities:

reference_similarities
-----------------------
  ref_id (FK) | hash_value
-----------------------
   2          | hash(2, 4, 8, 9) = 987204
   4          | 987204
   8          | 987204
   9          | 987204

To query the set, you first look up the hash_value for a given ref_id, then fetch all ref_ids sharing that hash_value.
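For example, with an assumed ORM mapping of reference_similarities and an open SQLAlchemy session, that two-step lookup could be sketched as:

    # All names assumed: look up the set's hash for one ref_id, then fetch every member
    from sqlalchemy import BigInteger, Column, ForeignKey, Integer
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class ReferenceSimilarity(Base):
        __tablename__ = "reference_similarities"
        ref_id = Column(Integer, ForeignKey("references_table.id"), primary_key=True)
        hash_value = Column(BigInteger, index=True)

    def similar_refs(session, ref_id):
        """Return all ref_ids belonging to the same set as the given ref_id."""
        hash_value = (session.query(ReferenceSimilarity.hash_value)
                      .filter(ReferenceSimilarity.ref_id == ref_id)
                      .scalar())
        return [r for (r,) in session.query(ReferenceSimilarity.ref_id)
                                     .filter(ReferenceSimilarity.hash_value == hash_value)]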

The drawback of this solution is that every time you add a new value to a set, you have to rehash the whole set.

Another option is to simply write a function in Python that produces a unique hash_value whenever a new set is created.
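For instance, a sketch of such a function, which just masks a random UUID down to a positive 63-bit integer so it fits a PostgreSQL bigint column (purely illustrative):

    # Hand out a fresh, unique group value per set instead of hashing its members
    import uuid

    def new_set_value():
        return uuid.uuid4().int & ((1 << 63) - 1)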

Andy Nghi
  • Interesting solution! I will try to implement it in Python in the following days and see how it works (for 'hashing' I am thinking of using Base64 on the `repr` of an ordered list). An alternative to this structure would be to use the `reference_similarities` table with an array of IDs (which are not FKs, since that is not supported) [actually the array would not even be necessary, I think], and instead of having a hash value, insert the `id` of `reference_similarities` as an FK in `references_table` – Jean Monet Mar 17 '20 at 12:57