0

Given a table with dozens of columns, many thousands of rows and no known functional dependencies between the columns, is there any algorithm to automatically find a candidate key for that table that is faster or more efficient than this:

  1. Check if any single column is a candidate key
  2. Check if any 2 columns are a candidate key

  3. Check if any 3 columns are a candidate key

... and so on until a key is found?

Marco
  • 313
  • 1
  • 2
  • 11
  • What do you mean, "no known functional dependencies"? That can't be right, there are always trivial FDs. Do you mean, no non-trivial FDs hold? Why do you say that? What exactly are you given? (You're sure? That means no given superkeys, CKs, PKs or UNIQUEs, because those imply that certain FDs hold & don't hold.) Then its only CK is the set of all attributes. Do you mean, they might hold but you don't know? You need to be able to determine the set of all FDs that hold to find CKs. The set of all attributes is always a superkey. What do you mean, "check"? How do you expect to do that? – philipxy Apr 29 '21 at 19:45

1 Answers1

0

The primary key of the table has to be unique. If your dataset grows over time it is not sufficient to check whether the columns are now unique. They will have to be unique for ever. So in this case I would avoid a natural PK and choose a surrogate key.

If you are 100% sure that there must be a natural key but you just don't know the columns you start with determining the distinct cardinality of the colums:

SELECT COUNT(*),
       COUNT(DISTINCT column1),
       COUNT(DISTINCT column3),
       ...
FROM table

If for any column the distinct number of values is equal to the total number of rows you have found the key. If it is multi columns you need to take a try-and-error approach. You know that the product of the distinct column values of your key candidates has to be greater or equal to the total number of rows

COUNT(DISTINCT key_column1) * COUNT(DISTINCT key_column2) * COUNT(DISTINCT key_column3) >=  COUNT(*)

Knowing this you can look for plausible combinations like so

SELECT COUNT(DISTINCT key_column1 || key_column2 || key_column3), COUNT(*)
FROM table

If the table has only a couple of thousand rows performance should not be an issue.

I gutes you can make this fully automatic by using the data dictionary but I have no solution at hand.

fhossfel
  • 2,041
  • 16
  • 24
  • Thanks for replying. To clarify: The data set doesn't grow and I'm sure there's a natural key but don't know which. I know the whole checking if the no. of distinct values in a set of columns equals the no. of distinct rows in the table and optimizing with the multiplication criterion but there's a growing number of combinations as you check for larger subsets. Also, though I didn't mention it, I wanna do it on multiple tables (about a 100), thus the need for an algorithm, that finds a key more efficiently than the "bruteforcish" way I described, without me having to check individuals tables. – Marco Jun 29 '17 at 23:52