Fun question :)
So I have an idea and I think it is O(n*log(n) + n)
where +n
asymptotically irrelevant.
So I would propose something as follows:
tuple_len = 10
min_value = 1
max_value = 10
number_of_entries = 100
l = [[j] + [randint(min_value,max_value) for i in range(tuple_len)] for j in range(number_of_entries)]
Base set:
[[0, 9, 10, 3, 6, 3, 10, 9, 7, 8, 4],
[1, 2, 3, 6, 7, 9, 2, 5, 10, 6, 10],
[2, 5, 4, 10, 8, 5, 9, 2, 7, 4, 3],
[3, 5, 9, 4, 5, 5, 3, 10, 1, 4, 4],
[4, 9, 10, 9, 10, 9, 10, 6, 1, 6, 2],
[5, 5, 6, 3, 6, 9, 5, 8, 3, 1, 1],
[6, 9, 7, 5, 5, 5, 2, 1, 2, 3, 6],
[7, 2, 6, 9, 10, 5, 6, 7, 3, 7, 5],
[8, 6, 8, 9, 3, 7, 1, 2, 9, 8, 10],
[9, 7, 5, 7, 2, 1, 3, 7, 1, 2, 9],
[10, 1, 4, 4, 3, 6, 9, 6, 3, 3, 8],
[11, 8, 3, 10, 10, 5, 9, 7, 3, 4, 5],
...]
So I just used numbers for convenience and added the position in list as first value.
I propose to sort the dataset for each column of the data in turn, where sorting is O(n*log(n))
, and then add the positional value of all entries with the same value to a set, which is O(n)
. The result looks something like:
[{6, 18, 24, 26},
{22, 34},
{1, 6, 19, 31, 57, 58},
{1, 9, 18},
...}
This can be interpreted as Entry 6, 18, 24 and 26 have the same value in position 1.
Checking if two entries correspond is Ò(1)
with
true if (a in match_set) and (b in match_set) else false
Code example below:
match_sets = [set() for i in range(tuple_len)]
for position in range(tuple_len):
l = sorted(l, key= lambda x: x[position+1])
last_value = l[0][position+1]
for entry in range(number_of_entries):
if l[entry][position + 1] == last_value:
match_sets[position].add(l[entry][0])
last_value = l[entry][position + 1]