2

I have around 500,000 arrays of 10 words i.e. 500,000 word 10-grams. For every 10-gram, I need to know in which positions, if any, the remaining 499,999 10-grams have identical elements:

a = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

b = ['A', 'M', 'C', 'M', 'E', 'M', 'G', 'M', 'I', 'M']

...

z = ['R', 'R', 'R', 'R', 'R', 'F', 'G', 'H', 'I', 'J']

If we use a 1 for positions where the two arrays contain the same word and a 0 for positions where they contain different words, the intersection of a with b would be represented as [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]; the intersection of a with z would be represented as [0, 0, 0, 0, 0, 1, 1, 1, 1, 1], etc.

Can we do better than a naive O(n**2) algorithm, i.e. one for loop within another for loop?

melanuria
  • 21
  • 3
  • 1
    It seems that you want to fill a n x n matrix n x n, I do not see how this could be done in less than O(n**2). May be you should switch to an algorithm that does not require pairwise comparison. – hpchavaz Mar 03 '22 at 18:59
  • 1
    I tried several (much more efficient) algorithms but none of them came close to the wonderful results I obtain from this pairwise-comparison algorithm. – melanuria Mar 03 '22 at 20:10

1 Answers1

0

Fun question :)

So I have an idea and I think it is O(n*log(n) + n) where +n asymptotically irrelevant.

So I would propose something as follows:

tuple_len = 10
min_value = 1
max_value = 10
number_of_entries = 100
l = [[j] + [randint(min_value,max_value) for i in range(tuple_len)] for j in range(number_of_entries)]

Base set:

[[0, 9, 10, 3, 6, 3, 10, 9, 7, 8, 4],
 [1, 2, 3, 6, 7, 9, 2, 5, 10, 6, 10],
 [2, 5, 4, 10, 8, 5, 9, 2, 7, 4, 3],
 [3, 5, 9, 4, 5, 5, 3, 10, 1, 4, 4],
 [4, 9, 10, 9, 10, 9, 10, 6, 1, 6, 2],
 [5, 5, 6, 3, 6, 9, 5, 8, 3, 1, 1],
 [6, 9, 7, 5, 5, 5, 2, 1, 2, 3, 6],
 [7, 2, 6, 9, 10, 5, 6, 7, 3, 7, 5],
 [8, 6, 8, 9, 3, 7, 1, 2, 9, 8, 10],
 [9, 7, 5, 7, 2, 1, 3, 7, 1, 2, 9],
 [10, 1, 4, 4, 3, 6, 9, 6, 3, 3, 8],
 [11, 8, 3, 10, 10, 5, 9, 7, 3, 4, 5],
...]

So I just used numbers for convenience and added the position in list as first value.

I propose to sort the dataset for each column of the data in turn, where sorting is O(n*log(n)), and then add the positional value of all entries with the same value to a set, which is O(n). The result looks something like:

[{6, 18, 24, 26},
 {22, 34},
 {1, 6, 19, 31, 57, 58},
 {1, 9, 18},
...}

This can be interpreted as Entry 6, 18, 24 and 26 have the same value in position 1. Checking if two entries correspond is Ò(1) with

true if (a in match_set) and (b in match_set) else false

Code example below:

match_sets = [set() for i in range(tuple_len)]


for position in range(tuple_len):
    l = sorted(l, key= lambda x: x[position+1])
    last_value = l[0][position+1]
    for entry in range(number_of_entries):
        if l[entry][position + 1] == last_value:
            match_sets[position].add(l[entry][0])
            last_value = l[entry][position + 1]
        
FloLie
  • 1,820
  • 1
  • 7
  • 19
  • I may be wrong but this does not seems to answer the question. In what time the n x n matrix can be filled from `match_sets `? BTW, see my comment under the question. – hpchavaz Mar 03 '22 at 19:04
  • 1
    You are right, in that it does not fill the matrix in that time, but filling a nxn matrix can never be faster than ```O(n²)```. However, the list of sets does contain the same information, with similar access complexity, therefore it may still be of interest – FloLie Mar 03 '22 at 19:29