0

Simple set-up: I have a list (roughly 40,000 entries) containing lists of strings (each with 2-15 elements). I want to compare all of the sublists to check if they have a common element (they share at most one). At the end, I want to create a dictionary (graph if you wish) where the index of each sublist is used as a key, and its values are the indices of the other sublists with which it shares common elements.

For example

lst = [['dam', 'aam','adm', 'ada', 'adam'], ['va','ea','ev','eva'], ['va','aa','av','ava']]

should give the following:

dic = {0: [], 1: [2], 2: [1]}

My problem is that I found a solution, but it's very computationally expensive. First, I wrote a function to compute the intersection of two lists:

def intersection(lst1, lst2): 

    temp = set(lst2) 
    lst3 = [value for value in lst1 if value in temp] 
    return lst3 

Then I would loop over all the lists to check for intersections:

dic = {}
iter_range = range(len(lst))

#loop over all lists where k != i
for i in iter_range:

    #create range that doesn't contain i
    new_range = list(iter_range)
    new_range.remove(i)

    lst = []

    for k in new_range:
        #check if the lists at position i and k intersect
        if len(intersection(mod_names[i], mod_names[k])) > 0:
            lst.append(k)

    # fill dictionary 
    dic[i] = lst

I know that for loops are slow, and that I'm looping over the list unnecessarily often (in the above example, I compare 1 with 2, then 2 with 1), but I don't know how to change it to make the program run faster.

  • Can you explain how the output is supposed to be generated? What are you computing the intersections of? I only see substrings. – cs95 Jan 06 '19 at 17:49
  • The output should be in the 'dic'. From the example: I have three lists, and I do pairwise comparison to see if they contain the same string. if that's the case, 'intersection' spits out a non-zero list, and I append its index k to my list 'lst'. Does this make it clearer? – archipelagic Jan 06 '19 at 18:04
  • No, because I see in the first list, "dam" and "ada" have common elements with "adam", so why are there no entries for `dic` there? – cs95 Jan 06 '19 at 18:05
  • What I compare is if the exact same string appears in another list, e.g. 'va' appears in both list 2 and 3 (or 1 and 2, if you use Python's indexing). And note that I don't compare a list with itself. – archipelagic Jan 07 '19 at 09:23

1 Answers1

1

You can create a dict word_occurs_in which will store data which word occurs in which lists, for your sample that would be:

{'dam': [0], 'aam': [0], 'adm': [0], 'ada': [0], 'adam': [0], 'va': [1, 2], 'ea': [1], 'ev': [1], 'eva': [1], 'aa': [2], 'av': [2], 'ava': [2]}

Then you can create a new dict, let's call it result, in which you should store the final result, e.g. {0: [], 1: [2], 2: [1]} in your case.

Now, to get result from word_occurs_in, you should traverse the values of word_occurs_in and see if the list has more then one element. If it does, then you just need add all other values except the value of the currently observed key in result. For instance, when checking the value [1, 2] (for key 'va'), you' will add 1 to the value corresponding to 2 in the result dict and will add 2 to the value corresponding to key 1. I hope this helps.

In my understanding, the biggest complexity to your code comes from iterating the list of 40K entries twice, so this approach iterates the list only once, but uses a bit more space.

Maybe I didn't explain myself sufficiently, so here is the code:

from collections import defaultdict

lst = [['dam', 'aam', 'adm', 'ada', 'adam'], ['va', 'ea', 'ev', 'eva'], ['va', 'aa', 'av', 'ava']]

word_occurs_in = defaultdict(list)

for idx, l in enumerate(lst):
    for i in l:
        word_occurs_in[i].append(idx)

print(word_occurs_in)

result = defaultdict(list)
for v in word_occurs_in.values():
    if len(v) > 1:
        for j in v:
            result[j].extend([k for k in v if k != j])

print(result)
giliev
  • 2,938
  • 4
  • 27
  • 47