I have an indexed pandas Series with 20k entries. Each entry is a list of strings.
id | value
0 | ['abc', 'abc', 'def']
1 | ['bac', 'c', 'def', 'a']
2 | ...
...|
20k| ['aaa', 'rzt']
I want to compare each entry (a list of strings) with every other entry of the series. I have a complex comparison function that takes two lists of strings and returns a float.
The result should be a matrix.
id | 0   | 1   | 2   | ... | 20k
0  | 1   | 0.5 | 0.4 |
1  | 0.5 | 1   | 0.2 |
2  | 0.4 | 0.2 | 1   |
...|
20k|
A double loop that computes every matrix element takes more than 3 hours on my computer. How can I apply or parallelise my comparison function efficiently? I tried broadcasting with NumPy arrays, but it gave no speedup.
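import numpy as np

values = df['value'].values  # object array, one list of strings per entry
# Pair every entry with every other entry; the broadcast shape is (n, n)
broadcasted = np.broadcast(values, values[:, None])
result = np.empty(broadcasted.shape)
# compare_function is still called once per pair in Python
result.flat = [compare_function(u, v) for (u, v) in broadcasted]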
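For comparison, here is a minimal sketch of a loop that calls the function once per unordered pair instead of once per matrix cell. It assumes the comparison is symmetric, which the example matrix suggests but which may not hold for the real function. The `compare_function` below is only a placeholder (Jaccard similarity on the unique strings), and `pairwise_matrix` is a hypothetical helper name, not part of pandas or NumPy.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def compare_function(u, v):
    """Placeholder (assumption): Jaccard similarity of the unique strings."""
    a, b = set(u), set(v)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_matrix(series, func):
    """Fill an n x n matrix, calling func once per unordered pair.

    Assumes func(u, v) == func(v, u), so only the upper triangle is computed
    and then mirrored into the lower triangle.
    """
    values = series.tolist()
    n = len(values)
    result = np.empty((n, n), dtype=float)
    # Diagonal: each entry compared with itself
    for i in range(n):
        result[i, i] = func(values[i], values[i])
    # Upper triangle, mirrored into the lower triangle
    for i, j in combinations(range(n), 2):
        result[i, j] = result[j, i] = func(values[i], values[j])
    return pd.DataFrame(result, index=series.index, columns=series.index)


# Tiny example built from the sample rows in the question
s = pd.Series([['abc', 'abc', 'def'], ['bac', 'c', 'def', 'a'], ['aaa', 'rzt']])
matrix = pairwise_matrix(s, compare_function)
print(matrix)
```

This roughly halves the number of calls but remains a pure Python loop. Since the pairs are independent of each other, the same loop could in principle be split across several processes, which is not shown here.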