
I have an indexed Pandas Series with 20k entries. Each entry is an array of strings.

id | value

0  | ['abc', 'abc', 'def']
1  | ['bac', 'c', 'def', 'a']
2  | ...
...|
20k| ['aaa', 'rzt']

I want to compare each entry (a list of strings) with every other entry of the series. I have a complex comparison function which takes two lists of strings and returns a float.

The result should be a matrix.

id | 0  |  1  |  2  | ... | 20k

0  | 1    0.5   0.4
1  | 0.5   1    0.2
2  | 0.4  0.2    1
...|
20k|

A double loop that computes every matrix element takes more than 3 hours on my computer. How can I efficiently apply or parallelise my comparison function? I tried broadcasting with NumPy arrays, but it gave no speedup:

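import numpy as np

# Broadcasting only pairs the entries up; compare_function is still called n*n times in Python.
values = df['value'].values
broadcasted = np.broadcast(values, values[:, None])
result = np.empty(broadcasted.shape)
result.flat = [compare_function(u, v) for (u, v) in broadcasted]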
Kerjo
  • Depends on your function. I'm skeptical that a complex function that takes two lists of strings and returns a float can be "vectorized" in the same way that x+4 is vectorized. If it's some sort of fuzzy match, you have 400M calculations to run, so it won't be fast. – ALollz Mar 11 '19 at 17:04
  • Well, in this case roughly 200M (200M - 10,000 off-diagonal pairs), since it seems to be symmetric – ALollz Mar 11 '19 at 17:14
  • The chance of a speedup depends mostly on the form of the `compare_function` you want to apply. You can explicitly parallelize across CPU cores if the calculations are independent. Also, if the `compare_function` is symmetric, you only need to calculate one triangle of the matrix and copy it to the other triangle. To apply both, you can use the splitting suggested in [this question](https://stackoverflow.com/questions/46237201/how-to-split-diagonal-matrix-into-equal-number-of-items-each-along-one-of-axis). None of that is necessarily something `pandas` supports out of the box. – sophros Mar 11 '19 at 17:58
  • Ty. The calculations are independent and the function is indeed symmetric. It computes a specific distance between the two lists. I'll try explicit parallelism. I was hoping numpy/pandas had a way to apply an independent function using multiple cores. – Kerjo Mar 11 '19 at 21:52
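
The two suggestions in the comments (compute only one triangle, and spread the independent calculations across CPU cores) could be combined roughly as follows. This is only a sketch under assumptions: `pairwise_matrix`, `_init_worker`, `_rows`, and the `n_chunks` parameter are hypothetical names, and `jaccard` is a stand-in for the real comparison function, which the question does not show. It uses `concurrent.futures.ProcessPoolExecutor` from the standard library, so the comparison function must be a picklable top-level function.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

_VALUES = None  # list of entries, set once per worker process
_FUNC = None    # comparison function, set once per worker process


def _init_worker(values, func):
    """Store the shared inputs in each worker so they are not re-sent per task."""
    global _VALUES, _FUNC
    _VALUES, _FUNC = values, func


def _rows(row_indices):
    """Compute the upper-triangle part (j >= i) of every row in row_indices."""
    n = len(_VALUES)
    return [(i, [_FUNC(_VALUES[i], _VALUES[j]) for j in range(i, n)])
            for i in row_indices]


def pairwise_matrix(values, func, workers=None, n_chunks=64):
    """Symmetric comparison matrix; workers=1 runs serially in this process."""
    values = list(values)
    n = len(values)
    result = np.empty((n, n))
    # Interleave rows so each chunk mixes long (early) and short (late) rows.
    chunks = [range(k, n, n_chunks) for k in range(min(n_chunks, n))]
    if workers == 1:
        _init_worker(values, func)
        parts = [_rows(chunk) for chunk in chunks]
    else:
        with ProcessPoolExecutor(max_workers=workers,
                                 initializer=_init_worker,
                                 initargs=(values, func)) as pool:
            parts = list(pool.map(_rows, chunks))
    for part in parts:
        for i, row in part:
            result[i, i:] = row  # upper triangle, including the diagonal
            result[i:, i] = row  # mirror into the lower triangle
    return result


def jaccard(u, v):
    """Stand-in comparison (assumption): Jaccard similarity of the two string sets."""
    a, b = set(u), set(v)
    return len(a & b) / len(a | b) if (a or b) else 1.0
```

In a script this would be called under an `if __name__ == "__main__":` guard, for example `result = pairwise_matrix(df['value'], compare_function)`. Computing one triangle roughly halves the number of calls, and the process pool can at best divide the remaining time by the number of cores, so the total is still about 200M calls to the comparison function.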

0 Answers