
I have a program that computes a pairwise distance matrix and then applies the k-means algorithm. I tested it on a small list and it works fine and fast; however, my original list is very big (>5000 items), so it takes forever and I ended up terminating the run. Can I use outer() or some other parallel function on the distance function to make this faster? On the small set that I have:

strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']

Its distance array (a 3D array) comes back like this:

[[[ 0.          0.25        0.47826087  1.          1.          0.89473684]
  [ 0.25        0.          0.36842105  1.          1.          0.86666667]
  [ 0.47826087  0.36842105  0.          1.          1.          0.90909091]
  [ 1.          1.          1.          0.          0.5         1.        ]
  [ 1.          1.          1.          0.5         0.          1.        ]
  [ 0.89473684  0.86666667  0.90909091  1.          1.          0.        ]]]

Each row of the array above holds the distances from one item of the strings list to all the others. My way of doing it with for loops is:

import numpy as np
import Levenshtein

strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']

data1 = []
for j in range(len(np.array(list(strings)))):
    for i in range(len(strings)):
        data1.append(1 - Levenshtein.ratio(np.array(list(strings))[j], np.array(list(strings))[i]))

#n = map(Levenshtein.ratio, strings)
#n = reduce(Levenshtein.ratio, strings)
#print(n)

k = len(strings)
data2 = np.asarray(data1)
arr_3d = data2.reshape((1, k, k))
print(arr_3d)

Here arr_3d is the array shown above. How can I use outer() or map() to replace the for loops above? When the strings list is big, this runs for hours and never even produces a result. I appreciate the help. Levenshtein.ratio is a built-in function in Python.
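For reference, a map()-based version of the same computation is sketched below, using itertools.product over all ordered pairs; as the comments below explain, this only hides the loop in library code rather than removing the quadratic number of Levenshtein.ratio calls (Levenshtein here is the python-Levenshtein package):

import itertools
import numpy as np
import Levenshtein

strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']
k = len(strings)

# map() over every ordered pair; still k*k calls to Levenshtein.ratio
pairs = itertools.product(strings, repeat=2)
data1 = list(map(lambda pair: 1 - Levenshtein.ratio(pair[0], pair[1]), pairs))

arr_3d = np.asarray(data1).reshape((1, k, k))
print(arr_3d)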

  • `reduce` and `map` won't make this any faster. Why are you doing `np.array(list(strings))[j]` instead of just `strings[j]`? – user2357112 May 26 '16 at 20:04
  • Also, `Levenshtein.ratio` is not a thing that comes with Python. Where is this function coming from? – user2357112 May 26 '16 at 20:05
  • That was left over from an earlier attempt to fix another error; it isn't necessary and can just be strings[j]. But what would make it faster then? – Lelo May 26 '16 at 20:06
  • It comes from the package called "Levenshtein", so I should have `import Levenshtein` at the very beginning. – Lelo May 26 '16 at 20:07
  • Using `map` does not mean the loop disappears; it just means it is not in your code. There is no magic trick here. – njzk2 May 26 '16 at 20:09
  • What about reduce()? I used that function in R and it makes things faster than for loops, but I don't know how to use it in Python. Any ideas? – Lelo May 26 '16 at 20:10
  • `reduce` won't help you either. This isn't even a reduction operation. The best you can do without switching technologies is to take out those unnecessary, hideously expensive `np.array(list(strings))`. You might be able to do somewhat better with Cython or C. – user2357112 May 26 '16 at 20:14
  • Not sure if that's the issue; I'm facing slowness even before adding those calls. I'm restricted to using Python. – Lelo May 26 '16 at 20:18
  • Can you show me how to do better with Cython? – Lelo May 26 '16 at 20:19
  • And when I wait, I'm getting a MemoryError. – Lelo May 26 '16 at 21:10

1 Answer

import numpy as np
import Levenshtein

strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']

k = len(strings)

data = np.zeros((k, k))

for i, string1 in enumerate(strings):
    for j, string2 in enumerate(strings):
        data[i, j] = 1 - Levenshtein.ratio(string1, string2)

print(data)

There are no gains to be had from map or reduce here; as @user2357112 mentions, the loops still need to run. However, this version is cleaner and should run faster, since it avoids the np.array(list(strings)) calls you were using throughout.
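If it is still too slow, note that the matrix you printed is symmetric with a zero diagonal, i.e. Levenshtein.ratio(a, b) == Levenshtein.ratio(b, a). A minimal sketch that computes only the upper triangle and mirrors it, roughly halving the number of ratio calls:

import numpy as np
import Levenshtein

strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']
k = len(strings)

data = np.zeros((k, k))           # diagonal stays 0: each string's distance to itself
for i in range(k):
    for j in range(i + 1, k):     # upper triangle only
        d = 1 - Levenshtein.ratio(strings[i], strings[j])
        data[i, j] = d
        data[j, i] = d            # mirror into the lower triangle

print(data)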

  • Thanks, that made it faster. Still slow, though, when the clustering operation comes. – Lelo May 27 '16 at 19:02
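Since the question also asked about parallelism: the rows of the distance matrix are independent, so they can be computed in worker processes with the standard-library multiprocessing module. Below is a minimal Python 3 sketch (distance_row is a hypothetical helper name); note that for k = 5000 the k × k float64 result alone takes 5000 × 5000 × 8 bytes ≈ 200 MB, which may be related to the MemoryError mentioned in the comments:

import numpy as np
import Levenshtein
from multiprocessing import Pool

# replace with the full list of >5000 strings
strings = ['cosine cos', 'cosine', 'cosine???????', 'l1', 'l2', 'manhattan']

def distance_row(i):
    # all distances from strings[i] to every string; rows are independent
    return [1 - Levenshtein.ratio(strings[i], s) for s in strings]

if __name__ == '__main__':
    with Pool() as pool:  # defaults to one worker per CPU core
        rows = pool.map(distance_row, range(len(strings)))
    data = np.asarray(rows)
    print(data)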