I have a list of strings/narratives that I need to compare pairwise to get a distance measure between each pair of strings. The current code I have written works, but for larger lists it takes a long time since I use two nested for loops. I use the Levenshtein distance to measure the distance between strings.

The list of strings/narratives is stored in a dataframe.

import time
import numpy as np

def edit_distance(s1, s2):
    m = len(s1) + 1
    n = len(s2) + 1

    tbl = {}
    for i in range(m): tbl[i, 0] = i
    for j in range(n): tbl[0, j] = j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i, j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)
    return tbl[m-1, n-1]

def narrative_feature_extraction(df):
    startTime = time.time()
    leven_matrix = np.zeros((len(df['Narrative']), len(df['Narrative'])))
    for i in range(len(df['Narrative'])):
        for j in range(len(df['Narrative'])):
            leven_matrix[i][j] = edit_distance(df['Narrative'].iloc[i], df['Narrative'].iloc[j])
    endTime = time.time()
    total = endTime - startTime
    print("Feature Extraction (Leven) Runtime: " + str(total))
    return leven_matrix


X = narrative_feature_extraction(df)

If the list has n narratives, the resulting X is an n x n matrix, where the rows are the narratives and the columns are the narratives they are compared to. For example, entry (i, j) is the Levenshtein distance between narrative i and narrative j.

Is there a way to optimize this code so that there isn't a need to have so many for loops? Or is there a pythonic way of calculating this?

97amarnathk
Bryce Ramgovind

1 Answer


It's hard to give exact code without data/examples, but a few suggestions:

  • Use list comprehensions; they are much faster than for ... in range ... loops.
  • Depending on your version of pandas, df[i][j] indexing can be very slow. Use .iloc or .loc instead. If you want to mix label- and position-based access, use .iloc[df.index.get_loc("itemname"), df.columns.get_loc("itemname")] to convert labels to positions properly. (I think it is only slow when you are getting warning flags for writing to a DataFrame slice, and it depends a lot on your version of Python/pandas, but I have not tested extensively.)
  • Better yet, run all the calculations first and then put the results into a DataFrame in one go, depending on your use case.
  • If you like the pythonic reading of for loops, at least avoid "in range" and iterate directly, e.g. for j in X[:, 0]. I find this to be faster in most cases, and you can combine it with enumerate to keep the index values (example below).
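For the second point, a minimal sketch of the .iloc/get_loc conversion (the frame and the labels "r2"/"b" here are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame; the labels are assumptions for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# Convert label-based lookups into purely positional access:
i = df.index.get_loc("r2")
j = df.columns.get_loc("b")
value = df.iloc[i, j]  # same element as df.loc["r2", "b"]
```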

Examples/timings:

import numpy as np
import pandas as pd

def test1():  # list comprehension
    X = np.random.normal(size=(100, 2))
    results = [[x*y for x in X[:, 0]] for y in X[:, 1]]
    df = pd.DataFrame(data=np.array(results))

def test2():  # enumerate, DataFrame built at the end
    X = np.random.normal(size=(100, 2))
    results = np.zeros((100, 100))
    for ind, i in enumerate(X[:, 0]):
        for col, j in enumerate(X[:, 1]):
            results[ind, col] = i*j
    df = pd.DataFrame(data=results)

def test3():  # "in range", but DataFrame built at the end
    X = np.random.normal(size=(100, 2))
    results = np.zeros((100, 100))
    for i in range(len(X)):
        for j in range(len(X)):
            results[i, j] = X[i, 0]*X[j, 1]
    df = pd.DataFrame(data=results)

def test4():  # current method: writing into the DataFrame cell by cell
    X = np.random.normal(size=(100, 2))
    df = pd.DataFrame(data=np.zeros((100, 100)))
    for i in range(len(X)):
        for j in range(len(X)):
            df[i][j] = X[i, 0]*X[j, 1]

if __name__ == '__main__':
    import timeit
    for name in ("test1", "test2", "test3", "test4"):
        print(name + ": " + str(timeit.timeit(name + "()", setup="from __main__ import " + name, number=10)))

output:

test1: 0.0492231889643
test2: 0.0587620022106
test3: 0.123777403419
test4: 12.6396287782

So the list comprehension is ~250 times faster than the current method, and enumerate is twice as fast as "for x in range". The real slowdown, though, is the element-by-element indexing of your DataFrame; even with .loc or .iloc that will still be your bottleneck, so I suggest working with arrays outside the DataFrame where possible.
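Applied to the Levenshtein case, one sketch along these lines fills a plain NumPy array outside the DataFrame and, since edit distance is symmetric, computes only the upper triangle and mirrors it. (distance_matrix is a hypothetical helper name, and the edit_distance function is reproduced from the question.)

```python
import numpy as np

def edit_distance(s1, s2):
    # Standard dynamic-programming Levenshtein distance (as in the question).
    m, n = len(s1) + 1, len(s2) + 1
    tbl = {}
    for i in range(m):
        tbl[i, 0] = i
    for j in range(n):
        tbl[0, j] = j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i, j] = min(tbl[i, j-1] + 1, tbl[i-1, j] + 1, tbl[i-1, j-1] + cost)
    return tbl[m-1, n-1]

def distance_matrix(narratives):
    # Work on a plain list/array outside the DataFrame; compute each pair once.
    n = len(narratives)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = edit_distance(narratives[i], narratives[j])
            out[i, j] = d
            out[j, i] = d  # mirror: the distance is symmetric
    return out

X = distance_matrix(["kitten", "sitting", "kitten"])
```

Passing something like df['Narrative'].tolist() in once avoids indexing the DataFrame inside the loops; the diagonal stays zero and each pair is computed only once, roughly halving the work.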

Hope this helps and you are able to apply it to your case. I'd also recommend reading up on the map, filter, and reduce (and maybe enumerate) functions, as they are quite quick and might help you: http://book.pythontips.com/en/latest/map_filter.html
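As a quick taste of those (toy numbers, not your narratives):

```python
# Toy illustration of map/filter alongside enumerate; the data is made up.
values = [3, 1, 4, 1, 5]
doubled = list(map(lambda v: v * 2, values))
evens = list(filter(lambda v: v % 2 == 0, values))
indexed = [(i, v) for i, v in enumerate(values) if v > 2]
```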

I'm not really familiar with your use case, but I don't see a reason why this type of code tuning wouldn't apply.

Vlox