I have a list of strings/narratives which I need to compare and get a distance measure between each string. The current code I have written works but for larger lists it takes along time since I use 2 for loops. I have used the levenshtien distance to measure the distance between strings.
The list of strings/narratives is stored in a dataframe.
def edit_distance(s1, s2):
m=len(s1)+1
n=len(s2)+1
tbl = {}
for i in range(m): tbl[i,0]=i
for j in range(n): tbl[0,j]=j
for i in range(1, m):
for j in range(1, n):
cost = 0 if s1[i-1] == s2[j-1] else 1
tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)
return tbl[i,j]
def narrative_feature_extraction(df):
startTime = time.time()
leven_matrix = np.zeros((len(df['Narrative']),len(df['Narrative'])))
for i in range(len(df['Narrative'])):
for j in range(len(df['Narrative'])):
leven_matrix[i][j] = edit_distance(df['Narrative'].iloc[i],df['Narrative'].iloc[j])
endTime = time.time()
total = (endTime - startTime)
print "Feature Extraction (Leven) Runtime:" + str(total)
return leven_matrix
X = narrative_feature_extraction(df)
If the list has n narratives, the resulting X is a nxn matrix, where the rows are the narratives and the columns is what that narrative is compared to. For example, for the distance (i,j) it is the levenshtien distance between narrative i and j.
Is there a way to optimize this code so that there isn't a need to have so many for loops? Or is there a pythonic way of calculating this?