I have two dataframes loaded from the two attached files, and I want to compute the Jaro-Winkler similarity between the tokens in them. I am using the code below.
from similarity.jarowinkler import JaroWinkler
jarowinkler = JaroWinkler()
df_gt['jarowinkler_sim'] = [jarowinkler.similarity(x.lower(), y.lower()) for x, y in zip(df_ex['abstract_ex'], df_gt['abstract_gt'])]
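To make explicit what each call scores, here is a minimal pure-Python sketch of Jaro-Winkler. It is not the library's own code: the function names, the 0.1 prefix scale, the 4-character prefix cap, and the 0.7 boost threshold are assumptions based on the usual definition, and the library may differ in details. Because `zip` pairs tokens by position, a swapped row ends up comparing 'can' with 'interesting', and this sketch gives the same 0.474747 that shows up in my output for that pair.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matched1 = []
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]
    # Half the number of matched characters that are out of order.
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1,
                 boost_threshold: float = 0.7) -> float:
    """Jaro similarity with a bonus for a shared prefix (assumed defaults)."""
    j = jaro(s1, s2)
    if j <= boost_threshold:
        return j
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


print(round(jaro_winkler("can", "interesting"), 6))  # 0.474747
print(jaro_winkler("can", "can"))                    # 1.0
```

Only one character ('n') matches between the two tokens, so the score is (1/3 + 1/11 + 1)/3.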
I am facing two problems:
1. The order of the tokens is not handled. When the tokens 'can' and 'interesting' swap positions in one file, the similarity for those rows is computed incorrectly:
Unnamed: 0 abstract_gt jarowinkler_sim
0 0 Bipartite 1.000000
1 1 fluctuations 0.914141
2 2 can 0.474747 <--|
3 3 provide 1.000000 |-- Position swapped in one file
4 4 interesting 0.474747 <--|
5 5 information 1.000000
6 6 about 1.000000
7 7 entanglement 1.000000
8 8 properties 1.000000
9 9 and 1.000000
10 10 correlations 1.000000
2. The two dataframes might not always be the same size. When one of them contains fewer elements, my solution raises an error:
ValueError: Length of values (10) does not match length of index (11)
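As far as I can tell, the length mismatch comes from `zip`, which stops at the shorter input, so the list comprehension yields 10 scores for an 11-row dataframe. The sketch below reproduces that with plain lists. The shortened `ex_tokens` list is a hypothetical example, and padding with `itertools.zip_longest` is only one possible workaround, not necessarily the right one.

```python
from itertools import zip_longest

gt_tokens = ["Bipartite", "fluctuations", "can", "provide", "interesting",
             "information", "about", "entanglement", "properties", "and",
             "correlations"]
# Hypothetical shorter extraction: the last token is missing.
ex_tokens = gt_tokens[:-1]

# zip() silently truncates to the shorter list: 10 pairs for 11 rows.
truncated = list(zip(ex_tokens, gt_tokens))

# zip_longest() pads the shorter list, so every row gets a pair.
padded = list(zip_longest(ex_tokens, gt_tokens, fillvalue=""))

print(len(truncated), len(padded))
print(padded[-1])
```

The padded pairs still need an explicit rule, for example scoring any pair with an empty side as 0.0, and padding does nothing for the ordering problem.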
How can I solve these two problems and compute the similarity accurately?
Thanks !!
TSV FILES
1. df_ex
abstract_ex
0 Bipartite
1 fluctuations
2 interesting
3 provide
4 can
5 information
6 about
7 entanglement
8 properties
9 and
10 correlations
2. df_gt
abstract_gt
0 Bipartite
1 fluctuations
2 can
3 provide
4 interesting
5 information
6 about
7 entanglement
8 properties
9 and
10 correlations
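One direction that might address both problems is to stop pairing by position and instead score each ground-truth token against its best match anywhere in the extracted list. The sketch below is only an illustration under assumptions: `best_match_scores` is a hypothetical helper, and `difflib.SequenceMatcher` stands in for Jaro-Winkler so the example runs with the standard library alone.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Stand-in similarity; swap in a Jaro-Winkler function as needed."""
    return SequenceMatcher(None, a, b).ratio()


def best_match_scores(gt_tokens, ex_tokens, similarity=ratio):
    """For each ground-truth token, keep its best score over all extracted tokens."""
    ex_lower = [ex.lower() for ex in ex_tokens]
    scores = []
    for gt in gt_tokens:
        best = max((similarity(gt.lower(), ex) for ex in ex_lower), default=0.0)
        scores.append(best)
    return scores


gt_tokens = ["Bipartite", "fluctuations", "can", "provide", "interesting",
             "information", "about", "entanglement", "properties", "and",
             "correlations"]
ex_tokens = ["Bipartite", "fluctuations", "interesting", "provide", "can",
             "information", "about", "entanglement", "properties", "and",
             "correlations"]

# Swapped tokens still find their exact match, so every score is 1.0.
print(best_match_scores(gt_tokens, ex_tokens))

# A shorter extracted list still yields one score per ground-truth token.
print(len(best_match_scores(gt_tokens, ex_tokens[:-1])))
```

The result always has one score per ground-truth token, so it could be assigned to `df_gt['jarowinkler_sim']`, and passing `jarowinkler.similarity` as the `similarity` argument should keep the original metric. The trade-off is that order is ignored entirely and one extracted token can serve as the best match for several ground-truth tokens.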