
I have two dataframes extracted from the two attached files. I want to compute the Jaro-Winkler similarity between the tokens in the files. I am using the code below.

from similarity.jarowinkler import JaroWinkler
jarowinkler = JaroWinkler()
df_gt['jarowinkler_sim'] = [jarowinkler.similarity(x.lower(), y.lower()) for x, y in zip(df_ex['abstract_ex'], df_gt['abstract_gt'])]

I am facing two problems:

1. Token order is not handled. When the positions of the tokens 'can' and 'interesting' are swapped in one file, the similarity index is computed incorrectly.

    Unnamed: 0   abstract_gt  jarowinkler_sim
0            0     Bipartite         1.000000
1            1  fluctuations         0.914141
2            2           can         0.474747 <--|
3            3       provide         1.000000    |-- Position swapped in one file
4            4   interesting         0.474747 <--|
5            5   information         1.000000
6            6         about         1.000000
7            7  entanglement         1.000000
8            8    properties         1.000000
9            9           and         1.000000
10          10  correlations         1.000000
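
A minimal sketch of what order-insensitive scoring could look like: each ground-truth token is compared against every extracted token and keeps its best score, instead of being compared only with the token at the same index. The helper name `best_match_scores` is hypothetical, and `ratio` is a stand-in built on the standard library's `difflib`, not Jaro-Winkler; `jarowinkler.similarity` could be passed as `sim` instead.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in similarity in [0, 1]; NOT Jaro-Winkler (assumption for the sketch).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match_scores(gt_tokens, ex_tokens, sim=ratio):
    """For each ground-truth token, return its highest similarity to any extracted token."""
    return [max((sim(g, e) for e in ex_tokens), default=0.0) for g in gt_tokens]


gt = ["Bipartite", "fluctuations", "can", "provide", "interesting"]
ex = ["Bipartite", "fluctuations", "interesting", "provide", "can"]

# Position-by-position comparison penalises the swapped tokens ...
positional = [ratio(g, e) for g, e in zip(gt, ex)]
# ... while best-match comparison does not.
scores = best_match_scores(gt, ex)
```

This ignores position entirely and lets one extracted token match several ground-truth tokens, so it may be too lenient if word order matters for the evaluation.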

2. The dataframes might not always be the same size. When one of the dataframes contains fewer elements, my solution raises an error:

    raise ValueError(
ValueError: Length of values (10) does not match length of index (11)
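
The error appears to come from `zip`, which stops at the shorter input, so the list has 10 values for an index of 11. A minimal sketch of one way around it, assuming missing tokens should simply score low: pad the shorter side with empty strings using `itertools.zip_longest`. The helper name `positional_scores` is hypothetical, and `ratio` is again a `difflib` stand-in rather than Jaro-Winkler.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def ratio(a, b):
    # Stand-in similarity in [0, 1]; NOT Jaro-Winkler (assumption for the sketch).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def positional_scores(gt_tokens, ex_tokens, sim=ratio):
    """Score tokens index by index, padding the shorter list with ''."""
    pairs = zip_longest(gt_tokens, ex_tokens, fillvalue="")
    # Trim to the ground-truth length so the result fits a column of df_gt.
    return [sim(g, e) for g, e in pairs][: len(gt_tokens)]


gt = ["Bipartite", "fluctuations", "can", "provide", "interesting",
      "information", "about", "entanglement", "properties", "and",
      "correlations"]
ex = gt[:10]  # one token missing, as in the failing case

scores = positional_scores(gt, ex)
```

The resulting list always has one score per ground-truth token, so assigning it to `df_gt['jarowinkler_sim']` should no longer hit the length check.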

How can I solve these two problems and compute the similarity accurately?

Thanks!

TSV FILES

1. df_ex

    abstract_ex
0   Bipartite
1   fluctuations
2   interesting
3   provide
4   can
5   information
6   about
7   entanglement
8   properties
9   and
10  correlations

2. df_gt

    abstract_gt
0   Bipartite
1   fluctuations
2   interesting
3   provide
4   can
5   information
6   about
7   entanglement
8   properties
9   and
10  correlations
Pert8S