How can I rank the rows of a data frame based on their values? That is, I have a column that contains text data and I want to assign a rank based on similarity.

Input

Expected output

I have tried the Levenshtein distance, but I am not sure how to apply it to the whole table.

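import string

import editdistance as ed  # assumption: ed.eval below matches the editdistance package
import pandas as pd


def bow(x):
    # Normalise a name into a lowercase, alphabetically sorted "bag of words".
    x = x.lower()
    words = x.split(' ')
    words.sort()
    x = ' '.join(words)

    # Strip punctuation and digits.
    exclude = set(string.punctuation + string.digits)
    x = ''.join(ch for ch in x if ch not in exclude)
    x = '{} '.format(x.strip())
    return x

#intents = load_intents(export=True)
df['bow'] = df['name'].apply(bow)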

df.sort_values(by='bow',ascending=True,inplace=True)

last_bow = ''
recs = []
for idx,row in df.iterrows():
    
    record = { 
        'name': row['name'],
        'bow': row['bow'],
        'lev_distance': ed.eval(last_bow,row['bow'])
    }
    recs.append(record)
    last_bow = row['bow']

intents = pd.DataFrame(recs,columns=['name','bow','lev_distance'])

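# lev_distance_range is a threshold defined elsewhere (not shown here).
l = intents[intents['lev_distance'] <= lev_distance_range]

# Keep each matching row together with the row just before it.
r = []
for x in l.index.values:
    if x > 0:  # row 0 has no predecessor; x - 1 would wrap around to the last row
        r.append(x - 1)
    r.append(x)

r = sorted(set(r))

l = intents.iloc[r, :]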
Kum_R
  • Is your problem about how to calculate the distance, or about how to sort and rank your dataframe? It seems like you're calculating the distance between consecutive rows; are you sure that this is what you want? Maybe you need the distance between all pairs of words? – aaossa Feb 25 '22 at 14:32
  • @aaossa Yes, I need to calculate the distance between all the row values and arrange the rows in order of score. – Kum_R Feb 26 '22 at 15:09
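
For reference, the all-pairs comparison discussed in these comments could be sketched roughly as follows. This is only an illustration, not the asker's code: the helper names `levenshtein` and `pairwise_distances` are made up here, the sample strings are borrowed from the answer further down, and a plain-Python edit distance is used so the snippet runs without extra packages.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def pairwise_distances(texts):
    """Return a square matrix: entry [i][j] is the distance between texts i and j."""
    lowered = [t.lower() for t in texts]  # compare case-insensitively
    return [[levenshtein(a, b) for b in lowered] for a in lowered]


texts = ["shrez", "Indya", "Indi", "shez", "india"]
matrix = pairwise_distances(texts)

# For every row, report its closest other row.
for i, text in enumerate(texts):
    j = min((k for k in range(len(texts)) if k != i), key=lambda k: matrix[i][k])
    print(f"{text!r} is closest to {texts[j]!r} (distance {matrix[i][j]})")
```

Unlike the consecutive-row loop in the question, every row is compared with every other row here, so the cost grows quadratically with the number of rows.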

1 Answer

Using textdistance, you could try this:

import pandas as pd
import textdistance

df = pd.DataFrame(
    {
        "text": [
            "Rahul dsa",
            "Rasul dsad",
            "Raul ascs",
            "shrez",
            "Indya",
            "Indi",
            "shez",
            "india",
            "kloa",
            "klsnsd",
        ],
    }
)

df = (
    df
    .assign(
        match=df["text"].map(
            lambda x: [
                i
                for i, text in enumerate(df["text"])
                if textdistance.jaro_winkler(x, text) >= 0.9
            ]
        )
    )
    .sort_values(by="match")
    .drop(columns="match")
)


print(df)
# Output
         text
0   Rahul dsa
1  Rasul dsad
2   Raul ascs
3       shrez
6        shez
4       Indya
5        Indi
7       india
8        kloa
9      klsnsd
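
If an explicit rank number per group is wanted rather than just a sorted order, one possible variation is sketched below. It is not part of the code above and rests on a few assumptions: `difflib.SequenceMatcher` from the standard library stands in for `textdistance.jaro_winkler` (its scores differ, so the 0.75 threshold is only illustrative), the function name `group_ranks` is made up, and rows are grouped as connected components, so two rows can share a rank through a common neighbour even when they do not match each other directly.

```python
from difflib import SequenceMatcher


def group_ranks(texts, threshold=0.75):
    """Return a 1-based group number per text; similar texts share a number."""
    lowered = [t.lower() for t in texts]  # compare case-insensitively
    parent = list(range(len(lowered)))    # union-find forest, one node per row

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of rows whose similarity reaches the threshold.
    for i in range(len(lowered)):
        for j in range(i + 1, len(lowered)):
            score = SequenceMatcher(None, lowered[i], lowered[j]).ratio()
            if score >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    # Number the groups in order of first appearance.
    ranks, seen = [], {}
    for i in range(len(lowered)):
        root = find(i)
        seen.setdefault(root, len(seen) + 1)
        ranks.append(seen[root])
    return ranks


texts = ["shrez", "Indya", "Indi", "shez", "india"]
ranks = group_ranks(texts)
for text, rank in sorted(zip(texts, ranks), key=lambda pair: pair[1]):
    print(rank, text)
```

With a dataframe like the one above, something along the lines of `df["rank"] = group_ranks(df["text"])` followed by `df.sort_values("rank")` should give a comparable ordering plus a rank column, though the threshold would need tuning on real data.
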
Laurent
  • @Laurent Thank you, but I am facing issues when the token length is more than 20 characters. The method is not able to calculate the distance. – Kum_R Feb 28 '22 at 10:16
  • Thank you, I have accepted the answer. Could you please help here? Creating a new question would be a duplicate. – Kum_R Mar 03 '22 at 10:47
  • I kindly suggest that you post a new question with new data where tokens are longer than 20 characters, which is not the case in your original question (which is solved, as it is). Cheers. – Laurent Mar 06 '22 at 09:06
  • Please find the new question here: https://stackoverflow.com/questions/71379158/rank-the-row-based-on-the-similar-sentences-using-python-or-sql – Kum_R Mar 07 '22 at 09:50