I'm trying to implement a similarity function using

  • N-Grams
  • TF-IDF
  • Cosine Similarity

Concept:

words = [...]
word = '...'
similarity = predict(words,word)

def predict(words, word):
    words_ngrams = create_ngrams(words, range=(2, 4))
    word_ngrams = create_ngrams(word, range=(2, 4))

    words_tokenizer = tfidf_tokenizer(words_ngrams)
    word_vec = words_tokenizer.transform(word)

    return cosine_similarity(word_vec, words_tokenizer)

I searched the web for a simple and safe implementation, but I couldn't find one that uses well-known Python packages such as sklearn, nltk, scipy, etc.
Most of them use "self-made" calculations.

I'm trying to avoid coding every step by hand, and I'm guessing there is an easy fix for all of 'that pipeline'.

Any help (and code) would be appreciated. Thanks :)

Sahar Millis

1 Answer

Eventually I figured it out...

For whoever needs a solution to this question, here's a function I wrote that takes care of it:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def add_prediction_by_ngram_tfidf_cosine(from_column_name, ngram_range=(2, 4)):
    '''
    ### N-Gram & TF-IDF & Cosine Similarity
    Using n-gram on 'from column' with TF-IDF to predict the 'to column'.
    Adding to the df a 'cosine_similarity' feature with the numeric result.
    '''
    global df
    # Character n-gram TF-IDF, fitted on the values of df.FromColumn
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=ngram_range)
    vectorizer.fit(df.FromColumn)

    # Note: despite its name, the argument is used as the query string itself
    w = from_column_name
    vec_word = vectorizer.transform([w])

    # Vectorize each row, then score it against the query vector
    df['vec'] = df.FromColumn.apply(lambda x: vectorizer.transform([x]))
    df['cosine_similarity'] = df.vec.apply(lambda x: cosine_similarity(x, vec_word)[0][0])

    # Drop the temporary column of sparse vectors
    df = df.drop(['vec'], axis=1)

Note: it's not production-ready.
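
For reference, the same idea can probably be written without the global DataFrame, closer to the predict(words, word) concept from the question. This is only a minimal sketch under a few assumptions: the function name predict, its ngram_range parameter, and the sample word list are illustrative and not taken from any library. It relies on TfidfVectorizer with analyzer='char' to do the n-gram and TF-IDF steps together, and on cosine_similarity accepting the whole sparse matrix at once, so no per-row apply is needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def predict(words, word, ngram_range=(2, 4)):
    """Return the cosine similarity between `word` and each entry of `words`."""
    # Character n-grams and TF-IDF weighting are handled in one step
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=ngram_range)
    words_matrix = vectorizer.fit_transform(words)  # shape: (len(words), n_features)
    word_vec = vectorizer.transform([word])         # shape: (1, n_features)
    # One similarity score per candidate, in the same order as `words`
    return cosine_similarity(word_vec, words_matrix)[0]


# Hypothetical sample data, for illustration only
words = ['apple', 'apply', 'banana']
scores = predict(words, 'apple')
best = words[scores.argmax()]
```

Here scores is a NumPy array with one value per candidate, so scores.argmax() gives the index of the closest match. Fitting the vectorizer on every call keeps the sketch short; for repeated queries against the same list it would likely be better to fit once and reuse the fitted vectorizer and matrix.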

Sahar Millis