AI Based Deduplication using Textual Similarity Measure in Python

Question

I have a dataframe that contains rows like this:

ID	Title	Abstract	Keywords	Author	Year
5875	Textual Similarity: A Review	Textual Similarity has been used for measuring ...	X, Y, Z	James Thomas	2018
8596	Natural Language Processing: A Review	Natural Language Processing has been used for ...	NLP, AI, BERT	Rami John	2015
4586	Textual Similarity: Systematic Review	Text Similarity is being used for	Y, Z, AI	J Thomas	2018

I would like to write a function `deduplicate` that takes the dataframe and outputs a matrix that lets me compare the records with each other.

def deduplicate(df):
    # take in each row and compute a pairwise similarity matrix
    matrix = ...
    return matrix

where the matrix could look like this:

ID	5875	8596	4586
5875	1	0.4	0.9
8596	0.4	1	0.5
4586	0.9	0.5	1

This will allow me to find which records are likely duplicates by comparing their pairwise similarity scores. I think I need to use some NLP models here, as the rows contain textual as well as numerical data.

Is there a way to do this in Python? Some people suggest using dedupe, but due to privacy laws in place at my organization, we can only rely on in-house capacity for this. Any suggestions would be welcome.

score 1 · Answer 1 · answered Nov 08 '21 at 14:25


The easiest way to improve your comparison is to use TF-IDF (comprehensive explanation here).

One of the main weaknesses of the fuzzywuzzy package is that it ignores how important each string unit (subtoken, token, 2-gram, and so on) is. For example, two documents that contain the word Unicorn are most probably more similar to each other than two documents that contain the word USA, because Unicorn is much rarer overall. This is where a handy tool named TF-IDF comes into play. TF-IDF weights each term (word n-gram or character n-gram) by its rarity when measuring similarity. Moreover, it's easy to use thanks to the sklearn library.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Your corpus
corpus = ['The sun is the largest celestial body in the solar system',
          'The solar system consists of the sun and eight revolving planets',
          'Ra was the Egyptian Sun God',
          'The Pyramids were the pinnacle of Egyptian architecture',
          'The quick brown fox jumps over the lazy dog']

# Initialize an instance of the tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)
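
The answer mentions that TF-IDF can also weight character n-grams, which tends to be more forgiving of spelling variants than whole-word tokens. A minimal sketch of that variant follows, applied to the three titles from the question. The `char_wb` analyzer and the `ngram_range=(2, 4)` setting are illustrative choices, not something the answer prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Titles taken from the example dataframe in the question
titles = [
    "Textual Similarity: A Review",
    "Natural Language Processing: A Review",
    "Textual Similarity: Systematic Review",
]

# analyzer="char_wb" builds character n-grams inside word boundaries;
# the 2-to-4 character range is an assumption chosen for illustration
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
char_matrix = char_vectorizer.fit_transform(titles)

# Pairwise cosine similarity between all titles (3 x 3 matrix)
char_sim = cosine_similarity(char_matrix)
print(char_sim.round(2))
```

The first and third titles should score noticeably higher against each other than either does against the second one.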

There are plenty of more advanced methods you can exploit to improve the result.

answered Nov 08 '21 at 14:25

meti


This is helpful. In your case, `corpus` contains strings. In my case, do you recommend making a string taking all the columns of each row and then computing the same? – aayush_malik Nov 08 '21 at 14:56

I think it would be easier to do so, but my choice would be to calculate the similarity separately for each column and then combine the scores using a weighted average: `final_similarity_score = w1*Title_similarity+w2*Abstract_similarity+...` The weights can be trained or chosen based on heuristics. @AMal – meti Nov 08 '21 at 15:15
When you say "more advanced methods", do you mean more advanced alternatives to the cosine similarity metric? – aayush_malik Nov 09 '21 at 10:58
I meant advanced techniques for doc2vec (converting sentences into vectors) instead of simple (yet powerful) TF-IDF vectorization. For example, embedding with a transformer-based model. @AMal – meti Nov 09 '21 at 11:46
Yes, I thought so. I was trying this out using `sent2vec` and I have got the sentence embeddings. The next step, I guess, is to compute the distance between each pair of vectors and build a matrix from that. Right? – aayush_malik Nov 09 '21 at 11:56
Did you try `doc2vec`? I think it can better accommodate your needs with less effort. – meti Nov 09 '21 at 13:18
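
A minimal sketch of the weighted per-column approach from the comment above, assuming pandas and sklearn are available. The `weights` argument, the 0.6/0.4 values, and the choice to use only the `Title` and `Abstract` columns are hypothetical; the abstracts are the truncated strings from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def deduplicate(df, weights):
    """Return a DataFrame of weighted per-column TF-IDF cosine similarities."""
    total = sum(weights.values())
    combined = 0.0
    for column, weight in weights.items():
        # Fit one TF-IDF model per column so each field has its own vocabulary
        texts = df[column].fillna("").astype(str)
        vectors = TfidfVectorizer().fit_transform(texts)
        # Weights are normalized so the combined score stays within [0, 1]
        combined = combined + (weight / total) * cosine_similarity(vectors)
    ids = df["ID"].tolist()
    return pd.DataFrame(combined, index=ids, columns=ids)


df = pd.DataFrame({
    "ID": [5875, 8596, 4586],
    "Title": [
        "Textual Similarity: A Review",
        "Natural Language Processing: A Review",
        "Textual Similarity: Systematic Review",
    ],
    "Abstract": [
        "Textual Similarity has been used for measuring ...",
        "Natural Language Processing has been used for ...",
        "Text Similarity is being used for",
    ],
})

# Hypothetical weights: w1 for Title, w2 for Abstract
matrix = deduplicate(df, weights={"Title": 0.6, "Abstract": 0.4})
print(matrix.round(2))
```

Non-text columns such as `Year` would need their own similarity function (for example, exact match) before being added to the weighted sum.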
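
The step described in the comment above, turning a set of sentence embeddings into a pairwise matrix, can be sketched with NumPy alone. The `embeddings` array here is made-up placeholder data with four dimensions; in practice each row would be the vector the embedding model returns for one record.

```python
import numpy as np

# Hypothetical embeddings, one row per record (placeholder values,
# not the output of any real model)
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.8, 0.3, 0.0],
    [0.8, 0.2, 0.1, 0.3],
])

# Scale every row to unit length so a dot product equals cosine similarity
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms

# Pairwise cosine similarity, shape (n_records, n_records)
similarity = normalized @ normalized.T

# Cosine distance, if a distance matrix is preferred over a similarity matrix
distance = 1.0 - similarity
print(similarity.round(2))
```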
