Given I have a dataframe that contains rows like this
ID | Title | Abstract | Keywords | Author | Year |
---|---|---|---|---|---|
5875 | Textual Similarity: A Review | Textual Similarity has been used for measuring ... | X, Y, Z | James Thomas | 2018 |
8596 | Natural Language Processing: A Review | Natural Language Processing has been used for ... | NLP, AI, BERT | Rami John | 2015 |
4586 | Textual Similarity: Systematic Review | Text Similarity is being used for | Y, Z, AI | J Thomas | 2018 |
I would like to make a function deduplicate
which can ingest the dataframe and outputs a matrix that allows me to compare the records with each other.
def deduplicate(df):
matrix = take in each row and compute a similarity matrix
return matrix
Whereas matrix can be
ID | 5875 | 8596 | 4586 |
---|---|---|---|
5875 | 1 | 0.4 | 0.9 |
8596 | 0.4 | 1 | 0.5 |
4586 | 0.9 | 0.5 | 1 |
This will allow me to find which records are similar to each other by comparing how similar the records are. I think I need to use some NLP Models here, as the rows contain textual as well as numerical data.
Is there a way in Python to do this? Some people suggest using dedupe, but due to privacy laws at place in my organization, we can only have in-house capacity for the same. Any suggestions would be welcome.