
I am currently having trouble with the following. I receive a job offer (a text) and have to extract from it certain terms that are listed in my CSV file. These terms can be multiple tokens long (up to 4 tokens). However, I have to keep in mind that there can be misspellings and abbreviations, so a direct matching algorithm wouldn't give me a good result. What can I do to check whether the terms in my CSV file are mentioned in the text? Keep in mind that I do not have a large dataset.

My original plan was to do a similarity match between the terms in my CSV file and the whole text. To handle misspellings and abbreviations, I added a column with possible variations/abbreviations and ran the similarity match on those as well. If a similarity score was above a certain threshold and was the highest match, I counted it as a 'match'. To support multi-word matching, I included n-grams in the similarity match. However, I got a lot of false positives, and even setting a higher threshold did not solve my issue.

I also tried building a custom NER model, which worked decently. I even used the NER model to extract potentially relevant words and then did a similarity match, which gave good results. However, my solution needs to be easily extensible: adding new terms to the CSV file is easy, but retraining the NER model each time isn't ideal.

1 Answer


What you could do is the following:

(0) Fix contractions in your text using the Python contractions library: https://pypi.org/project/contractions/. An example:

>>> import contractions
>>> text = "You're doing great, y'all!"
>>> contractions.fix(text)
'You are doing great, you all!'

(1) Word-tokenize your text (e.g. split it on spaces), and pass it through a spell checker (e.g. using NLTK: https://www.geeksforgeeks.org/correcting-words-using-nltk-in-python/; an older approach is to use Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance).
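Here is a minimal sketch of the edit-distance approach from the linked article, picking the dictionary word closest to each token. correct_spelling is a hypothetical helper, and the first-letter filter is just a crude speed-up:

import nltk
from nltk.corpus import words
from nltk.metrics.distance import edit_distance

nltk.download("words", quiet=True)
dictionary = words.words()

def correct_spelling(token):
    # crude pre-filter: only consider dictionary words sharing the first letter
    candidates = [w for w in dictionary if w and w[0] == token[0]]
    # return the candidate with the smallest Levenshtein (edit) distance
    return min(candidates, key=lambda w: edit_distance(token, w))

print(correct_spelling("happpy"))  # -> "happy"

Scanning a full dictionary per token is slow; a dedicated spell checker would be preferable in practice, but this illustrates the idea.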

(2) Rely on a syntactic parser (e.g. spaCy) to solve the case/conjugation problem: lemmatize both the text and your CSV terms so that inflected forms still match. Parse the text after correcting it with the spell checker, and use batching through the pipe() function for speed (How to speed up Spacy's nlp call?). Then apply an n-gram extractor (e.g. https://www.askpython.com/python/examples/n-grams-python-nltk) to the lemmatized text and check, for each word sequence in your CSV file, whether its lemmatized token sequence occurs among the text's n-grams. For example, take the CSV term "green tree" (lemmatized: "green tree") and the text "there are many green trees in the street" (lemmatized: "there be many green tree in the street"): the n-grams obtained from this text will contain "green tree".
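As an illustration, here is a minimal sketch of the lemmatize-and-match idea, assuming the en_core_web_sm model is installed; lemma_ngrams and csv_terms are hypothetical names:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def lemma_ngrams(doc, n_max=4):
    # all 1- to n_max-grams over the lowercased lemmas (terms are up to 4 tokens)
    lemmas = [t.lemma_.lower() for t in doc if not t.is_punct]
    return {
        " ".join(lemmas[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(lemmas) - n + 1)
    }

csv_terms = ["green tree"]  # terms read from your CSV file
texts = ["There are many green trees in the street."]

for doc in nlp.pipe(texts):  # pipe() batches documents for speed
    grams = lemma_ngrams(doc)
    for term in csv_terms:
        lemmatized_term = " ".join(
            t.lemma_.lower() for t in nlp(term) if not t.is_punct
        )
        if lemmatized_term in grams:
            print("matched:", term)  # prints: matched: green tree

Lemmatizing the CSV terms once up front, instead of inside the loop, would be the obvious optimization as the list grows.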

(3) The abbreviations problem is tricky. You could use regexes to identify abbreviations: Detect abbreviations in the text in python. To expand them, you could use a probabilistic approach as suggested here: https://loeb.nyc/blog/data-science-word-expander.
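To give an idea, here is a rough sketch of such a regex; the pattern below is a heuristic of my own (all-caps tokens, dotted sequences like "e.g.", short capitalized words ending in a period), not the exact one from the linked question:

import re

# match 2-5 letter all-caps tokens, dotted sequences, or short dotted titles
ABBREV_RE = re.compile(r"\b(?:[A-Z]{2,5}\b|(?:[A-Za-z]\.){2,}|[A-Z][a-z]{1,3}\.)")

text = "The CEO met Dr. Smith at HQ, e.g. over lunch."
print(ABBREV_RE.findall(text))  # -> ['CEO', 'Dr.', 'HQ', 'e.g.']

Once detected, the abbreviations can be fed into the probabilistic expander from the blog post above.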
