I am having trouble with the following problem. I receive a job offer as text and need to find which of the terms listed in my CSV file it mentions. A term can span multiple tokens (up to 4). I also have to account for misspellings and abbreviations, so a direct matching algorithm does not give good results. How can I check whether the terms in my CSV file are mentioned in the text? Keep in mind that I do not have a large dataset.
My original plan was to do a similarity match between the terms in my CSV file and the whole text. To handle misspellings and abbreviations, I added a column with possible variations/abbreviations and ran the similarity match on those as well. If a similarity score was above a certain threshold and was the highest match, I counted it as a 'match'. To match multi-word terms, I included n-grams in the similarity match. However, I got a lot of false positives, and even raising the threshold did not solve the issue.
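A minimal sketch of this kind of n-gram similarity matching is shown below. It assumes `difflib.SequenceMatcher` from the Python standard library as a stand-in for the similarity measure, and the function names, the 0.85 threshold, and the sample terms and offer text are all hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def ngrams(tokens, max_n=4):
    """Yield every contiguous run of 1..max_n tokens as a single string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def best_matches(text, terms, threshold=0.85, max_n=4):
    """Map each canonical term to its best (n-gram, score) if score >= threshold.

    `terms` maps a canonical term to a list of variants (misspellings,
    abbreviations); the canonical term itself is always tried as well.
    """
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    candidates = list(ngrams(tokens, max_n))
    found = {}
    for term, variants in terms.items():
        best = (None, 0.0)
        for form in [term, *variants]:
            form = form.lower()
            for cand in candidates:
                score = SequenceMatcher(None, form, cand).ratio()
                if score > best[1]:
                    best = (cand, score)
        if best[1] >= threshold:
            found[term] = best
    return found


# Hypothetical data standing in for the CSV rows and a job offer.
terms = {
    "machine learning": ["ml"],
    "project management": ["proj. mgmt"],
}
offer = "We need someone with machne learning experience and strong communication"
matches = best_matches(offer, terms)
```

Here `matches` maps each matched canonical term to the n-gram that scored highest and its score. Because every term form is compared against every n-gram of the text, the number of comparisons grows quickly, which also gives many chances for an unrelated n-gram to score above the threshold.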
I also tried building a custom NER model, which worked decently. I even used the NER model to extract potentially relevant words and then ran a similarity match on them, which gave good results. However, my solution needs to be easily expandable. Adding new terms to the CSV file is easy, but retraining the NER model each time isn't ideal.