
I have a list of, let's say, "forbidden sentences" (1000 of them, each around 40 words long). I want to create a tool that will find and mark them in a given document.

The problem is that a forbidden sentence may appear in the document in a different form than on the list: same meaning, but with synonyms, a few words added or removed, different word order, punctuation, grammar, etc. It does not help that everything is in Polish, where each noun, pronoun, and adjective can take up to 14 case forms (7 cases, each in singular and plural), and modifiers and gender change the words further. I would also like the found sentences to be ranked by the probability of them being forbidden, so that those with weaker resemblance are still shown, but lower down.

I studied IT for two years but I don't have much knowledge of NLP. Do you think an amateur could do this? Could you give me some advice on where to start and which tools to use to put it all together? No need to be fancy, just practical. I was hoping to find some ready-to-use code, because I imagine something like this has been done before. Any ideas where to find such resources or what keywords to search for? I'd really appreciate some help, because I'm very new to this and need to start with the basics.

Thanks in advance,

Kamila

1 Answer


Probably the easiest first try is the Polish spaCy project, which extends spaCy, a popular production-ready NLP library, with support for the Polish language.

http://spacypl.sigmoidal.io/#home

You can try to do it like this:

  • Split document into sentences.
  • Clean these sentences with spaCy (remove stopwords and punctuation, and lemmatize; lemmatization maps the many different inflected versions of a word to one base form)
  • Clean the "forbidden sentences" in the same way
  • Prepare a vector representation of each sentence (a processed spaCy Doc exposes one through its vector attribute when the loaded model includes word vectors)
  • Calculate the similarity between each document sentence and each forbidden sentence, e.g. with cosine similarity
  • Set a threshold: a document sentence whose similarity to any of the "forbidden sentences" reaches it is treated as forbidden
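
The steps above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not the Polish spaCy implementation: word-count vectors stand in for spaCy's sentence vectors, no lemmatization is done, and the stopword list, the function names (split_sentences, normalize, rank_sentences) and the 0.5 threshold are all assumptions made up for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real Polish list is much longer).
STOPWORDS = {"i", "w", "na", "z", "do", "to", "się", "nie", "że"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def normalize(sentence):
    """Lowercase, drop punctuation and stopwords. No lemmatization happens here."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(document, forbidden, threshold=0.5):
    """Return (score, sentence, closest forbidden sentence), best match first."""
    forbidden_vecs = [Counter(normalize(f)) for f in forbidden]
    if not forbidden_vecs:
        return []
    hits = []
    for sentence in split_sentences(document):
        vec = Counter(normalize(sentence))
        # Keep only the most similar forbidden sentence for this document sentence.
        score, idx = max((cosine(vec, fv), i) for i, fv in enumerate(forbidden_vecs))
        if score >= threshold:
            hits.append((score, sentence, forbidden[idx]))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


if __name__ == "__main__":
    forbidden = ["Kot siedzi na macie."]
    document = "Pies biega w parku. Na macie siedzi kot! Kot siedzi."
    for score, sentence, match in rank_sentences(document, forbidden):
        print(f"{score:.2f}  {sentence!r}  ~  {match!r}")
```

Sorting by score gives the ranking asked for in the question. Word counts only catch reordering and small additions or removals; to handle inflection and synonyms, normalize would have to return lemmas (spaCy tokens expose lemma_, is_stop and is_punct for this, if a Polish model is available) and the count vectors would have to be replaced by real word or sentence vectors.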

If anything is not clear let me know.

Good luck!

kalbarena