
Assume I have a very long text and I'd like to extract a certain amount of context around a specific word. For example, in the following text I'd like to extract 8 words around the word warrior.

........

........

... died. He was a very brave warrior, fighting for freedom against the odds ...

........

........

In this case the result would be

He was a very brave warrior, fighting for freedom

Notice how I dropped the word died, as I'd prefer to start from the beginning of a full sentence, and how I extracted more than just 8 words, because fighting for freedom is much more meaningful than just fighting for.

Are there any algorithms or research in this field that I could follow? How should I approach this problem?

vondip

2 Answers

  1. You can use a regular expression to get the whole sentence that contains the word you are looking for.
  2. Then use an information extraction algorithm to pick a more meaningful set of about 8 words from that sentence.
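
As a rough illustration of step 1, here is a minimal sketch using only Python's standard `re` module. The function name `sentences_containing` and the sample text are made up for this example, and the sentence splitter is deliberately naive: it assumes a sentence ends at `.`, `!` or `?` followed by whitespace, so abbreviations such as "Dr." will break it.

```python
import re

def sentences_containing(text, word):
    """Return every sentence of `text` that contains `word` as a whole word."""
    # Naive sentence split (an assumption): break after ., ! or ?
    # when it is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # \b keeps "warrior" from matching inside e.g. "warriors".
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

text = ("The king died. He was a very brave warrior, fighting for freedom "
        "against the odds. Nobody forgot him.")
print(sentences_containing(text, "warrior"))
```

This already drops the trailing died from the previous sentence, because the match is cut at sentence boundaries. A trained sentence tokenizer would likely handle real text better than this regex.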

I found Python implementations of both.

For the regex, look here

And for the extraction algorithm, look here

Hope this helps.

Gor
  • Note that for the kind of things shown in the link, Parsey McParseface tends to do a little better than NLTK – thang Jun 11 '17 at 22:49

Let's divide your problem into parts and keep it independent of any programming language:

  1. If you want the word fight instead of fighting, you should preprocess your data. Take a look at lemmatization and stemming techniques, which reduce words to their root forms.

  2. Another text preprocessing step is to remove the stop words from your text, i.e. words such as the, will, if, but, etc.

  3. Now, to extract n words, you can define a window size n. All you have to do is write a function that takes the text and the target word, and returns the n words around each occurrence of that word. Run it over your entire text.
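
The windowing in step 3, with the optional stop-word filtering from step 2, could be sketched roughly like this in Python. The function name `context_windows`, its parameters, and the tiny stop-word set are all hypothetical choices for this example. Tokenization is a plain whitespace split, and lemmatization (step 1) is left out because it needs an external library.

```python
import re

def context_windows(text, target, n=4, stop_words=frozenset()):
    """Return up to n words on each side of every occurrence of `target`."""
    # Pair each raw token with a normalized form (punctuation stripped,
    # lowercased) so that "warrior," still matches "warrior".
    pairs = [(tok, re.sub(r"\W+", "", tok).lower()) for tok in text.split()]
    # Optional step 2: drop stop words before windowing.
    pairs = [(tok, norm) for tok, norm in pairs if norm not in stop_words]
    windows = []
    for i, (_, norm) in enumerate(pairs):
        if norm == target.lower():
            start = max(0, i - n)  # clamp at the start of the text
            windows.append(" ".join(tok for tok, _ in pairs[start:i + n + 1]))
    return windows

text = "... died. He was a very brave warrior, fighting for freedom against the odds ..."
print(context_windows(text, "warrior", n=4))
# A made-up stop-word list, just for illustration:
print(context_windows(text, "warrior", n=2,
                      stop_words={"a", "the", "for", "was", "he", "very"}))
```

Note that a fixed window ignores sentence boundaries: with n=4 the first call returns "was a very brave warrior, fighting for freedom against", which cuts off He and ends mid-phrase. Restricting the window to the sentence that contains the target word would get closer to the result asked for in the question.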

Hope this helps.

Saurabh Jain