I have been going through many libraries such as Whoosh and NLTK, and concepts such as WordNet.

However, I am unable to solve my problem. I am not sure whether a library exists for this or whether I have to build it myself using the resources mentioned above.

Question: I have to search for keywords. Say I have keywords like 'Sales Document' / 'Purchase Documents' and have to search for them in a small book of 10-15 pages.

The catch is that they can also be written as 'Sales should be documented' or 'company selling should be written in the text files' (both for the keyword 'Sales Document'). Is there an existing approach for this, or will I have to build something?

The code for the POS tags is as follows. If no library is available, I will have to proceed with this.

from nltk.corpus import wordnet
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize


def tag(x):
    """Return (token, POS tag) pairs for a string."""
    return pos_tag(word_tokenize(x))


synonyms = []
antonyms = []

# WordNet is unlikely to have an entry for the whole phrase "Sales document",
# so look up each word of the keyword separately.
for word in word_tokenize("Sales document"):
    for syn in wordnet.synsets(word):
        print(syn)
        for l in syn.lemmas():
            print(l)
            synonyms.append(l.name())
            if l.antonyms():
                antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

# Multi-word lemma names use underscores; restore spaces before tagging.
for i in set(synonyms):
    print(tag(i.replace("_", " ")))

Update: We went ahead and wrote a Python program. Feel free to fork it (pun intended). The Git Dhund is very untidy right now; we will clean it up once it is completed. It is still in development.

This is the link.

Innat
  • I'm going to vote this as off-topic since you haven't shown an attempt. I'm happy to retract the close vote if you post something that shows you've tried. – erip May 29 '18 at 19:49
  • @erip Sometimes you just don't know what you don't know! I like SO's ability to give a direction, even if there isn't a clear answer. – PANDA Stack May 29 '18 at 21:01
  • I just want to know if there is a readily available library or do I proceed with building my own repository for this requirement using the POS Tags. – Shivam Kashyap May 30 '18 at 06:25
  • This kind of indexing is built into SOLR. You can search by token proximity. – duffymo Jun 18 '18 at 13:36

1 Answer

To match occurrences like "Sales should be documented", increase the slop parameter of Whoosh's Phrase query object.

whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

slop – the number of words allowed between each "word" in the phrase; the default of 1 means the phrase must match exactly.

You can also set the slop in a query string, like this: "Sales should be documented"~5
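To make the effect of slop concrete, here is a small dependency-free sketch of the idea. It is not Whoosh's implementation: `tokenize`, `crude_stem` and `phrase_matches` are hypothetical helpers written for this illustration, the stemmer is deliberately naive, and slop is assumed to mean the maximum position gap between consecutive phrase words (so 1 means adjacent).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Hypothetical, very naive stemmer: strips a trailing 'ed' or 's'."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def phrase_matches(text, phrase, slop=1):
    """Return True if the phrase words occur in order in the text, with
    consecutive phrase words at most `slop` positions apart."""
    tokens = [crude_stem(t) for t in tokenize(text)]
    terms = [crude_stem(t) for t in tokenize(phrase)]
    if not terms:
        return False

    def search(term_idx, prev_pos):
        # All phrase terms have been placed: the phrase matches.
        if term_idx == len(terms):
            return True
        last = min(prev_pos + slop, len(tokens) - 1)
        for pos in range(prev_pos + 1, last + 1):
            if tokens[pos] == terms[term_idx] and search(term_idx + 1, pos):
                return True
        return False

    return any(tok == terms[0] and search(1, i) for i, tok in enumerate(tokens))


print(phrase_matches("Sales should be documented", "Sales Document", slop=1))
print(phrase_matches("Sales should be documented", "Sales Document", slop=3))
```

With slop=1 only the adjacent form matches; raising it to 3 lets "should be" sit between the two keyword terms. In Whoosh itself, stemming would come from the field's analyzer rather than from a helper like `crude_stem`.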


To match the second example, "company selling should be written in the text files", you need semantic processing of your texts. Whoosh has a low-level implementation of a WordNet thesaurus that lets you index synonyms, but it only provides one-word synonyms.
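As a rough sketch of what one-word synonym expansion can and cannot do, the following dependency-free example expands each keyword word into a set of alternatives and checks whether one alternative per word occurs within a small window. The `SYNONYMS` table is a hypothetical, hand-written stand-in for a real thesaurus such as WordNet, and `expanded_match` and its `window` parameter are assumptions made for this illustration, not part of Whoosh.

```python
import re
from itertools import product

# Hypothetical, hand-written synonym table; a real system would derive
# these sets from a thesaurus such as WordNet.
SYNONYMS = {
    "sales": {"sales", "sale", "selling", "sold"},
    "document": {"document", "documents", "documented", "written", "recorded"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def expanded_match(text, keyword, window=8):
    """Return True if, for every keyword word, one of its synonyms occurs
    in the text and all chosen occurrences fit inside `window` tokens."""
    tokens = tokenize(text)
    groups = [SYNONYMS.get(word, {word}) for word in tokenize(keyword)]
    if not groups:
        return False
    # Token positions at which each synonym group occurs.
    positions = [[i for i, t in enumerate(tokens) if t in g] for g in groups]
    if any(not p for p in positions):
        return False
    # Try every combination of one position per group.
    return any(max(c) - min(c) < window for c in product(*positions))


print(expanded_match("company selling should be written in the text files",
                     "Sales Document"))
```

This only works because "selling" and "written" were placed in the table by hand. Single-word synonym lists will not capture paraphrases in general, which is why the second example needs real semantic processing.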

Assem