I have created an index with the fields (id, title, url, content) for storing web page information gathered by crawling. Now I want to search that index with multi-word queries (and Boolean queries). Please suggest good, efficient search algorithms (with some examples) and efficient query parsing.

WXy

1 Answer


Do you want to search only the title, or the content too? Assuming you want to allow partial matches on the title and return the URL and/or the content, the schema would be:

    from whoosh.fields import Schema, ID, NGRAM, STORED

    schema = Schema(id=ID(stored=True),
                    title=NGRAM(minsize=2, maxsize=20, stored=True, sortable=True),
                    url=STORED(),
                    content=STORED())

This works fine with the standard Whoosh searcher up to roughly 1,000,000 titles. For more entries, the ngram index becomes very big and slow.
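
The writer below uses an index object `ix`, so for completeness here is a minimal sketch of creating it on disk from the schema above; the directory name "indexdir" is just an example:

    import os
    from whoosh import index

    # create the index directory once; reopen later with index.open_dir()
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
    ix = index.create_in("indexdir", schema)  # `schema` from above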

Also, use stopwords to reduce the index size:

    stopwords = set(['of', 'by', 'the', 'in', 'for', 'a'])  # words to exclude from the index

    def create_whoosh(self):
        # `ix` is the open index and `documents` holds the crawled pages
        writer = ix.writer()
        for doc in documents:
            # drop stopwords from the title before indexing
            words = [w for w in doc.title.split(" ") if w not in stopwords]
            writer.add_document(title=" ".join(words), url=doc.url, content=doc.content)
        writer.commit()

The searcher:

    from whoosh.qparser import QueryParser

    def lookup(self, terms):
        with ix.searcher() as src:
            # parse against the "title" field, the only indexed field in this schema
            query = QueryParser("title", ix.schema).parse(terms)
            results = src.search(query, limit=30)
            return [[r['url'], r['content']] for r in results]
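
Since you also asked about Boolean queries: the default QueryParser grammar already understands AND, OR, NOT and parentheses, so no extra parsing code is needed. A short sketch (the query string is just an example):

    from whoosh.qparser import QueryParser

    with ix.searcher() as src:
        parser = QueryParser("title", ix.schema)
        # AND, OR, NOT and parentheses are part of the default grammar
        query = parser.parse('python AND (tutorial OR guide) NOT beginner')
        for hit in src.search(query, limit=30):
            print(hit['url'])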

If you intend to search for full words in the title and the content, you can do:

    from whoosh.fields import Schema, ID, TEXT, STORED

    schema = Schema(id=ID(stored=True), title=TEXT(stored=True),
                    url=STORED(), content=TEXT(stored=True))

This will not work for substring search, but it copes well with several million documents (depending on the size of the content).
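
With this schema you can query the title and the content together using MultifieldParser; a sketch, with field boosts that are purely illustrative:

    from whoosh.qparser import MultifieldParser

    def lookup_fulltext(ix, terms):
        # search both fields; matches in the title count double
        parser = MultifieldParser(["title", "content"], ix.schema,
                                  fieldboosts={"title": 2.0, "content": 1.0})
        with ix.searcher() as src:
            results = src.search(parser.parse(terms), limit=30)
            return [[r['url'], r['content']] for r in results]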

For indexing ~10 million documents, you'll need to store the content separately in some sort of database and look up only the ID with Whoosh.
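
A minimal sketch of that split, assuming a hypothetical SQLite table pages(id, content) holding the page bodies:

    import sqlite3
    from whoosh.qparser import QueryParser

    def lookup_content(ix, db_path, terms):
        # Whoosh returns only the stored id; the content lives in SQLite
        with ix.searcher() as src:
            hits = src.search(QueryParser("title", ix.schema).parse(terms), limit=30)
            ids = [hit['id'] for hit in hits]
        if not ids:
            return []
        placeholders = ",".join("?" * len(ids))
        with sqlite3.connect(db_path) as conn:
            return conn.execute(
                "SELECT id, content FROM pages WHERE id IN (%s)" % placeholders,
                ids).fetchall()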

Nutiu Lucian