I have created an index with the fields (id, title, url, content) for storing web page information gathered by a crawler. Now I want to search that index with multi-word queries (and also Boolean queries). Please suggest good, efficient search algorithms (with some examples) and efficient query parsing.
1 Answer
Do you want to search only the title, or the content too? Assuming you want to allow partial searches on the title, which return a URL and/or the content, the schema would be:
from whoosh.fields import Schema, ID, NGRAM, TEXT, STORED

schema = Schema(id=ID(stored=True),
                title=NGRAM(minsize=2, maxsize=20, stored=True, sortable=True),
                url=STORED(),
                content=STORED())
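The snippets below assume an open index object ix. A minimal sketch of creating it on disk (the directory name "indexdir" is just an example):

import os
from whoosh import index

if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = index.create_in("indexdir", schema)  # use index.open_dir("indexdir") on later runs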
This works fine with the standard Whoosh searcher up to roughly 1,000,000 titles. With more entries, the n-gram index becomes very large and slow.
Also, use stopwords to reduce the index size:
stopwords = set(['of', 'by', 'the','in','for','a']) #words to be excluded from the index
def create_whoosh(self):
    writer = ix.writer()
    for doc in documents:  # "documents" is the iterable of crawled pages
        # drop stopwords from the title before indexing
        words = [w for w in doc.title.split(" ") if w not in stopwords]
        writer.add_document(title=" ".join(words), url=doc.url, content=doc.content)
    writer.commit()
The searcher:

from whoosh.qparser import QueryParser

def lookup(self, terms):
    with ix.searcher() as src:
        # parse the query against the indexed "title" field
        query = QueryParser("title", ix.schema).parse(terms)
        results = src.search(query, limit=30)
        # build the result list while the searcher is still open
        return [[r['url'], r['content']] for r in results]
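On the Boolean part of the question: Whoosh's QueryParser already understands AND, OR, NOT and parentheses, so Boolean queries work with the same lookup, for example:

query = QueryParser("title", ix.schema).parse('crawler AND (python OR java) NOT php')

By default, bare terms are AND-ed together; pass group=OrGroup if you want them OR-ed instead:

from whoosh.qparser import OrGroup

parser = QueryParser("title", ix.schema, group=OrGroup)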
If you want to search for full words in both title and content instead, you can use:
schema = Schema(id=ID(stored=True),
                title=TEXT(stored=True),
                url=STORED(),
                content=TEXT(stored=True))
This will not work for substring search, but it copes well with a few million documents (depending on the size of the content).
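With this schema both fields are indexed, so one query can cover title and content at once. A sketch with MultifieldParser (the function name lookup_multi is just for illustration):

from whoosh.qparser import MultifieldParser

def lookup_multi(terms):
    # searches the same query string across both indexed fields
    parser = MultifieldParser(["title", "content"], schema=ix.schema)
    with ix.searcher() as src:
        results = src.search(parser.parse(terms), limit=30)
        return [[r['url'], r['content']] for r in results]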
For indexing ~10 million documents, you'll need to store the content separately in some kind of database and look up only the ID with Whoosh.
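A minimal sketch of that split, with SQLite standing in for the database (the pages table and its columns are made up for illustration): Whoosh indexes only the title plus the id, and the heavy content is fetched by id afterwards.

import sqlite3
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

# slim Whoosh schema: only what is searched, plus the key into the DB
schema = Schema(id=ID(stored=True, unique=True), title=TEXT)

db = sqlite3.connect("pages.db")  # pages(id, url, content) assumed to exist

def lookup_big(terms):
    with ix.searcher() as src:
        query = QueryParser("title", ix.schema).parse(terms)
        ids = [r['id'] for r in src.search(query, limit=30)]
    if not ids:
        return []
    # fetch url and content for the matching ids from SQLite
    placeholders = ",".join("?" * len(ids))
    rows = db.execute("SELECT url, content FROM pages WHERE id IN (%s)"
                      % placeholders, ids)
    return rows.fetchall()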

Nutiu Lucian