
I have a large corpus of documents indexed with Whoosh, and I am trying to retrieve the titles (an indexed, stored field) together with their associated document numbers.

How can I retrieve both the document number and the title for each document in the index?

Background: I indexed my corpus from a pandas df like this:

import os
from whoosh import index
from whoosh.fields import Schema, TEXT

schema = Schema(content=TEXT(stored=True),
                abstract=TEXT(stored=True),
                title=TEXT(stored=True))  # create whoosh schema

if not os.path.exists("indexdir"):
    os.mkdir("indexdir")  # create index location

ix = index.create_in("indexdir", schema)  # create index (returns an open index)
writer = ix.writer()  # writer object for adding documents

# df is the pandas DataFrame holding the preprocessed corpus.
# The row label is named "_" so it does not shadow the whoosh "index" module.
for _, row in df.iterrows():
    writer.add_document(title=row["new_title"], content=row["new_content"], abstract=row["new_abstract"])  # index documents

writer.commit()  # write the documents to the index and release the writer
Pete
