I have a few indexes located in the same directory as part of a storage object.

from whoosh.filedb.filestore import FileStorage
storage = FileStorage("../indexdir")
ix_1 = storage.open_index(indexname='ind_1')
ix_2 = storage.open_index(indexname='ind_2')

I want to be able to search a query through BOTH indexes at the same time and not just one of them. Is it possible to do that without having a single index? I can append the results of each index one after the other but I can't figure out how to sort them or if that is even possible.

Yasmina

1 Answer

Updated on March 8th, 2021

Based on comments

Loading modules

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

import pandas as pd

Defining search term

TERM = "second"

Creating indices for this example

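import os

# create_in() expects the target directories to exist already
for dirname in ("indexdir1", "indexdir2"):
    os.makedirs(dirname, exist_ok=True)

# ix1 stores title and path; ix2 stores path and content
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
ix1 = create_in("indexdir1", schema)

schema = Schema(title=TEXT, path=ID(stored=True), content=TEXT(stored=True))
ix2 = create_in("indexdir2", schema)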

writer1 = ix1.writer()
writer2 = ix2.writer()

for writer in [writer1, writer2]:
    writer.add_document(title=u"First document", path=u"/a",
                        content=u"This is the first document we've added! Not the second")
    writer.add_document(title=u"Second document", path=u"/b",
                        content=u"The second one is even more interesting than the first one!")
    writer.add_document(title=u"Third document", path=u"/c",
                        content=u"You know... This is also different from the second one!")
    writer.commit()

Searching by the term

results = []

# Search ix1 by title
parser = QueryParser("title", ix1.schema)
query = parser.parse(TERM)
results += list(ix1.searcher().search(query))

# Search ix2 by content, parsing against ix2's own schema
parser = QueryParser("content", ix2.schema)
query = parser.parse(TERM)
results += list(ix2.searcher().search(query))

So far, the results are:

print(results)

[<Hit {'path': '/b', 'title': 'Second document'}>, <Hit {'content': "This is the first document we've added! Not the second", 'path': '/a'}>, <Hit {'content': 'You know... This is also different from the second one!', 'path': '/c'}>, <Hit {'content': 'The second one is even more interesting than the first one!', 'path': '/b'}>]

Although the results are together, they are not ordered by anything.
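One way to impose an order is to sort the combined hits by score. The sketch below uses plain dicts as stand-ins for hits; the score values are invented for illustration. With real Whoosh hits you would, I believe, read the value from each hit's `score` attribute. Keep in mind that each index computes scores from its own statistics, so scores from different indexes are only roughly comparable.

```python
# Stand-in hits: the "score" values are made up for this illustration.
hits_ix1 = [{"path": "/b", "score": 2.1, "index": "ix1"}]
hits_ix2 = [
    {"path": "/b", "score": 1.7, "index": "ix2"},
    {"path": "/a", "score": 1.4, "index": "ix2"},
    {"path": "/c", "score": 1.3, "index": "ix2"},
]


def merge_by_score(*hit_lists):
    """Concatenate several hit lists and sort them by descending score."""
    combined = [hit for hits in hit_lists for hit in hits]
    return sorted(combined, key=lambda hit: hit["score"], reverse=True)


merged = merge_by_score(hits_ix1, hits_ix2)
for hit in merged:
    print(hit["path"], hit["index"], hit["score"])
```

The same path can appear twice (once per index); the steps that follow deal with that by grouping on "path".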

Transforming it into a dictionary data structure

result = {"path": [], "title": [], "content": []}
fields = ["path", "title", "content"]

for dct in results:
    for field in fields:
        result[field].append(dct.get(field, None))

Creating a pandas dataframe with the results

df = pd.DataFrame(result)
print(df)

The dataframe is:

  path            title                                            content
0   /b  Second document                                               None
1   /a             None  This is the first document we've added! Not th...
2   /c             None  You know... This is also different from the se...
3   /b             None  The second one is even more interesting than t...

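Note: a None means the field is not stored in the index the hit came from (ix1 stores only the title, ix2 only the content), so here it also shows which index produced the match.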

Grouping the results by "path" and counting results

groups = df.groupby(["path"]).count()

The groups

      title  content
path                
/a        0        1
/b        1        1
/c        0        1

Creating a score column

groups["score"] = groups["title"] + groups["content"]

With the score column

      title  content  score
path                       
/a        0        1      1
/b        1        1      2
/c        0        1      1

Sorting results by score

print(groups.sort_values("score", ascending=False))
      title  content  score
path                       
/b        1        1      2
/a        0        1      1
/c        0        1      1

Note: /b rises to the top because it matched in both indexes. This score only counts matches per path; it is not a relevance score, so ties such as /a and /c have no meaningful order.

Finally, you can iterate through the dataframe and present your results.
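If pandas feels heavy for this, the same count-based score can be sketched with the standard library. This is only an illustration under the assumption that every hit contributes one point to its path; the dicts below copy the stored fields of the four hits printed earlier.

```python
from collections import Counter

# Stored fields of the four hits from the example, as plain dicts.
hits = [
    {"path": "/b", "title": "Second document"},
    {"path": "/a", "content": "This is the first document we've added! Not the second"},
    {"path": "/c", "content": "You know... This is also different from the second one!"},
    {"path": "/b", "content": "The second one is even more interesting than the first one!"},
]

# One point per hit: a path found in both indexes ends up with a score of 2.
scores = Counter(hit["path"] for hit in hits)

# most_common() returns (path, score) pairs sorted by descending score.
ranked = scores.most_common()
for path, score in ranked:
    print(path, score)
```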

End of update


Note: the original answer follows. It addresses only the "at once" part of the question. After the comments I updated the post, but kept this part because it might still be useful.


Why not use concurrent.futures for that?

Starting

import concurrent.futures
from whoosh.filedb.filestore import FileStorage
from whoosh.qparser import QueryParser

storage = FileStorage("../indexdir")
ix_1 = storage.open_index(indexname='ind_1')
ix_2 = storage.open_index(indexname='ind_2')

ixs = [ix_1, ix_2]

TERM = "TERM TO SEARCH"

Defining search function

def search_things(ix, term):
    query = QueryParser("content", ix.schema).parse(term)
    with ix.searcher() as searcher:
        results = searcher.search(query, terms=True)
        # Copy the stored fields out before the searcher closes
        return [hit.fields() for hit in results]

Parallelizing

# using two workers because there are two indices
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    future_to_search = {executor.submit(search_things, ix, TERM): ix for ix in ixs}

    for future in concurrent.futures.as_completed(future_to_search):
        s = future_to_search[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (s, exc))
        else:
            # data is a collection of hits, so report how many there are
            print('Search (%r) finished => %d' % (s, len(data)))

You might need to adapt it for your needs.

Paulo Marques
  • Parallelism is hardly the issue here, the problem is in the search results. This wouldn't work on search_page. Also, the sorting relevancy would be screwed up. – Samy Mar 06 '21 at 04:45
  • I think the optimal solution will somehow involve searching multiple indexes as if they were one. – Samy Mar 06 '21 at 04:47
  • Well, the question was how to search them at the same time. If there is the need for joining results, then having two indexes maybe is wrong. – Paulo Marques Mar 06 '21 at 05:49
  • Read the last part of the question: "I can append the results of each index one after the other but I can't figure out how to sort them or if that is even possible". The problem with having one monolithic index is that it doesn't scale very well. You can concurrently search multiple indexes, but that wouldn't solve the issue of having to sort them, especially if the search is paginated. – Samy Mar 06 '21 at 16:53
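
On the pagination point raised in the comments, one possible approach is to merge the per-index rankings lazily and slice out the requested page. This is a sketch, not a Whoosh feature: `merged_page` is a hypothetical helper, the hits are plain dicts with invented scores, and each input list is assumed to be sorted by descending score already. To serve page n correctly, each index would have to return at least its top n * pagelen hits.

```python
import heapq
from itertools import islice


def merged_page(ranked_lists, pagenum, pagelen=10):
    """Return page `pagenum` (1-based) of the merged ranking.

    Every list in `ranked_lists` must already be sorted by descending score.
    """
    merged = heapq.merge(*ranked_lists, key=lambda hit: hit["score"], reverse=True)
    start = (pagenum - 1) * pagelen
    return list(islice(merged, start, start + pagelen))


# Invented per-index rankings, best hit first.
ranked_ix1 = [{"path": "/b", "score": 2.1}, {"path": "/d", "score": 0.9}]
ranked_ix2 = [
    {"path": "/b", "score": 1.7},
    {"path": "/a", "score": 1.4},
    {"path": "/c", "score": 1.3},
]

page1 = merged_page([ranked_ix1, ranked_ix2], pagenum=1, pagelen=2)
page2 = merged_page([ranked_ix1, ranked_ix2], pagenum=2, pagelen=2)
page3 = merged_page([ranked_ix1, ranked_ix2], pagenum=3, pagelen=2)
print([hit["path"] for hit in page1])
```

The caveat about relevance still applies: scores computed by separate indexes are only roughly comparable, so the merged order is an approximation of what a single index would give.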