
I am trying to use Whoosh to index a large corpus (roughly 25 million academic abstracts and titles). I marked the "abstract" field with vector=True because I need to compute high-scoring key terms from the abstracts for similarity IR.

However, after about 4 million entries, indexing crashed with the following error:

Traceback (most recent call last):
  File "...", line 256, in <module>
    ...
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/writing.py", line 771, in add_document
    perdocwriter.add_vector_items(fieldname, field, vitems)
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/codec/whoosh3.py", line 244, in add_vector_items
    self.add_column_value(vecfield, VECTOR_COLUMN, offset)
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/codec/base.py", line 821, in add_column_value
    self._get_column(fieldname).add(self._docnum, value)
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/columns.py", line 678, in add
    self._dbfile.write(self._pack(v))
struct.error: 'I' format requires 0 <= number <= 4294967295

Schema:

from whoosh.fields import Schema, TEXT, ID, KEYWORD, STORED

# my_analyzer is a custom analyzer defined elsewhere (not shown here)
schema = Schema(
    title=TEXT(stored=False, phrase=False, field_boost=2.0, analyzer=my_analyzer, vector=True),
    abstract=TEXT(stored=False, phrase=False, analyzer=my_analyzer, vector=True),
    pmid=ID(stored=True),
    mesh_set=KEYWORD(stored=True, scorable=True),
    stored_title=STORED,
    stored_abstract=STORED)

The index folder currently weighs around 45 GB. What exactly is the issue here? Is Whoosh simply not built to handle this amount of data?

polm23
  • I would post your question on the Whoosh mailing list (https://groups.google.com/forum/#!forum/whoosh). There is a post there from somebody claiming to index 13M documents, so maybe there is a workaround for the 32-bit issue? – Tomasz Swider Dec 27 '18 at 07:14

1 Answer


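It looks like the value being written here is stored as a 32-bit unsigned int, so it cannot exceed 4,294,967,295 (about 4.3 billion). Judging by the traceback, that value is the vector offset passed to add_column_value rather than a document count, which would explain why the limit was hit after only about 4 million documents.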
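To make the limit in the error message concrete, here is a minimal sketch using only Python's standard struct module, not Whoosh itself. The "<I" format string and the helper name fits_in_uint32 are assumptions for illustration; the traceback only tells us that Whoosh packs the value with some 'I' (unsigned 32-bit) format.

```python
import struct

# Upper bound named in the error message: 2**32 - 1.
MAX_UINT32 = 2**32 - 1


def fits_in_uint32(value):
    """Return True if value can be packed as an unsigned 32-bit int.

    Hypothetical helper for illustration; "<I" is an assumed format
    string, not necessarily the one Whoosh uses internally.
    """
    try:
        struct.pack("<I", value)
        return True
    except struct.error:
        # Raised when value is negative or larger than MAX_UINT32.
        return False


print(fits_in_uint32(MAX_UINT32))      # largest value that still packs
print(fits_in_uint32(MAX_UINT32 + 1))  # one past the limit
```

Any offset or counter stored this way fails as soon as it passes MAX_UINT32, regardless of how many documents produced it.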

Based on this issue in the official Whoosh repository, simply changing the size of that field causes problems elsewhere, so it can't be solved trivially.

Since Whoosh is not actively maintained, unless you want to dig into the source you should probably explore alternatives.

polm23
  • Can you think of any alternatives that still involve Whoosh? Is it possible to use multiple indexes in a way that mimics a single index, or some other trick? – Chum-Chum Scarecrows Dec 27 '18 at 07:15