I am trying to use Whoosh to index a large corpus (roughly 25 million academic abstracts plus titles). I marked the "abstract" field with vector=True
because I need to compute high-scoring key terms from the abstracts for similarity-based IR.
However, after about 4 million entries, indexing crashed with the following error:
Traceback (most recent call last):
  File "...", line 256, in <module>
    ...
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/writing.py", line 771, in add_document
    perdocwriter.add_vector_items(fieldname, field, vitems)
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/codec/whoosh3.py", line 244, in add_vector_items
    self.add_column_value(vecfield, VECTOR_COLUMN, offset)
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/codec/base.py", line 821, in add_column_value
    self._get_column(fieldname).add(self._docnum, value)
  File "/home/nlp/*/anaconda3/envs/riken/lib/python3.6/site-packages/whoosh/columns.py", line 678, in add
    self._dbfile.write(self._pack(v))
struct.error: 'I' format requires 0 <= number <= 4294967295
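For what it's worth, the limit in the error message is just the maximum value of an unsigned 32-bit integer: the 'I' format code in Python's struct module packs values into 4 bytes, so anything above 2**32 - 1 raises exactly this error. A minimal reproduction (independent of Whoosh):

```python
import struct

UINT32_MAX = 2**32 - 1  # 4294967295, the cap reported in the traceback

# Packing the maximum value works fine:
struct.pack("I", UINT32_MAX)

# One past the limit raises the same struct.error seen during indexing:
try:
    struct.pack("I", UINT32_MAX + 1)
except struct.error as e:
    print(e)  # "'I' format requires 0 <= number <= 4294967295" (wording varies by Python version)
```

So it looks like the value Whoosh is writing here (the vector offset passed down from add_vector_items) has grown past 4294967295, i.e. past 4 GiB.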
Schema:
schema = Schema(title=TEXT(stored=False, phrase=False, field_boost=2.0, analyzer=my_analyzer, vector=True),
abstract=TEXT(stored=False, phrase=False, analyzer=my_analyzer, vector=True),
pmid=ID(stored=True),
mesh_set=KEYWORD(stored=True, scorable=True),
stored_title=STORED,
stored_abstract=STORED)
The index folder currently weighs around 45 GB. What exactly is the issue here? Is Whoosh simply not built to handle this amount of data?