
I am trying to add documents to Solr (5.3.2) with pysolr. I generate a simple JSON object containing a large text and some metadata (date, author...), then I try to add it to Solr. My issue is that beyond a certain size, Solr fails to index the document and returns the following error:

Solr responded with an error (HTTP 400): [Reason: Exception writing document id e2699f18-ab5f-47f6-a450-60db5621879c to the index; possible analysis error.]
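
Roughly, the indexing code looks like this (the core URL and the metadata values are just placeholders):

import uuid
import pysolr

# Hypothetical core URL -- replace with the actual Solr core
solr = pysolr.Solr('http://localhost:8983/solr/mycore', timeout=10)

default_obj = {
    'id': str(uuid.uuid4()),
    'author': 'some author',
    'date': '2015-11-01T00:00:00Z',
    'content': content,  # the large text field
}

solr.add([default_obj])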

There really seems to be a hardcoded limit somewhere on the field length, but I can't find it.

By playing around in Python I found out that:

default_obj['content'] = content[:13260]

will work fine while

default_obj['content'] = content[:13261]

will cause an error.

The content field is defined in my schema.xml as a normal type="text_general" field.

Edit: here are the schema.xml definitions:

<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>


<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I have tried adding the content manually through Solr's web admin interface, but I get the exact same problem.

user2969402

1 Answer


Most likely you are hitting the hard limit on the size of a single token, which is 32766. You can't change this limit, but you can change the behavior and use a tokenizer that splits the original text of the document into separate tokens.

For example, you could try WhitespaceTokenizer, which will split your big field into multiple terms/tokens, and your documents will be indexed safely.
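
As a sketch, a field type along these lines would tokenize on whitespace (the name text_ws is arbitrary; keep or adjust the filters you already have):

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>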

Mysterion