
I am trying to reindex an existing index into a new one with different analyzers.

# request_timeout is in seconds (36000 s = 10 h)
es.reindex(
    {"source": {"index": "index"}, "dest": {"index": settings.ES_INDEX}},
    wait_for_completion=True,
    request_timeout=36000,
)

The data set is fairly large: more than 30,000,000 documents. After running for 20 hours, the call failed with this exception:

"failures":[{"index":"index_ru","type":"page","id":"rLKzd2wBwqP9nlBSLbNG","cause":{"type":"mapper_parsing_exception","reason":"failed to parse [raw_html]","caused_by":{"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0x20\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@130873db; line: 1, column: 769928]"}},"status":400}]}

Only 30% of the documents had been processed. Apparently one of the documents has an encoding problem, and that single failure stopped the whole reindex. Is there a way to ignore such failures and keep going? I don't mind if a few documents are not reindexed.
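A possible workaround, as a minimal sketch rather than a confirmed fix: copy the documents client-side with the `scan` and `streaming_bulk` helpers from `elasticsearch.helpers`, passing `raise_on_error=False` so that a rejected document is reported and skipped instead of aborting the run. The function names `to_bulk_action` and `reindex_skipping_failures` are hypothetical, `es` is assumed to be an existing `Elasticsearch` client, and whether the broken `raw_html` value survives the round trip through the client has not been verified here.

```python
def to_bulk_action(hit, dest_index):
    """Turn one scan/scroll hit into a bulk 'index' action for dest_index."""
    action = {
        "_op_type": "index",
        "_index": dest_index,
        "_id": hit["_id"],
        "_source": hit["_source"],
    }
    # Mapping types only exist on older clusters (the error above shows type "page").
    if "_type" in hit:
        action["_type"] = hit["_type"]
    return action


def reindex_skipping_failures(es, source_index, dest_index, chunk_size=500):
    """Copy documents client-side and collect failures instead of stopping."""
    # Imported lazily so to_bulk_action can be used without the client library.
    from elasticsearch import helpers

    actions = (
        to_bulk_action(hit, dest_index)
        for hit in helpers.scan(es, index=source_index)
    )
    failed = []
    for ok, item in helpers.streaming_bulk(
        es,
        actions,
        chunk_size=chunk_size,
        raise_on_error=False,  # yield per-document errors instead of raising
    ):
        if not ok:
            failed.append(item)
    return failed
```

Under these assumptions it would be called as `failed = reindex_skipping_failures(es, "index", settings.ES_INDEX)`, and `failed` could then be logged or inspected. This trades the server-side reindex for a client-side copy, so it may be slower.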

Thank you.

marc_s
user1354033
