I am trying to reindex an existing index into a new one with different analyzers:
es.reindex({
"source": {"index": "index"},
"dest": {"index": settings.ES_INDEX}
}, wait_for_completion=True, request_timeout=36000)
The dataset is large: more than 30,000,000 documents. After running for 20 hours, the reindex stopped with this failure:
"failures":[{"index":"index_ru","type":"page","id":"rLKzd2wBwqP9nlBSLbNG","cause":{"type":"mapper_parsing_exception","reason":"failed to parse [raw_html]","caused_by":{"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0x20\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@130873db; line: 1, column: 769928]"}},"status":400}]}
Only 30% of the documents had been processed. Apparently one of the documents has an encoding problem (invalid UTF-8 in the raw_html field), and that single failure stopped the whole reindex. Is there a way to ignore such failures and keep going? I don't mind if a few documents are not reindexed.
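To make "ignoring such failures" concrete, here is a minimal, untested sketch of a client-side alternative to the server-side reindex call. It assumes the `scan` and `streaming_bulk` helpers from the `elasticsearch` Python package; the function names `to_actions`, `split_results`, and `reindex_skipping_failures` are made up for illustration. Copying `_type` is also an assumption, based on the "page" type in the error above. Whether the malformed document can even be read back by the client is not verified here.

```python
def to_actions(hits, dest_index, copy_type=True):
    """Turn search hits into bulk "index" actions aimed at dest_index."""
    for hit in hits:
        action = {
            "_op_type": "index",
            "_index": dest_index,
            "_id": hit["_id"],
            "_source": hit["_source"],
        }
        # Assumption: the cluster still uses mapping types (e.g. "page").
        if copy_type and "_type" in hit:
            action["_type"] = hit["_type"]
        yield action


def split_results(results):
    """Count successes and collect failed items from (ok, item) pairs."""
    copied, failures = 0, []
    for ok, item in results:
        if ok:
            copied += 1
        else:
            failures.append(item)
    return copied, failures


def reindex_skipping_failures(es, source_index, dest_index):
    """Hypothetical wrapper: copy documents, recording failures instead of stopping."""
    # Imported here so the pure helpers above work without the package installed.
    from elasticsearch import helpers

    hits = helpers.scan(es, index=source_index,
                        query={"query": {"match_all": {}}})
    # raise_on_error=False yields (False, item) for a rejected document
    # instead of raising, so one bad document does not abort the run.
    results = helpers.streaming_bulk(es, to_actions(hits, dest_index),
                                     raise_on_error=False)
    return split_results(results)
```

With this shape, a call such as `copied, failures = reindex_skipping_failures(es, "index", settings.ES_INDEX)` would return the rejected items so they can be logged and inspected later. It moves the data through the client, so it is likely slower than the server-side reindex.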
Thank you.