I'm working in a concurrent environment where an index being built by a Spark job may receive updates for the same document id both from the job itself and from other sources. It is assumed that updates from other sources are fresher, so the Spark job needs to silently ignore documents that already exist and create all the others. This is very close to indexing with op_type: create, but the latter throws an exception that is not passed to my error handler. The following block of code:
.rdd
.repartition(getTasks(configurationManager))
.saveJsonToEs(
  s"$indexName/_doc",
  Map(
    "es.mapping.id" -> MenuItemDocument.ID_FIELD,
    "es.write.operation" -> "create",
    "es.write.rest.error.handler.bulkErrorHandler" ->
      "<some package>.IgnoreExistsBulkWriteErrorHandler",
    "es.write.rest.error.handlers" -> "bulkErrorHandler"
  )
)
where the error handler has survived several variations, but currently is:
import com.typesafe.scalalogging.LazyLogging
import org.elasticsearch.hadoop.handler.HandlerResult
import org.elasticsearch.hadoop.rest.bulk.handler.{BulkWriteErrorHandler, BulkWriteFailure, DelayableErrorCollector}

class IgnoreExistsBulkWriteErrorHandler extends BulkWriteErrorHandler with LazyLogging {
  override def onError(entry: BulkWriteFailure, collector: DelayableErrorCollector[Array[Byte]]): HandlerResult = {
    logger.info("Encountered exception:", entry.getException)
    if (entry.getException.getMessage.contains("version_conflict_engine_exception")) {
      logger.info("Encountered document already present in index, skipping")
      HandlerResult.HANDLED
    } else {
      HandlerResult.ABORT
    }
  }
}
(I was, of course, first checking for org.elasticsearch.index.engine.VersionConflictEngineException in getException().getCause(), roughly as in the sketch below, but that didn't work either.)
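For reference, a minimal sketch of that earlier variant, reconstructed from memory; the handler name IgnoreExistsByCauseErrorHandler and the exact cause-chain walk are my reconstruction rather than the original code, and it assumes the Elasticsearch server classes are on the classpath for VersionConflictEngineException:

// Reconstructed sketch of the earlier variant, not the exact original code.
import org.elasticsearch.hadoop.handler.HandlerResult
import org.elasticsearch.hadoop.rest.bulk.handler.{BulkWriteErrorHandler, BulkWriteFailure, DelayableErrorCollector}
import org.elasticsearch.index.engine.VersionConflictEngineException

import scala.annotation.tailrec

class IgnoreExistsByCauseErrorHandler extends BulkWriteErrorHandler {
  // Walk the cause chain looking for a VersionConflictEngineException.
  @tailrec
  private def isVersionConflict(t: Throwable): Boolean = t match {
    case null => false
    case _: VersionConflictEngineException => true
    case other => isVersionConflict(other.getCause)
  }

  override def onError(entry: BulkWriteFailure, collector: DelayableErrorCollector[Array[Byte]]): HandlerResult =
    if (isVersionConflict(entry.getException)) HandlerResult.HANDLED
    else HandlerResult.ABORT
}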
In every variation, the job emits this in the log:
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [186/1000]. Error sample (first [5] error messages):
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: version_conflict_engine_exception: [_doc][12]: version conflict, document already exists (current version [1])
(I assume my error handler is not called at all) and then terminates the whole Spark job. What is the correct way to achieve my desired result?