
I'm working in a concurrent environment where an index being built by a Spark job may receive updates for the same document id both from the job itself and from other sources. It is assumed that updates from other sources are fresher, so the Spark job needs to silently ignore documents that already exist while creating all other documents. This is very close to indexing with op_type: create, but the latter throws an exception that is not passed to my error handler. The following block of code:

          .rdd
          .repartition(getTasks(configurationManager))
          .saveJsonToEs(
            s"$indexName/_doc",
            Map(
              "es.mapping.id" -> MenuItemDocument.ID_FIELD,
              "es.write.operation" -> "create",
              "es.write.rest.error.handler.bulkErrorHandler" ->
                "<some package>.IgnoreExistsBulkWriteErrorHandler",
              "es.write.rest.error.handlers" -> "bulkErrorHandler"
            )
          )
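
For context, that call sits at the end of a longer chain; below is a minimal self-contained sketch of the same write, assuming a Dataset[String] of JSON documents and the org.elasticsearch.spark._ implicits. The index name, id field, partition count, and handler package are placeholders for the real values above, and connection settings such as es.nodes are assumed to come from the Spark conf:

    import org.apache.spark.sql.SparkSession
    import org.elasticsearch.spark._ // adds saveJsonToEs to RDD[String]

    object CreateOnlyWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("create-only-write-sketch").getOrCreate()
        import spark.implicits._

        // Placeholder JSON documents; in the real job these come from the pipeline.
        val docs = Seq(
          """{"id": "1", "name": "espresso"}""",
          """{"id": "2", "name": "latte"}"""
        ).toDS()

        docs
          .rdd
          .repartition(4) // stands in for getTasks(configurationManager)
          .saveJsonToEs(
            "menu-items/_doc", // stands in for s"$indexName/_doc"
            Map(
              "es.mapping.id" -> "id", // stands in for MenuItemDocument.ID_FIELD
              "es.write.operation" -> "create",
              "es.write.rest.error.handlers" -> "bulkErrorHandler",
              "es.write.rest.error.handler.bulkErrorHandler" ->
                "com.example.IgnoreExistsBulkWriteErrorHandler" // hypothetical package
            )
          )

        spark.stop()
      }
    }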

The error handler registered there as bulkErrorHandler has gone through several variations, but currently looks like this:

// Imports (assuming the standard es-hadoop error-handler packages and scala-logging for LazyLogging):
import com.typesafe.scalalogging.LazyLogging
import org.elasticsearch.hadoop.handler.HandlerResult
import org.elasticsearch.hadoop.rest.bulk.handler.{BulkWriteErrorHandler, BulkWriteFailure, DelayableErrorCollector}

class IgnoreExistsBulkWriteErrorHandler extends BulkWriteErrorHandler with LazyLogging {
  override def onError(entry: BulkWriteFailure, collector: DelayableErrorCollector[Array[Byte]]): HandlerResult = {
    logger.info("Encountered exception:", entry.getException)
    // Skip documents that already exist; abort the write on any other failure.
    if (entry.getException.getMessage.contains("version_conflict_engine_exception")) {
      logger.info("Encountered document already present in index, skipping")
      HandlerResult.HANDLED
    } else {
      HandlerResult.ABORT
    }
  }
}

(I was, of course, checking for org.elasticsearch.index.engine.VersionConflictEngineException in getException().getCause() at first, but that didn't work.)

emits this in the log:

org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [186/1000]. Error sample (first [5] error messages):
    org.elasticsearch.hadoop.rest.EsHadoopRemoteException: version_conflict_engine_exception: [_doc][12]: version conflict, document already exists (current version [1])

(I assume my error handler is not being called at all)

and terminates my whole Spark job. What is the correct way to achieve my desired result?
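
For completeness, once the handler is actually being invoked, one more variation I'd like to try keys off the HTTP status of the failed bulk item instead of the exception message text. This is only a sketch: it assumes BulkWriteFailure exposes the item's status via getResponseCode (as the connector's documented conflict-ignoring example does) and that a version conflict comes back as 409:

    import com.typesafe.scalalogging.LazyLogging
    import org.elasticsearch.hadoop.handler.HandlerResult
    import org.elasticsearch.hadoop.rest.bulk.handler.{BulkWriteErrorHandler, BulkWriteFailure, DelayableErrorCollector}

    class IgnoreConflictStatusBulkWriteErrorHandler extends BulkWriteErrorHandler with LazyLogging {
      override def onError(entry: BulkWriteFailure, collector: DelayableErrorCollector[Array[Byte]]): HandlerResult = {
        // 409 Conflict is what a create on an existing document is assumed to return.
        if (entry.getResponseCode == 409) {
          logger.info("Document already present in index, skipping")
          HandlerResult.HANDLED
        } else {
          HandlerResult.ABORT
        }
      }
    }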

  • Is it necessary, from a business perspective, to skip updates of documents that already exist? I'm not a Spark expert, but with the plain REST API you can do an upsert (roughly as sketched after these comments): insert (create) documents if they don't exist and update them if they are already present. That way you wouldn't have to care about skipping updates. But this is only a solution if you can accept updates from a data/consistency perspective. – apt-get_install_skill Jan 07 '22 at 12:05
  • From a business perspective we can run into race conditions, stale data, and manually launching the update pipeline for particular documents. For this particular task we've decided we can live with that for now, but I want to know how this should be done in general. I've also tried upserts in bulk with a no-op script, but ran into other issues (probably version checks, I don't remember right now), so I still don't have a solution at hand. – Etki Jan 08 '22 at 19:58
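
For reference, the upsert route suggested in the first comment, expressed through the connector rather than the plain REST API, would look roughly like the sketch below. It reuses the placeholder names from the sketch further up and assumes accepting updates is fine, i.e. documents that already exist get overwritten by the job's data instead of being skipped:

    docs
      .rdd
      .saveJsonToEs(
        "menu-items/_doc",
        Map(
          "es.mapping.id" -> "id",
          "es.write.operation" -> "upsert" // update existing documents, create missing ones
        )
      )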

0 Answers