I'm new to PySpark and Elasticsearch. All I'm trying to do is read an index from OpenSearch (v7.10.2) and dump it to S3 as Parquet, using PySpark (v3.2.1) running on Databricks.
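The write itself should be trivial once the read works; something like this (the bucket path is just a placeholder):

df.write.mode("overwrite").parquet("s3://my-bucket/my_index_dump/")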

I manage to load the schema successfully from the index mapping, like so:

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .options(**es_conf)
    .option("mode", "PERMISSIVE")
    .load("index_name")
)

df.printSchema() # that works
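For completeness, es_conf is roughly the following (host and credentials are placeholders; the keys are standard elasticsearch-hadoop connector settings):

es_conf = {
    "es.nodes": "https://my-opensearch-domain.example.com",  # placeholder host
    "es.port": "443",
    "es.nodes.wan.only": "true",  # typical for a cloud-hosted cluster
    "es.net.ssl": "true",
    "es.net.http.auth.user": "...",  # placeholder credentials
    "es.net.http.auth.pass": "...",
}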

Although all fields seem to be nullable, any further action fails, whether writing Parquet or something as simple as df.show(). Instead, I get the following error:

Position for 'some_nested_field.something.some_id' not found in row; typically this is caused by a mapping inconsistency

I'm guessing this happens because some of the docs lack fields that are present in the mapping (and, accordingly, in the schema). But all of those fields are nullable, and I don't mind having them unpopulated.

So anyway, I don't get it. I don't care about any "inconsistencies". I'd like to tell Spark something like: if a field from the mapping/schema isn't populated in some doc, just fill it with null.
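The closest things I've found in the elasticsearch-hadoop configuration docs are es.read.field.as.array.include (for paths that are sometimes objects and sometimes arrays of objects, which I suspect my nested field might be) and es.read.field.exclude (to skip a subtree entirely). A sketch of how I'd apply them, with the paths being illustrative, taken from the error above:

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .options(**es_conf)
    # Hint that this nested path arrives as an array of objects;
    # object/array shape mismatches can cause "position not found" errors.
    .option("es.read.field.as.array.include", "some_nested_field.something")
    # Or, as a blunt fallback, exclude the offending subtree from the read:
    # .option("es.read.field.exclude", "some_nested_field.something.some_id")
    .load("index_name")
)

I haven't confirmed either actually applies to my case, though.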

I tried permissive mode, passing the schema externally instead of loading it from the mapping, and upgrading to the latest jars; nothing has worked for me so far.
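For reference, the external-schema attempt looked roughly like this (the field names are placeholders standing in for my actual mapping, everything nullable):

from pyspark.sql.types import StructType, StructField, StringType

# Simplified stand-in for the real mapping-derived schema.
schema = StructType([
    StructField("some_nested_field", StructType([
        StructField("something", StructType([
            StructField("some_id", StringType(), True),
        ]), True),
    ]), True),
])

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .options(**es_conf)
    .schema(schema)
    .load("index_name")
)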

Is there anything I can do to just dump the Elasticsearch index, with its existing inconsistencies, to Parquet?

Full error log:

Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.65.9.255 executor 0): org.elasticsearch.hadoop.rest.EsHadoopParsingException: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'nested_field.some_other_struct_field.and_then_another.int_value' not found in row; typically this is caused by a mapping inconsistency
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:520)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:298)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:262)
    at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:313)
    at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:94)
    at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:66)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
    at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
    at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.scheduler.Task.run(Task.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'nested_field.some_other_struct_field.and_then_another.int_value' not found in row; typically this is caused by a mapping inconsistency
    at org.elasticsearch.spark.sql.RowValueReader.addToBuffer(RowValueReader.scala:60)
    at org.elasticsearch.spark.sql.RowValueReader.addToBuffer$(RowValueReader.scala:55)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:32)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:118)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readListItem(ScrollReader.java:929)
    at org.elasticsearch.hadoop.serialization.ScrollReader.list(ScrollReader.java:981)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:882)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:895)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readListItem(ScrollReader.java:929)
    at org.elasticsearch.hadoop.serialization.ScrollReader.list(ScrollReader.java:981)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:882)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:895)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:608)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:432)
    ... 29 more

Driver stacktrace: