I'm new to pyspark and elasticsearch. All I'm trying to do is read an index from opensearch (v7.10.2) and dump it as parquet to s3 using pyspark (v3.2.1), running on databricks.
I manage to load the schema from the index mapping successfully, like so:
df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .options(**es_conf)
    .option("mode", "PERMISSIVE")
    .load("index_name")
)
df.printSchema() # that works
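For context, es_conf is roughly the following (the endpoint and credentials here are placeholders, not my real values):

es_conf = {
    "es.nodes": "https://my-opensearch-endpoint.example.com",  # placeholder host
    "es.port": "443",
    "es.nodes.wan.only": "true",  # connecting to a managed cluster from databricks
    "es.net.ssl": "true",
    "es.net.http.auth.user": "user",  # placeholder credentials
    "es.net.http.auth.pass": "pass",
}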
Although all fields seem to be nullable, I'm having trouble taking any further action, whether it's writing parquet or doing something as simple as df.show(). Instead, I get the following error:
Position for 'some_nested_field.something.some_id' not found in row; typically this is caused by a mapping inconsistency
I'm guessing this happens because some of the docs lack fields that are part of the mapping (and, accordingly, the schema), but all of those fields are nullable and I don't mind them being unpopulated.
So anyway, I don't get it. I don't care about any "inconsistencies". I'd like to tell spark something like: if a field from the mapping / schema isn't populated in some doc, just put a null value there instead.
I've tried permissive mode, passing the schema externally instead of loading it from the mapping, and upgrading to the latest jars - nothing has worked for me so far.
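For reference, passing the schema externally looked roughly like this (the field names here are placeholders standing in for my actual mapping):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema mirroring the index mapping; every field is nullable
schema = StructType([
    StructField("some_id", StringType(), True),
    StructField("nested_field", StructType([
        StructField("int_value", IntegerType(), True),
    ]), True),
])

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .options(**es_conf)
    .schema(schema)
    .load("index_name")
)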
Is there anything I can do to just dump the elasticsearch index, existing inconsistencies and all, to parquet?
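For completeness, the write I ultimately want to get through is nothing fancy (bucket and path are placeholders):

# Placeholder S3 destination; this fails with the same error as df.show()
df.write.mode("overwrite").parquet("s3://my-bucket/path/to/dump/")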
Full error log:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.65.9.255 executor 0): org.elasticsearch.hadoop.rest.EsHadoopParsingException: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'nested_field.some_other_struct_field.and_then_another.int_value' not found in row; typically this is caused by a mapping inconsistency
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:520)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:298)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:262)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:313)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:94)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:66)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'nested_field.some_other_struct_field.and_then_another.int_value' not found in row; typically this is caused by a mapping inconsistency
at org.elasticsearch.spark.sql.RowValueReader.addToBuffer(RowValueReader.scala:60)
at org.elasticsearch.spark.sql.RowValueReader.addToBuffer$(RowValueReader.scala:55)
at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:32)
at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:118)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058)
at org.elasticsearch.hadoop.serialization.ScrollReader.readListItem(ScrollReader.java:929)
at org.elasticsearch.hadoop.serialization.ScrollReader.list(ScrollReader.java:981)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:882)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:895)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058)
at org.elasticsearch.hadoop.serialization.ScrollReader.readListItem(ScrollReader.java:929)
at org.elasticsearch.hadoop.serialization.ScrollReader.list(ScrollReader.java:981)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:882)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:895)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:608)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:432)
... 29 more
Driver stacktrace: