
I'm using Spark to write my JSON data to S3. However, I keep getting the error below. We are using Apache Hudi for updates. This only happens for some of the data; everything else works fine.

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file s3a://<path to parquet file>
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at com.uber.hoodie.func.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:45)
    at com.uber.hoodie.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:44)
    at com.uber.hoodie.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:94)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    ... 4 more
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldLongConverter

I am unable to understand what's going wrong. Following a few threads, I set --conf "spark.sql.parquet.writeLegacyFormat=true" in my Spark configuration, but even this didn't help.
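For reference, the conf was applied roughly like this (the builder setup and app name here are just an illustrative sketch, not the exact job; the option can equally be passed on spark-submit):

    import org.apache.spark.sql.SparkSession

    // Illustrative only: the same setting can be passed on the command line as
    //   spark-submit --conf "spark.sql.parquet.writeLegacyFormat=true" ...
    // or set directly on the SparkSession builder, as below.
    val spark = SparkSession.builder()
      .appName("hudi-json-to-s3")   // hypothetical app name
      .config("spark.sql.parquet.writeLegacyFormat", "true")
      .getOrCreate()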

– byte_array

1 Answer


Found out the issue: there was a schema mismatch between the existing Parquet files and the incoming data. One of the fields was a string in the existing Parquet schema, but it was being sent as a long in the newer chunk of data.
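For illustration, a minimal sketch of making the types consistent by casting the mismatched field in the incoming data before the Hudi write (the DataFrame, column name, and path below are hypothetical, not taken from the original job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("schema-align").getOrCreate()

    // Hypothetical names: incomingDf is the new chunk of JSON data; "event_id"
    // stands in for the field that arrives as a long but is a string in the
    // existing Parquet files written by Hudi.
    val incomingDf = spark.read.json("s3a://<path to incoming json>")

    // Cast the mismatched field so every file shares one type, which avoids the
    // FieldLongConverter UnsupportedOperationException when Hudi merges records.
    val alignedDf = incomingDf
      .withColumn("event_id", col("event_id").cast("string"))

    // alignedDf is then written through the usual Hudi upsert path.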

– mythic
  • So how did you solve it? Can you give some steps on what you did to solve this issue? – timedacorn Apr 18 '23 at 09:17
  • As far as I recollect, I cast the data to string in the new data and made it compatible with the existing data. Nothing complex! Just make the data types consistent across all files. Hope this helps! – mythic May 26 '23 at 09:07