This looks like a schema mismatch problem.
If you declare a field as non-nullable in your schema and then create your DataFrame with a None in that field, Spark throws ValueError: This field is not nullable, but got None.
PySpark:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
schema = ArrayType(StructType([StructField('A', IntegerType(), nullable=False)]))
# This raises ValueError: field A is declared not nullable, but the first row contains None.
df = spark.createDataFrame([[[None]], [[2]]], schema=schema)
df.show()
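For contrast, here is a minimal sketch (the names nullable_schema and df_ok are just illustrative) showing that the exact same data is accepted once the field is declared nullable:
nullable_schema = ArrayType(StructType([StructField('A', IntegerType(), nullable=True)]))
df_ok = spark.createDataFrame([[[None]], [[2]]], schema=nullable_schema)
df_ok.show()  # the first row simply holds a null for A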
But that is not the case if you use a udf.
With the same schema, if you run the transformation through a udf, Spark won't throw a ValueError even when the udf returns a None or a value of the wrong type. That is where the data/schema mismatch creeps in.
For example:
df = spark.createDataFrame([[[1]], [[2]]], schema=schema)
def throw_none():
    def _throw_none(x):
        if x[0][0] == 1:
            return [['I AM ONE']]
        else:
            return x
    return udf(_throw_none, schema)
# Since column A only accepts IntegerType, Spark silently converts the
# string 'I AM ONE' in the first row to null instead of raising a
# ValueError. This is where the data/schema type mismatch happens!
df = df.select(throw_none()(col("value")).name('value'))
df.show()

Then the following Parquet write and read throws a parquet.io.ParquetDecodingException, because the written file now contains a null in a field the schema declares non-nullable:
df.write.parquet("tmp")
spark.read.parquet("tmp").collect()
So be very careful with null values when you use a udf: make sure the udf returns exactly the data type declared in the schema. And unless it is really necessary, don't set nullable=False in your StructField; setting nullable=True avoids this whole problem.
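Putting both fixes together, here is a minimal sketch (the names safe_schema, mark_one, and the path tmp_safe are just illustrative): declare the field nullable and return the declared type from the udf, and the Parquet round trip works:
safe_schema = ArrayType(StructType([StructField('A', IntegerType(), nullable=True)]))
df = spark.createDataFrame([[[1]], [[2]]], schema=safe_schema)

def mark_one():
    def _mark_one(x):
        if x[0][0] == 1:
            return [[-1]]  # return an int, matching the declared IntegerType
        else:
            return x
    return udf(_mark_one, safe_schema)

df = df.select(mark_one()(col("value")).name('value'))
df.write.parquet("tmp_safe")
spark.read.parquet("tmp_safe").collect()  # reads back cleanly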