7

On an AWS EMR cluster, I'm trying to write a query result to Parquet using PySpark, but I get the following error:

Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
    at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
    at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
    ... 10 more

I've read that this can happen when a column contains only null values, but after checking the counts for every column that is not the case: none of the columns is completely empty. When I write the same results to a text file instead of Parquet, everything works fine.

Any clue what could trigger this error? Here are all the data types used in this table. There are 51 columns in total.

'array<bigint>',
'array<char(50)>',
'array<smallint>',
'array<string>',
'array<varchar(100)>',
'array<varchar(50)>',
'bigint',
'char(16)',
'char(20)',
'char(4)',
'int',
'string',
'timestamp',
'varchar(255)',
'varchar(50)',
'varchar(87)'
Shinagan
  • looks like you have empty arrays (`[]`), try to replace them with `null` – shuvalov Jan 10 '20 at 06:50
  • If a column has a mix of `null` values and `[]` it could appear as an empty column? That could make sense, I'll try – Shinagan Jan 10 '20 at 12:29
  • make sure that the upstream job which generated the parquet and the current job which is reading it are using the same Parquet version – chendu Jan 13 '20 at 07:41
  • @shuvalov that was the right answer! – Shinagan Jan 14 '20 at 18:16
  • One other workaround, provided you can control the file format, is to use [`ORC`](https://orc.apache.org/) instead of Parquet; empty arrays are fine there, and it's equally well supported by big data tools. – botchniaque Oct 15 '20 at 20:31

3 Answers

18

Turns out this Parquet write path does not support empty arrays. The error is triggered if there are one or more empty arrays (of any element type) anywhere in the table.

One workaround is to convert the empty arrays to NULL values before writing.
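
A minimal PySpark sketch of that workaround, assuming the result being written is a DataFrame named `df` (the name is a placeholder, not from the original job):

    from pyspark.sql import functions as F

    # Find every array-typed column in the schema
    array_cols = [f.name for f in df.schema.fields
                  if f.dataType.typeName() == "array"]

    # Replace empty arrays ([]) with NULL so the Parquet writer accepts the rows
    for c in array_cols:
        df = df.withColumn(
            c,
            F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c)),
        )

Note that `F.size` returns 0 only for a genuinely empty array (and -1 for NULL), so existing NULLs pass through the `otherwise` branch untouched.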

Shinagan
0

It looks like you are using one of Spark's Hive write paths (org.apache.hadoop.hive.ql.io.parquet.write). I was able to work around this issue by writing directly to Parquet with Spark's native writer instead, then adding the resulting partitions to whatever Hive table needs them.

# Write with Spark's native Parquet writer instead of the Hive write path
df.write.parquet(your_path)

# Then register the newly written data with the Hive table
# (your_table, your_path, and partition_spec are placeholders)
spark.sql(f"""
    ALTER TABLE {your_table}
    ADD PARTITION (partition_spec) LOCATION '{your_path}'
""")
0

As Shinagan wrote, you can check whether an array is empty and set it to NULL.

You can do this with the cardinality function:

case when cardinality(array_x) = 0 then null else array_x end
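
For example, wrapped in a Spark SQL query (`my_table` and `array_x` are placeholder names for illustration):

    spark.sql("""
        SELECT
            CASE WHEN cardinality(array_x) = 0 THEN NULL ELSE array_x END AS array_x
        FROM my_table
    """)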
HagaiA