7

On an AWS EMR cluster, I'm trying to write a query result to Parquet using PySpark, but I get the following error:

Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
    at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
    at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
    ... 10 more

I've read that this can happen when a column contains only null values, but after checking the counts for every column that is not the case: none of the columns is completely empty. When I write the same results to a text file instead of Parquet, everything works fine.

Any clue what could trigger this error? Here are all the data types used in this table. There are 51 columns in total.

'array<bigint>',
'array<char(50)>',
'array<smallint>',
'array<string>',
'array<varchar(100)>',
'array<varchar(50)>',
'bigint',
'char(16)',
'char(20)',
'char(4)',
'int',
'string',
'timestamp',
'varchar(255)',
'varchar(50)',
'varchar(87)'
Shinagan
  • looks like you have empty arrays (`[]`), try to replace them with `null` – shuvalov Jan 10 '20 at 06:50
  • If a column has a mix of `null` values and `[]` it could appear as an empty column? That could make sense, I'll try – Shinagan Jan 10 '20 at 12:29
  • make sure that the upstream job which generated the parquet and the current job which is reading it are using the same Parquet version – chendu Jan 13 '20 at 07:41
  • @shuvalov that was the right answer! – Shinagan Jan 14 '20 at 18:16
  • One other workaround, provided you can control the file format, is to use [`ORC`](https://orc.apache.org/) instead of Parquet; empty arrays are fine there, and it's equally well supported by big data tools. – botchniaque Oct 15 '20 at 20:31

3 Answers

18

Turns out this Parquet write path does not support empty arrays. The error is triggered if there are one or more empty arrays (of any element type) anywhere in the table.

One workaround is to convert the empty arrays to NULL values before writing.
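
A minimal PySpark sketch of that workaround, assuming the result being written is a DataFrame named `df` (the name is a placeholder, not from the original job):

    from pyspark.sql import functions as F

    # Find every array-typed column in the schema
    array_cols = [f.name for f in df.schema.fields
                  if f.dataType.typeName() == "array"]

    # Replace empty arrays ([]) with NULL so the Parquet writer accepts the rows
    for c in array_cols:
        df = df.withColumn(
            c,
            F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c)),
        )

Note that `F.size` returns 0 only for a genuinely empty array (and -1 for NULL), so existing NULLs pass through the `otherwise` branch untouched.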

Shinagan
0

It looks like you are using one of Spark's Hive write paths (org.apache.hadoop.hive.ql.io.parquet.write). I was able to work around this issue by writing directly to Parquet with Spark's native writer instead, then adding the resulting partitions to whatever Hive table needs them.

# Write with Spark's native Parquet writer instead of the Hive write path
df.write.parquet(your_path)

# Then register the newly written data with the Hive table
# (your_table, your_path, and partition_spec are placeholders)
spark.sql(f"""
    ALTER TABLE {your_table}
    ADD PARTITION (partition_spec) LOCATION '{your_path}'
""")
0

As Shinagan wrote, you can check whether an array is empty and set it to NULL.

You can do this with the cardinality function:

case when cardinality(array_x) = 0 then null else array_x end
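
For example, wrapped in a Spark SQL query (`my_table` and `array_x` are placeholder names for illustration):

    spark.sql("""
        SELECT
            CASE WHEN cardinality(array_x) = 0 THEN NULL ELSE array_x END AS array_x
        FROM my_table
    """)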
HagaiA