Until recently Parquet did not support null values - a questionable premise. In fact a recent version did finally add that support:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
However it will be a long time before Spark supports that new Parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:
https://issues.apache.org/jira/browse/SPARK-10943
So what are folks doing with regard to null column values today when writing out DataFrames to Parquet? I can only think of very ugly hacks like writing empty strings and, well, I have no idea what to do with numerical values to indicate null - short of putting some sentinel value in and having my code check for it (which is inconvenient and bug-prone).
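
To make the sentinel idea concrete, here is a minimal sketch in plain Python (no Spark or Parquet involved). The sentinel value and the helper names `encode` and `decode` are made up for illustration only. It shows the kind of check I would have to carry around, and why it seems bug-prone to me:

```python
from typing import List, Optional

# Hypothetical sentinel standing in for null in an integer column.
SENTINEL = -999_999

def encode(values: List[Optional[int]]) -> List[int]:
    """Replace None with the sentinel before writing the column out."""
    return [SENTINEL if v is None else v for v in values]

def decode(values: List[int]) -> List[Optional[int]]:
    """Map the sentinel back to None after reading the column in."""
    return [None if v == SENTINEL else v for v in values]

# The round trip works as long as no real value equals the sentinel.
original = [1, None, 3]
round_tripped = decode(encode(original))

# A genuine value that happens to equal the sentinel is silently lost.
collision = decode(encode([SENTINEL, 5]))
```

Every reader of the data has to know about the sentinel and remember to decode it, and any legitimate value that collides with it is turned into a null without warning.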