How to handle null values when writing to parquet from Spark

Question

Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

However it will be a long time before spark supports that new parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:

https://issues.apache.org/jira/browse/SPARK-10943

So what are folks doing with regards to null column values today when writing out dataframe's to parquet ? I can only think of very ugly horrible hacks like writing empty strings and .. well .. I have no idea what to do with numerical values to indicate null - short of putting some sentinel value in and having my code check for it (which is inconvenient and bug prone).

Alper t. Turker · Accepted Answer · 2018-05-03T17:39:06.270

38

You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.

The problem is that null alone carries no type information at all

scala> spark.sql("SELECT null as comments").printSchema
root
 |-- comments: null (nullable = true)

As per comment by Michael Armbrust all you have to do is cast:

scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
|-- comments: double (nullable = true)

and the result can be safely written to Parquet.

edited May 03 '18 at 17:39

answered May 03 '18 at 17:32

Alper t. Turker

34,230
9
83
115

1

If you're using sql something like this works: create table my_table as select distinct colx, coly, cast(null as string) colz from foo where lower(bar) like '%bat%' – John Oct 28 '20 at 13:37
Is it possible to somehow cast fields of a StructType? I mean, there is a column which type is a struct which has a field which type is inferred as NullType. Is it possible to cast it as well? – Roman Puchkovskiy Aug 11 '21 at 11:02

score 14 · Answer 2 · edited Oct 17 '21 at 09:45

14

I wrote a PySpark solution for this (df is a dataframe with columns of NullType):

# get dataframe schema
my_schema = list(df.schema)

null_cols = []

# iterate over schema list to filter for NullType columns
for st in my_schema:
    if str(st.dataType) == 'NullType':
        null_cols.append(st)

# cast null type columns to string (or whatever you'd like)
for ncol in null_cols:
    mycolname = str(ncol.name)
    df = df \
        .withColumn(mycolname, df[mycolname].cast('string'))

edited Oct 17 '21 at 09:45

ZygD

22,092
39
79
102

answered Oct 28 '19 at 09:36

Daniel

394
6
15

3

This solution can be extended to handle nested NullType columns buy changing the line to `if 'NullType' in str(st.dataType):` – miked Dec 15 '21 at 03:36
What if the NullType columns are null arrays? – Omega Jan 19 '22 at 21:42

How to handle null values when writing to parquet from Spark

2 Answers2

Linked