31

Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

However it will be a long time before spark supports that new parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:

https://issues.apache.org/jira/browse/SPARK-10943

So what are folks doing with regards to null column values today when writing out dataframe's to parquet ? I can only think of very ugly horrible hacks like writing empty strings and .. well .. I have no idea what to do with numerical values to indicate null - short of putting some sentinel value in and having my code check for it (which is inconvenient and bug prone).

WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560

2 Answers2

38

You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns.

The problem is that null alone carries no type information at all

scala> spark.sql("SELECT null as comments").printSchema
root
 |-- comments: null (nullable = true)

As per comment by Michael Armbrust all you have to do is cast:

scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
|-- comments: double (nullable = true)

and the result can be safely written to Parquet.

Alper t. Turker
  • 34,230
  • 9
  • 83
  • 115
  • 1
    If you're using sql something like this works: create table my_table as select distinct colx, coly, cast(null as string) colz from foo where lower(bar) like '%bat%' – John Oct 28 '20 at 13:37
  • Is it possible to somehow cast fields of a StructType? I mean, there is a column which type is a struct which has a field which type is inferred as NullType. Is it possible to cast it as well? – Roman Puchkovskiy Aug 11 '21 at 11:02
14

I wrote a PySpark solution for this (df is a dataframe with columns of NullType):

# get dataframe schema
my_schema = list(df.schema)

null_cols = []

# iterate over schema list to filter for NullType columns
for st in my_schema:
    if str(st.dataType) == 'NullType':
        null_cols.append(st)

# cast null type columns to string (or whatever you'd like)
for ncol in null_cols:
    mycolname = str(ncol.name)
    df = df \
        .withColumn(mycolname, df[mycolname].cast('string'))
ZygD
  • 22,092
  • 39
  • 79
  • 102
Daniel
  • 394
  • 6
  • 15