
Let's say I've got a simple pipe-delimited file with missing values:

A|B||D

I read that into a dataframe:

val foo = spark.read.format("csv").option("delimiter", "|").load("/path/to/my/file.txt")

The missing third column, instead of being a null value, has the string null:

+---+---+----+---+
|_c0|_c1| _c2|_c3|
+---+---+----+---+
|  A|  B|null|  D|
+---+---+----+---+

I would like to be able to control what Spark does with that missing value. According to the CSV Documentation, there is a nullValue option:

nullValue: string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame.

But since I don't have any value there, I can't seem to control it that way.

Short of testing every single field of every single file I read in, is there a way to control what Spark does with these blanks? For example, replacing them with 'N/A'.

We're using Spark 2.1, if it's relevant.

Andrew
  • I don't think there is an option for that. You can always transform the result whatever way you want. If you need I can give an example for that – Salim Jan 23 '20 at 16:52
  • Have you tried either option("nullValue", '') or option("emptyValue", '')? – K S Nidhin Jan 24 '20 at 15:58

1 Answer


The nullValue and emptyValue options do the opposite of what you expect: they let you specify values that, if encountered in the source data, will be turned into null or "" (respectively) in the resulting DataFrame. For instance, if your input was instead

a|b|N/A|d

and you set

option("nullValue", "N/A")

you would wind up once again with

+---+---+----+---+
|_c0|_c1| _c2|_c3|
+---+---+----+---+
|  a|  b|null|  d|
+---+---+----+---+
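Put together, the full read would look something like this (the path and variable name are just placeholders, carried over from your question):

val foo = spark.read
  .format("csv")
  .option("delimiter", "|")
  .option("nullValue", "N/A")  // "N/A" in the source becomes null in the DataFrame
  .load("/path/to/my/file.txt")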

More to the point, Spark includes functions for dealing with null values in the class DataFrameNaFunctions. For example, df.na.fill("N/A") will replace all null instances in string columns with "N/A".
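As a minimal sketch of that approach, reusing the read from your question (variable names are just illustrative):

val foo = spark.read
  .format("csv")
  .option("delimiter", "|")
  .load("/path/to/my/file.txt")

// Replace nulls in all string columns with "N/A"
val filled = foo.na.fill("N/A")

// Or restrict the replacement to particular columns
val filledC2 = foo.na.fill("N/A", Seq("_c2"))

filled.show()

Note that fill("N/A") only touches string columns; there are separate overloads for numeric and boolean columns.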

Charlie Flowers