
I have two datasets. Dataset 1 is below:

LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^|SegmentChildDescription.languageId|^|SegmentChildLocalLanguageLabel.languageId|^|SegmentGroupDescription.languageId|^|SegmentMultipleFundbDescription|^|SegmentMultipleFundbDescription.languageId|^|IsCredit|^|FinancialConceptLocalId|^|FinancialConceptGlobalId|^|FinancialConceptCodeGlobalSecondaryId|^|FFAction|!|
Japan|^|1507101869432|^|4295876606|^|1|^|BAL|^|Cash And Deposits|^|null|^|null|^|ACAE|^|false|^|null|^|null|^|null|^|null|^|false|^|null|^|null|^|null|^|null|^|505126|^|505074|^|null|^|null|^|null|^|null|^|null|^|null|^|null|^|3018759|^|null|^|I|!|

And this is how I load the data with an auto-discovered schema:

// Replace "." in column names with "_" (dots would otherwise need escaping in column references)
val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
// Keep only the real columns, dropping the stray "^"/"!" separator columns and auto-generated "_c" columns
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)
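The read that produces df is not shown. Judging by the filter above (which drops stray "^" and auto-generated "_c" columns), it was presumably a read with a single-character separator: Spark 2.x's csv source only accepts one-character separators, so each "|^|" turns into extra "^" columns. A hedged reconstruction (the input path is a placeholder):

// Hypothetical reconstruction of the load step; the input path is a placeholder.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")  // the "auto discover schema" part
  .option("sep", "|")             // splits each |^| into extra "^" columns
  .load("path/to/dataset1")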

Dataset 2:

4295867927|^|860|^|CUS|^|External Revenue|^||^||^|REXR|^|False|^||^||^||^||^|False|^|False|^|CUS_REXR|^||^||^|505074|^|505074|^|505074|^|505074|^|505074|^||^|505074|^|True|^||^|3015250|^||^|I|!|

I create a data frame out of each dataset, join them, and finally write the joined output to a CSV file.

Here is the code that writes the CSV file:

val dfMainOutputFinal = dfMainOutput.select(
  $"DataPartition",
  $"StatementTypeCode",
  concat_ws("|^|",
    dfMainOutput.schema.fieldNames
      .filter(_ != "DataPartition")
      .map(c => col(c)): _*
  ).as("concatenated"))

val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated",
  regexp_replace(dfMainOutputFinal.col("concatenated"), "null", ""))
dfMainOutputFinalWithoutNull.show()

dfMainOutputFinal.write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("codec", "gzip")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/output")

Everything works fine except .option("nullValue", ""): I am not able to replace null with a blank value.

In my output I still see null values.

I have also tried this, but got the same result:

val newDf = df.na.fill("e",Seq("blank"))
Sudarshan kumar

2 Answers


I suspect that the dataframe does not actually contain nulls, but rather strings with the letters "null". If that is the case, you can simply replace all instances of "null" with "", and after this you can use .option("nullValue", "") as before. To replace a string in a column, you can use regexp_replace(column, "string to replace", "string to replace with"). A small example:

import org.apache.spark.sql.functions.{col, regexp_replace}
import spark.implicits._  // needed for toDF

val df = Seq("a", "null", "c", "b").toDF("col1")
val df2 = df.withColumn("col1", regexp_replace(col("col1"), "null", ""))

Here "null" was replaced with "" as wanted; the final dataframe looks like this:

+----+
|col1|
+----+
|   a|
|    |
|   c|
|   b|
+----+
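To see why option("nullValue", "") alone did not help: at write time that option only substitutes actual (SQL) nulls, not the four-character string "null". A minimal sketch, assuming a SparkSession with its implicits in scope and a placeholder output path:

// A true null is replaced by the write-time nullValue,
// but the literal string "null" is written out unchanged.
val demo = Seq(("a", null.asInstanceOf[String]), ("b", "null")).toDF("c1", "c2")
demo.write.option("nullValue", "").csv("/tmp/nullvalue-demo")  // placeholder path
// written rows: "a,"  (true null -> "")  and  "b,null"  (string kept as-is)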
Shaido
  • So in my case do I have to do this for all columns? Can I do it before writing to the csv file? – Sudarshan kumar Oct 09 '17 at 07:41
  • @SUDARSHAN Yes, you need to do it for all columns before writing the csv file. – Shaido Oct 09 '17 at 07:48
  • Then again the schema will change and I will have problems writing the partition files – Sudarshan kumar Oct 09 '17 at 07:50
  • @SUDARSHAN You do not have to make any changes to the schema. As in my example above with the "col1" column, the input and output column are the same. Hence, the final dataframe has the same column names and the same column types. – Shaido Oct 09 '17 at 07:54
  • @SUDARSHAN Looking at previous questions, I assume you are merging the columns yourself before writing the csv. The easiest way would be to do this step (with `regexp_replace`) only on the final, merged column. Then you only have to do it once (see the sketch after these comments). – Shaido Oct 09 '17 at 08:02
  • Ohh ok that's good... I am trying... But the writing syntax is very difficult in Scala – Sudarshan kumar Oct 09 '17 at 08:03
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/156249/discussion-between-sudarshan-and-shaido). – Sudarshan kumar Oct 09 '17 at 08:14
  • Hi, just one modification I need in my final data frame: in case I want the header included in my csv output file, I get only one header, `concatenated`. Is there any way to use the header of my first data frame, `df1result`, instead? If required I can create another question for this – Sudarshan kumar Oct 09 '17 at 09:16
  • @SUDARSHAN Yes, it would probably be better to create a new one. Then other people can weigh in on the best method and future persons with the same problem can more easily find an answer. :) – Shaido Oct 09 '17 at 09:25
  • Yeah sure, I have created one: https://stackoverflow.com/questions/46643166/how-to-change-header-in-spark-dataframe-with-another-data-frame-header – Sudarshan kumar Oct 09 '17 at 09:33
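Putting the last suggestion above into code: a sketch that cleans the single merged column once before writing, using the question's own names and output path (not a verified, complete pipeline):

// Clean the merged column once, then write it out partitioned as in the question.
val dfMainOutputCleaned = dfMainOutputFinal
  .withColumn("concatenated", regexp_replace(col("concatenated"), "null", ""))

dfMainOutputCleaned.write
  .partitionBy("DataPartition", "StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("codec", "gzip")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/output")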

option("nullValue", "whatever") checks whether a column value equals "whatever" and, if so, treats that value as null in the dataframe.

Just use the option during read and you will be fine.

Dataset<Row> df = spark.read().format("csv")
    .option("nullValue", "NULL")   // this config does the trick
    .option("sep", ",")
    .schema(structType)
    .load(filePath);
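Since the rest of the question is in Scala, the same read would look like this there (a sketch; structType and filePath are placeholders carried over from the Java snippet):

val df = spark.read.format("csv")
  .option("nullValue", "NULL")   // values equal to "NULL" become true nulls on read
  .option("sep", ",")
  .schema(structType)
  .load(filePath)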