
I have two datasets. Dataset 1 is below:

LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^|SegmentChildDescription.languageId|^|SegmentChildLocalLanguageLabel.languageId|^|SegmentGroupDescription.languageId|^|SegmentMultipleFundbDescription|^|SegmentMultipleFundbDescription.languageId|^|IsCredit|^|FinancialConceptLocalId|^|FinancialConceptGlobalId|^|FinancialConceptCodeGlobalSecondaryId|^|FFAction|!|
Japan|^|1507101869432|^|4295876606|^|1|^|BAL|^|Cash And Deposits|^|null|^|null|^|ACAE|^|false|^|null|^|null|^|null|^|null|^|false|^|null|^|null|^|null|^|null|^|505126|^|505074|^|null|^|null|^|null|^|null|^|null|^|null|^|null|^|3018759|^|null|^|I|!|

And this is how I load the data with an auto-discovered schema:

// Replace "." in column names with "_" (dots would otherwise need escaping in column references)
val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
// Keep only the real columns, dropping the stray "^"/"!" separator columns and auto-generated "_c" columns
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)
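The read that produces df is not shown. Judging by the filter above (which drops stray "^" and auto-generated "_c" columns), it was presumably a read with a single-character separator: Spark 2.x's csv source only accepts one-character separators, so each "|^|" turns into extra "^" columns. A hedged reconstruction (the input path is a placeholder):

// Hypothetical reconstruction of the load step; the input path is a placeholder.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")  // the "auto discover schema" part
  .option("sep", "|")             // splits each |^| into extra "^" columns
  .load("path/to/dataset1")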

Dataset 2:

4295867927|^|860|^|CUS|^|External Revenue|^||^||^|REXR|^|False|^||^||^||^||^|False|^|False|^|CUS_REXR|^||^||^|505074|^|505074|^|505074|^|505074|^|505074|^||^|505074|^|True|^||^|3015250|^||^|I|!|

I create a data frame out of each dataset, join them, and finally write the joined output to a CSV file.

Here is the code that writes the CSV file:

val dfMainOutputFinal = dfMainOutput.select(
  $"DataPartition",
  $"StatementTypeCode",
  concat_ws("|^|",
    dfMainOutput.schema.fieldNames
      .filter(_ != "DataPartition")
      .map(c => col(c)): _*
  ).as("concatenated"))

val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated",
  regexp_replace(dfMainOutputFinal.col("concatenated"), "null", ""))
dfMainOutputFinalWithoutNull.show()

dfMainOutputFinal.write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("codec", "gzip")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/output")

Everything works fine except .option("nullValue", ""): I am not able to replace null with a blank value.

In my output I still see null values.

I have also tried this, but got the same result:

val newDf = df.na.fill("e",Seq("blank"))
Sudarshan kumar

2 Answers


I suspect that the dataframe does not actually contain nulls, but rather strings with the letters "null". If that is the case, you can simply replace all instances of "null" with "", and after this you can use .option("nullValue", "") as before. To replace a string in a column, you can use regexp_replace(column, "string to replace", "string to replace with"). A small example:

import org.apache.spark.sql.functions.{col, regexp_replace}
import spark.implicits._  // needed for toDF

val df = Seq("a", "null", "c", "b").toDF("col1")
val df2 = df.withColumn("col1", regexp_replace(col("col1"), "null", ""))

Here "null" was replaced with "" as wanted; the final dataframe looks like this:

+----+
|col1|
+----+
|   a|
|    |
|   c|
|   b|
+----+
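To see why option("nullValue", "") alone did not help: at write time that option only substitutes actual (SQL) nulls, not the four-character string "null". A minimal sketch, assuming a SparkSession with its implicits in scope and a placeholder output path:

// A true null is replaced by the write-time nullValue,
// but the literal string "null" is written out unchanged.
val demo = Seq(("a", null.asInstanceOf[String]), ("b", "null")).toDF("c1", "c2")
demo.write.option("nullValue", "").csv("/tmp/nullvalue-demo")  // placeholder path
// written rows: "a,"  (true null -> "")  and  "b,null"  (string kept as-is)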
Shaido
  • So in my case do I have to do this for all columns? Can I do it before writing to the csv file? – Sudarshan kumar Oct 09 '17 at 07:41
  • @SUDARSHAN Yes, you need to do it for all columns before writing the csv file. – Shaido Oct 09 '17 at 07:48
  • Then again the schema will change and I will have problems writing the partition files – Sudarshan kumar Oct 09 '17 at 07:50
  • @SUDARSHAN You do not have to make any changes to the schema. As in my example above with the "col1" column, the input and output column are the same. Hence, the final dataframe has the same column names and the same column types. – Shaido Oct 09 '17 at 07:54
  • @SUDARSHAN Looking at previous questions, I assume you are merging the columns yourself before writing the csv. The easiest way would be to do this step (with `regexp_replace`) only on the final, merged column. Then you only have to do it once (see the sketch after these comments). – Shaido Oct 09 '17 at 08:02
  • Ohh ok that's good... I am trying... But the writing syntax is very difficult in Scala – Sudarshan kumar Oct 09 '17 at 08:03
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/156249/discussion-between-sudarshan-and-shaido). – Sudarshan kumar Oct 09 '17 at 08:14
  • Hi, just one modification I need in my final data frame: in case I want the header included in my csv output file, I get only one header, `concatenated`. Is there any way to use the header of my first data frame, `df1result`, instead? If required I can create another question for this – Sudarshan kumar Oct 09 '17 at 09:16
  • @SUDARSHAN Yes, it would probably be better to create a new one. Then other people can weigh in on the best method and future persons with the same problem can more easily find an answer. :) – Shaido Oct 09 '17 at 09:25
  • Yeah sure, I have created one: https://stackoverflow.com/questions/46643166/how-to-change-header-in-spark-dataframe-with-another-data-frame-header – Sudarshan kumar Oct 09 '17 at 09:33
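Putting the last suggestion above into code: a sketch that cleans the single merged column once before writing, using the question's own names and output path (not a verified, complete pipeline):

// Clean the merged column once, then write it out partitioned as in the question.
val dfMainOutputCleaned = dfMainOutputFinal
  .withColumn("concatenated", regexp_replace(col("concatenated"), "null", ""))

dfMainOutputCleaned.write
  .partitionBy("DataPartition", "StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("codec", "gzip")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/output")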

option("nullValue", "whatever") checks whether a column value equals "whatever" and, if so, treats that value as null in the dataframe.

Just use the option during read and you will be fine.

Dataset<Row> df = spark.read().format("csv")
    .option("nullValue", "NULL")   // this config does the trick
    .option("sep", ",")
    .schema(structType)
    .load(filePath);
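Since the rest of the question is in Scala, the same read would look like this there (a sketch; structType and filePath are placeholders carried over from the Java snippet):

val df = spark.read.format("csv")
  .option("nullValue", "NULL")   // values equal to "NULL" become true nulls on read
  .option("sep", ",")
  .schema(structType)
  .load(filePath)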