
I have a DataFrame whose fields I concatenate into a single column.

After the concatenation it becomes another DataFrame, and finally I write its output to a CSV file partitioned on two of its columns. One of those columns is present in the first DataFrame, and I do not want to include it in the final output.

Here is my code:

val dfMainOutput = df1resultFinal.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
      .select($"LineItem_organizationId", $"LineItem_lineItemId",
       when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition".cast(DataTypes.StringType)).as("DataPartition"),
       when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
       when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
       .filter(!$"FFAction".contains("D"))

Here I am concatenating and creating another DataFrame:

val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",
  concat_ws("|^|", dfMainOutput.schema.fieldNames.map(c => col(c)): _*).as("concatenated"))

This is what I have tried:

dfMainOutputFinal
  .drop("DataPartition")
  .write
  .partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("header","true")
  .option("encoding", "\ufeff")
  .option("codec", "gzip")
  .save("path to csv")

Now I don't want the DataPartition column in my output.

Since I partition on DataPartition, it should not appear in the file contents, but because DataPartition is present in the main DataFrame it ends up in the concatenated output.

QUESTION 1: How can I ignore a column from the DataFrame?

QUESTION 2: Is there any way to add "\ufeff" to the CSV output file before writing my actual data, so that my encoding format becomes UTF-8-BOM?

As per the suggested answer, this is what I have tried:

 val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.filter(_ != "DataPartition").fieldNames.map(c => col(c)): _*).as("concatenated"))

But I am getting the below error:

<console>:238: error: value fieldNames is not a member of Seq[org.apache.spark.sql.types.StructField]
               val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.filter(_ != "DataPartition").fieldNames.map(c => col(c)): _*).as("concatenated"))

A follow-up question: what should this look like if I have to remove two columns from the final output?

  val dfMainOutputFinal = dfMainOutput.select($"DataPartition","PartitionYear",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition","PartitionYear").map(c => col(c)): _*).as("concatenated"))
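For reference, excluding several columns from `schema.fieldNames` can be sketched with a `Set` and `filterNot` in plain Scala (no Spark session needed; the field names below are stand-ins for the real schema):

```scala
// Hypothetical field names mirroring the DataFrame schema
val fieldNames = Array("DataPartition", "PartitionYear", "StatementTypeCode", "FFAction")

// Columns to exclude from the concatenation
val excluded = Set("DataPartition", "PartitionYear")

// filterNot keeps only the names that are not in the excluded set
val kept = fieldNames.filterNot(excluded.contains)
// kept: Array(StatementTypeCode, FFAction)
```

The resulting `kept` array could then be mapped to columns with `kept.map(col): _*` inside `concat_ws`, the same way the single-column `filter(_ != "DataPartition")` version does.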

2 Answers


QUESTION 1: How can I ignore a column from the DataFrame?

Ans:

case class Person(age: Int, height: Int, weight: Int)

val df = sc.parallelize(List(Person(1, 2, 3), Person(4, 5, 6))).toDF("age", "height", "weight")

df.columns
df.show()



+---+------+------+
|age|height|weight|
+---+------+------+
|  1|     2|     3|
|  4|     5|     6|
+---+------+------+


val df_new = df.select("age", "height")
df_new.columns
df_new.show()

+---+------+
|age|height|
+---+------+
|  1|     2|
|  4|     5|
+---+------+

df: org.apache.spark.sql.DataFrame = [age: int, height: int ... 1 more field]
df_new: org.apache.spark.sql.DataFrame = [age: int, height: int]

QUESTION 2: Is there any way to add "\ufeff" to the CSV output file before writing my actual data, so that my encoding format becomes UTF-8-BOM?

Ans:

String path = "/data/vaquarkhan/input/unicode.csv";
String outputPath = "file:/data/vaquarkhan/output/output.csv";

getSparkSession()
  .read()
  .option("inferSchema", "true")
  .option("header", "true")
  .option("encoding", "UTF-8")
  .csv(path)
  .write()
  .mode(SaveMode.Overwrite)
  .csv(outputPath);
vaquar khan
  • I just don't want to select the column; I want to partition based on the first DataFrame, from which the second DataFrame has been derived. –  Oct 07 '17 at 17:15

Question 1:

The columns you use in df.write.partitionBy() will not be added to the final csv file. They are automatically ignored since the data is encoded in the file structure. However, if what you mean is to remove it from the concat_ws (and thereby from the file), it is possible to do with a small change:

concat_ws("|^|",
  dfMainOutput.schema.fieldNames
    .filter(_ != "DataPartition")
    .map(c => col(c)): _*
).as("concatenated")

Here the column DataPartition is filtered away before the concatenation.

Question 2:

Spark does not seem to support UTF-8 BOM and it seems to cause problems when reading in files with the format. I can't think of any easy way to add the BOM bytes to each csv file other than writing a script to add them after Spark has finished. My recommendation would be to simply use normal UTF-8 formatting.
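If you do need such a post-processing script, one way to sketch it in plain Scala with `java.nio` (an untested sketch; the output path is a placeholder, and note that with the gzip codec the part files would be compressed, so the bytes would have to be added before compression or after decompression):

```scala
import java.nio.file.{Files, Path, Paths}

// The UTF-8 BOM as raw bytes: EF BB BF
val bom: Array[Byte] = Array(0xEF.toByte, 0xBB.toByte, 0xBF.toByte)

// Prepend the BOM to a single file, rewriting it in place
def addBom(path: Path): Unit = {
  val original = Files.readAllBytes(path)
  Files.write(path, bom ++ original)
}

// Example: apply to every part file under the output directory (placeholder path)
// Files.walk(Paths.get("path to csv"))
//   .filter(_.toString.endsWith(".csv"))
//   .forEach(addBom)
```

Editors that detect UTF-8-BOM do so by checking for exactly these three leading bytes, so prepending them once per file is sufficient.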

dfMainOutputFinal.write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("header", "true")
  .option("encoding", "UTF-8")
  .option("codec", "gzip")
  .save("path to csv")

Additionally, according to the Unicode standard, BOM is not recommended.

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.

Shaido