
I'm using Scala in my Azure Databricks notebook. My DataFrame gives accurate results, but when I write the same data to CSV, values containing a comma (,) get split across cells:

spark.sql("""
  SELECT * FROM invalidData
  """).coalesce(1)
      .write
      .option("header", "true")
      .format("com.databricks.spark.csv")
      .mode("overwrite")
      .save(s"$dbfsMountPoint/invalid/${fileName.replace(".xlsx", ".csv")}")

One column contains values like `256GB SSD, Keyb.:`, so when the data is written with the code above, the text after the comma lands in the next cell. Any built-in Spark solution is appreciated.

  • Spark should automatically quote values that contain the separator character (you can change the quote character if you like https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option). But the software you use to read this csv file should be configured to use the same quote character as you used for writing. – Jasper-M Oct 25 '21 at 10:07
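
To illustrate the default quoting behaviour Jasper-M describes, here is a minimal sketch (the data and output path below are made up for the example):

import org.apache.spark.sql.SparkSession

// Minimal sketch: Spark quotes fields that contain the separator by default.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq(("Laptop A", "256GB SSD, Keyb.:")).toDF("product", "spec")

df.coalesce(1)
  .write
  .option("header", "true")
  .option("quote", "\"")   // the default quote character, shown explicitly
  .mode("overwrite")
  .csv("/tmp/quoted-demo") // hypothetical path

// The written row reads:  Laptop A,"256GB SSD, Keyb.:"
// A reader configured with the same quote character recovers the value intact.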

1 Answer


As @Jasper-M pointed out, you can write the output CSV with a custom separator.

In this example we use | as the separator:

spark.sql("""
  SELECT * FROM invalidData
  """).coalesce(1)
      .write
      .option("header", "true")
      .format("com.databricks.spark.csv")
      .option("sep", "|")
      .mode("overwrite")
      .save(s"$dbfsMountPoint/invalid/${fileName.replace(".xlsx", ".csv")}")

It is worth noting that the save method takes a path to save to, not the filename itself. Spark treats this path as a directory and writes a single .csv part file under it (one file, since you set `.coalesce(1)`).
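
If you need the output under a specific filename, here is a minimal sketch of locating the part file and copying it (assuming a Databricks notebook, where dbutils is in scope, and the same variables as above):

// Sketch, assuming a Databricks notebook (dbutils available) and the variables above.
val outputDir = s"$dbfsMountPoint/invalid/${fileName.replace(".xlsx", ".csv")}"

// The directory holds _SUCCESS markers plus one part-*.csv file (because of coalesce(1)).
val partFile = dbutils.fs.ls(outputDir)
  .map(_.path)
  .filter(_.endsWith(".csv"))
  .head

// Optionally copy the part file out under a stable, human-friendly name:
dbutils.fs.cp(partFile, s"$dbfsMountPoint/invalid/renamed_${fileName.replace(".xlsx", ".csv")}")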

To read the .csv back in using Spark:

spark.read.format("com.databricks.spark.csv")
      .option("inferSchema", "true")
      .option("sep", "|")                 // must match the separator used on write
      .option("header", "true")
      .load(s"$dbfsMountPoint/invalid/${path}")