Hope everyone is doing well.

While going through the Spark CSV datasource options for the following question, I am quite confused about the differences between the various quote-related options available.

spark-csv-quote-options

  1. Is there a detailed explanation of the differences between them?
  2. Does any option override the others, or do they all work together?

I used the example mentioned in the linked question to understand the differences, but I am still a little confused. Thank you for all the help.

rainingdistros

1 Answer


Let's run through a few examples.

import spark.implicits._  // already in scope in spark-shell; needed elsewhere for toDF

val df = Seq(
  (1, "aaa"),
  (2, "b,c"),
  (3, "d$e"),
  (4, "f%g")
).toDF("x", "y")
df.show

+---+---+
|  x|  y|
+---+---+
|  1|aaa|
|  2|b,c|
|  3|d$e|
|  4|f%g|
+---+---+

Whenever a value contains a comma, it would be indistinguishable from the field delimiter (like 2,b,c), so Spark quotes such values. The quote character is a double quote by default (like 2,"b,c"), but you can customize it using the quote option.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .csv("x")

1,aaa
2,$b,c$
3,$d\$e$
4,f%g

We used the dollar sign here. Not only was b,c quoted; since d$e contained a dollar sign, it was quoted as well, and the dollar itself was escaped.
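For what it's worth, quote is a read option as well, so to get the data back from a file written with a custom quote character, pass the same character on read. A minimal sketch, assuming the file written to the x directory above (with the default double quote, $b and c$ would come back as two separate fields):

// read back with the same custom quote character;
// the default escape (\) un-escapes \$ inside quoted values
val back = spark.read
  .option("quote", "$")
  .csv("x")
back.show

This should show the original four rows, with b,c and d$e intact.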

I don't know why you would want to do that, but you can ask Spark not to quote (and escape) fields containing the quote character using the escapeQuotes option.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("escapeQuotes", false)
  .csv("x")

1,aaa
2,$b,c$
3,d$e
4,f%g

Here, d$e remained unquoted.

You can also force it to quote every field using quoteAll.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("quoteAll", true)
  .csv("x1")

$1$,$aaa$
$2$,$b,c$
$3$,$d\$e$
$4$,$f%g$

Note that quoteAll = true makes escapeQuotes irrelevant.
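A minimal sketch to see that, using the same df: adding escapeQuotes = false to the previous snippet should produce exactly the same fully quoted output as above.

// escapeQuotes has no effect once quoteAll is enabled:
// every field is wrapped in the quote character anyway
df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("quoteAll", true)
  .option("escapeQuotes", false)
  .csv("x1")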

If you don't like the default backslash, you can customize the character used to escape the quote character inside quoted values with the escape option.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("escape", "%")
  .csv("x1")

1,aaa
2,$b,c$
3,$d%$e$
4,$f%g$

Finally, note that f%g contains a standalone escape character, so if you would like to escape that as well, use charToEscapeQuoteEscaping.

df.coalesce(1).write.mode("overwrite")
  .option("quote", "$")
  .option("escape", "%")
  .option("quoteAll", true)
  .option("charToEscapeQuoteEscaping", "^")
  .csv("x1")

$1$,$aaa$
$2$,$b,c$
$3$,$d%$e$
$4$,$f^%g$

No, you can't go deeper and escape ^...

Kombajn zbożowy
  • Extraordinary explanation... – rainingdistros Nov 14 '22 at 12:44
  • Is the below summary accurate? `quote` - encloses a string that contains the delimiter, i.e. a comma in a CSV; `escape` - when the quote character is part of a string, it is escaped with the escape character; `escapeQuotes` - controls whether such strings are quoted and escaped at all (set it to false to skip that); `quoteAll` - quotes all fields irrespective of whether they contain a delimiter or other special characters; `charToEscapeQuoteEscaping` - when the escape character is part of a string, it is escaped with charToEscapeQuoteEscaping; and quoteAll = true overrides escapeQuotes. – rainingdistros Nov 14 '22 at 12:46
  • Also, does quote apply just to the delimiter, or also to other special characters such as \r, \n, \000, \001, etc.? – rainingdistros Nov 14 '22 at 12:47
  • Yes, that's quite accurate. A value containing \n is quoted, because that's the record delimiter. Anything other than the quote character, the record delimiter, and the field delimiter is not (see the sketch below). – Kombajn zbożowy Nov 16 '22 at 07:11
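A minimal sketch of that last point, using a hypothetical df2 and output directory x2: a value containing the record delimiter (\n) should end up quoted even with the default options, while other special characters such as a tab are written as-is.

// only the quote character, the field delimiter and the record delimiter
// trigger quoting; the \t below is expected to stay unquoted
val df2 = Seq(
  (1, "line1\nline2"),
  (2, "tab\there")
).toDF("x", "y")

df2.coalesce(1).write.mode("overwrite").csv("x2")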