
So I have a CSV file with 16 fields, and these two records in particular cannot be parsed correctly:

1,"X","X",,"Y ""Y"", Y, Y","Y,Y,Y,Y,Y,Y,Y,Y,Y",,,,,,"X",,,,"X"
2,"X","X",,"""Y"" Y, Y","Y,Y,Y,Y",,,,,,"X","X",,,"X"

Expected field splits (shown with | as the delimiter) -

1|"X"|"X"||"Y ""Y"", Y, Y"|"Y,Y,Y,Y,Y,Y,Y,Y,Y"||||||"X"||||"X"
2|"X"|"X"|"""Y"" Y, Y"|"Y,Y,Y,Y"||||||"X"|"X"|||"X"

Now, for example, the field "Y,Y,Y,Y,Y,Y,Y,Y,Y" is being correctly parsed into a single column, but """Y"" Y, Y" and "Y ""Y"", Y, Y" are failing. Is there any way to correct this when using Spark to read from a CSV? Some option I can use?

Note - the incoming data cannot be changed in any way, so escaping the double quotes in the landing data is not an option.
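
For reference, this is the kind of read that produces the mis-parse described above (Spark's default CSV escape character is \, not ", which is the usual reason a doubled "" inside a quoted field is not treated as a literal quote). The path is a placeholder:

// Reading with the default options; the quoted fields containing
// embedded commas end up split across columns, as described above.
val df = spark.read
  .format("csv")
  .load("landing/data.csv") // placeholder path
df.show(false)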

test acc
    Does [this answer](https://stackoverflow.com/a/45138591/2707792) (with [this](https://stackoverflow.com/a/49354838/2707792) correction) fully address your problem? – Andrey Tyukin Aug 29 '18 at 14:52
  • I see that both the answer and the correction are the same. Could you please confirm? – Susheel Javadi Feb 26 '21 at 11:58

1 Answer


I tried it like below and it works:

spark.read.format("csv").load("path").show
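
If the bare read does not fix it on your Spark version, a common remedy (likely the one the comments above link toward, not something stated in this answer) is to set the escape option to a double quote, so that "" inside a quoted field is read as a literal quote in the RFC 4180 style rather than via the default \-escaping. A minimal sketch, keeping "path" as the placeholder:

val df = spark.read
  .format("csv")
  .option("quote", "\"")  // the default quote character, made explicit
  .option("escape", "\"") // treat a doubled "" inside a quoted field as a literal quote
  .load("path")           // placeholder path
df.show(false)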

Chandan Ray