
So I have a CSV file with 16 fields, and these two records in particular cannot be parsed correctly:

1,"X","X",,"Y ""Y"", Y, Y","Y,Y,Y,Y,Y,Y,Y,Y,Y",,,,,,"X",,,,"X"
2,"X","X",,"""Y"" Y, Y","Y,Y,Y,Y",,,,,,"X","X",,,"X"

Expected field splits (shown with | as the delimiter) -

1|"X"|"X"||"Y ""Y"", Y, Y"|"Y,Y,Y,Y,Y,Y,Y,Y,Y"||||||"X"||||"X"
2|"X"|"X"|"""Y"" Y, Y"|"Y,Y,Y,Y"||||||"X"|"X"|||"X"

Now, for example, the field "Y,Y,Y,Y,Y,Y,Y,Y,Y" is being correctly parsed into a single column, but """Y"" Y, Y" and "Y ""Y"", Y, Y" are failing. Is there any way to correct this when using Spark to read from a CSV? Some option I can use?

Note - the incoming data cannot be changed in any way, so escaping the double quotes in the landing data is not an option.
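
For reference, this is the kind of read that produces the mis-parse described above (Spark's default CSV escape character is \, not ", which is the usual reason a doubled "" inside a quoted field is not treated as a literal quote). The path is a placeholder:

// Reading with the default options; the quoted fields containing
// embedded commas end up split across columns, as described above.
val df = spark.read
  .format("csv")
  .load("landing/data.csv") // placeholder path
df.show(false)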

test acc
    Does [this answer](https://stackoverflow.com/a/45138591/2707792) (with [this](https://stackoverflow.com/a/49354838/2707792) correction) fully address your problem? – Andrey Tyukin Aug 29 '18 at 14:52
  • I see that both the answer and the correction are the same. Could you please confirm? – Susheel Javadi Feb 26 '21 at 11:58

1 Answer


I tried it like below and it works:

spark.read.format("csv").load("path").show
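
If the bare read does not fix it on your Spark version, a common remedy (likely the one the comments above link toward, not something stated in this answer) is to set the escape option to a double quote, so that "" inside a quoted field is read as a literal quote in the RFC 4180 style rather than via the default \-escaping. A minimal sketch, keeping "path" as the placeholder:

val df = spark.read
  .format("csv")
  .option("quote", "\"")  // the default quote character, made explicit
  .option("escape", "\"") // treat a doubled "" inside a quoted field as a literal quote
  .load("path")           // placeholder path
df.show(false)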

Chandan Ray