
I am trying to read a pipe-delimited CSV file into a Spark DataFrame. Some fields contain double quotes (") and pipes, in some cases more than once within the same field, and I want to escape them. Since the double quote and pipe characters are also used in the parameters passed to the `option` method, I don't know how to escape them in the values under col2 so that the values don't get split across different columns. Can anyone tell me how to do this?

Assume you have a file /tmp/test.csv like

|col1|col2|col3|
||"|  BLOCK "C" |  IDA  | chhatisgarh Mumbai  |  |  VISAK"|"37AA57D3ZX"|

Expected output:

+----+--------------------------------------------------+---------------+
|Col1|Col2                                              | Col3          |
+----+--------------------------------------------------+---------------+
|    |  |BLOCK "C" |  IDA  | chattisgarh Mumbai  |  |  VISAK|37AA57D3ZX|
+----+--------------------------------------------------+---------------+

What I did:

val csvfile = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("escape", " ")
  .option("multiline", "true")
  .load(filepath)

Output from the above command:

+----+--------------------------------------------------+---------------+
|Col1|Col2                                              | Col3          |
+----+--------------------------------------------------+---------------+
|    |  |BLOCK "C                                       |IDA            |
+----+--------------------------------------------------+---------------+
  • Double quotes in Scala strings are inserted using `\"`, e.g. `"string with \"quotes\" in it"`. Another option is to use triple quotes: `"""string with "quotes" in it"""`. – Hristo Iliev Sep 20 '21 at 09:43
  • I agree, but if I am getting the source data in this single-double-quote format, then how do I handle such a scenario? – Monalisa Jena Sep 20 '21 at 19:19
  • You need to have the data in a format that allows for unambiguous parsing. As it currently stands, there is no way to know that the quotes in `"C"` are both part of a larger string. Therefore, the CSV itself must be written with internal quotes escaped, i.e., `||"| BLOCK \"C\" ..."`. Provided that this is the case, you can configure the reader to use `"\\"` as the escape character and it will parse the string correctly. – Hristo Iliev Sep 23 '21 at 12:59
  • See [this answer](https://stackoverflow.com/a/45138591/1374437) for more information. – Hristo Iliev Sep 23 '21 at 13:04
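
Following Hristo Iliev's suggestion in the comments, here is a minimal sketch of what that could look like. It assumes the source file has been re-written so that the quotes inside col2 are backslash-escaped (the field as a whole stays quoted), and it keeps the same reader call as in the question, changing only the escape option; the path /tmp/test.csv is the one from the example above.

```scala
// Hypothetical re-written /tmp/test.csv, with the quotes inside col2
// escaped with a backslash so the whole field stays one quoted value:
//
// col1|col2|col3
// |"|  BLOCK \"C\" |  IDA  | chhatisgarh Mumbai  |  |  VISAK"|"37AA57D3ZX"

// Same reader as in the question, but with backslash as the escape character
val csvfile = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("quote", "\"")    // field values are wrapped in double quotes
  .option("escape", "\\")   // backslash escapes quotes inside a quoted field
  .option("multiline", "true")
  .load("/tmp/test.csv")

csvfile.show(false)
```

With the field kept as a single quoted value, the pipes inside col2 should be read as literal characters rather than delimiters, so nothing spills over into col3.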
