I am trying to read a pipe-delimited CSV file into a Spark DataFrame. Some fields contain double quotes (") and pipe characters, sometimes appearing more than once in the same field, and I want to escape them. Since the double quote and the pipe are already used as the quote and delimiter characters in the `option` calls, I don't know how to escape them inside the values of col2 so that the values don't get split across different columns. Can anyone tell me how to do this?
Assume you have a file /tmp/test.csv like:
|col1|col2|col3|
||"| BLOCK "C" | IDA | chhatisgarh Mumbai | | VISAK"|"37AA57D3ZX"|
Expected output:
+----+------------------------------------------------+----------+
|col1|col2                                            |col3      |
+----+------------------------------------------------+----------+
|    || BLOCK "C" | IDA | chhatisgarh Mumbai | | VISAK|37AA57D3ZX|
+----+------------------------------------------------+----------+
What I did:
val csvfile = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("escape", " ")
  .option("multiline", "true")
  .load(filepath)
Output from the above command:
+----+---------+----+
|col1|col2     |col3|
+----+---------+----+
|    ||BLOCK "C|IDA |
+----+---------+----+
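For reference, one approach that may help (a sketch, assuming Spark 3.0+): the built-in CSV reader exposes the univocity-parsers option `unescapedQuoteHandling`, which controls what the parser does when it meets an unescaped quote inside a quoted field. With `STOP_AT_CLOSING_QUOTE` it keeps accumulating characters, including delimiters and stray quotes, until the closing quote of the field is found. The `com.databricks.spark.csv` format is also no longer needed, since CSV support has been built in since Spark 2.0.

```scala
// Untested sketch, assumes Spark 3.0+ where unescapedQuoteHandling is available.
val df = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("quote", "\"")
  // Keep reading the field (pipes and inner quotes included)
  // until the closing quote is reached.
  .option("unescapedQuoteHandling", "STOP_AT_CLOSING_QUOTE")
  .option("multiLine", "true")
  .csv("/tmp/test.csv")

df.show(false)
```

Other values for `unescapedQuoteHandling` (`BACK_TO_DELIMITER`, `STOP_AT_DELIMITER`, `SKIP_VALUE`, `RAISE_ERROR`) may behave better depending on how the malformed quotes appear in your data.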