I have a CSV file in which one column sometimes contains a newline character (\n or \r). I need to parse this file into a DataFrame, ignoring or removing those characters, BUT the values are NOT surrounded by quotes; otherwise I could simply add .option("multiline", true).
Similar question, but with values surrounded by quotes: Escape New line character in Spark CSV read
Sample Code:
val df = spark.read
  .option("wholeFile", true)
  .option("multiline", true)
  .option("header", true)
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("test.csv")
sample input:
id,comment,name
1,good,bob
2,bad
,tim
3,fine,sarah
sample output:
id | comment | name |
---|---|---|
1 | good | bob |
2 | bad | null |
tim | null | null |
3 | fine | sarah |
desired output:
id | comment | name |
---|---|---|
1 | good | bob |
2 | bad | tim |
3 | fine | sarah |
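
For illustration, one possible direction is to repair the raw lines before Spark parses them. The sketch below is not a confirmed solution, only a minimal example under some assumptions: the header row defines the expected number of columns, the delimiter (a comma) never occurs inside a value, and the stray line break is not in the last column (a break there cannot be detected by counting delimiters). The names `RepairCsv` and `repairLines` are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

object RepairCsv {
  // Hypothetical helper: joins consecutive physical lines until the joined
  // record contains as many delimiters as the header line does.
  // Assumes the delimiter never appears inside a value.
  def repairLines(lines: Seq[String], delimiter: Char = ','): Seq[String] = {
    val expected = lines.headOption.map(_.count(_ == delimiter)).getOrElse(0)
    val repaired = ArrayBuffer.empty[String]
    var current = ""
    var pending = false
    for (line <- lines) {
      current += line // the line break itself is dropped when joining
      pending = true
      if (current.count(_ == delimiter) >= expected) {
        repaired += current
        current = ""
        pending = false
      }
    }
    // Keep a trailing incomplete record rather than silently losing it.
    if (pending) repaired += current
    repaired.toSeq
  }

  def main(args: Array[String]): Unit = {
    val raw = "id,comment,name\n1,good,bob\n2,bad\n,tim\n3,fine,sarah"
    // Split on \r\n, \r, or \n so both kinds of stray break are handled.
    val lines = raw.split("\r\n|\r|\n").toSeq
    repairLines(lines).foreach(println)
  }
}
```

With the sample input above, the lines `2,bad` and `,tim` are joined into `2,bad,tim`. In Spark, the repaired lines could then be turned into a `Dataset[String]` and passed to `spark.read.option("header", true).csv(...)`, which accepts a `Dataset[String]` since Spark 2.2. Whether this scales to large files depends on how the raw lines are read, which the sketch does not address.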