
I'm working with Spark 2.2.1, and using the Python code below I am able to escape special characters like @. Now I want to escape special characters like newline (\n) and carriage return (\r). I replaced the @ with \n, but it didn't work. Any suggestions, please?

Working:

spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="@")

Not Working:

spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="\n")
data_addict

2 Answers


If your goal is to read a CSV whose text fields contain multiple newlines, then the way to go is Spark's multiline option.

I recently posted some Scala code for this; here it is:

val df = spark.read
  .option("wholeFile", true)
  .option("multiline", true)
  .option("header", true)
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("test.csv")

The Python syntax will be slightly different but should work well.
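In PySpark, a rough equivalent would be the following (a sketch, assuming a SparkSession named spark and the same test.csv):

df = (spark.read
      # mirrors the Scala snippet above; option keys are case-insensitive
      .option("wholeFile", True)
      .option("multiline", True)
      .option("header", True)
      .option("inferSchema", "true")
      .option("dateFormat", "yyyy-MM-dd")
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
      .csv("test.csv"))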

parisni
  • Could you please point me to the right link where I can see the Scala code? – data_addict Feb 19 '18 at 04:26
  • I added it in the answer – parisni Feb 19 '18 at 08:00
  • Thanks. The requirement is to remove the newlines, and that is achieved using the above answer. I can't upvote your answer as I don't have enough reputation points. – data_addict Feb 19 '18 at 08:03
  • Thanks. However, the answer above uses pandas, not Spark. This won't scale for a huge CSV. – parisni Feb 19 '18 at 10:14
  • I have Spark 2.2, but the multiLine option still doesn't work. – Usman Azhar Apr 02 '18 at 00:00
  • Even with multiLine set to True, there is still an issue with the \r\n combination, where the \r can be retained. Using a read statement like the one above, I have seen the rightmost column header end in \r, and some of the rows in that last column also ended with \r. – demongolem Nov 18 '20 at 13:52
  • I'm seeing exactly what @demongolem describes. I get a \r on one of my columns when I use the Databricks Autoloader. I can't seem to find a way to fix this. – leeprevost Jun 26 '21 at 22:29
  • @leeprevost Spark 3 has a lineSep parameter for the CSV DataFrame reader (see the sketch below). – parisni Jun 27 '21 at 20:25
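For reference, a minimal sketch of the lineSep option mentioned in the comment above (assumptions: Spark 3.x and a SparkSession named spark; for CSV reads, lineSep is limited to a single character, so it cannot be set to "\r\n" directly):

# Sketch only: pin the record separator explicitly on a Spark 3 CSV read
df = (spark.read
      .option("header", True)
      .option("multiLine", True)
      .option("lineSep", "\n")
      .csv("file.csv"))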

You can achieve this using pandas.

Sample Code:

import pandas as pd

pandas_df = pd.read_csv("file.csv")
# r'\r' and r'\n' match the actual carriage-return and newline characters
pandas_df = pandas_df.replace({r'\r': ''}, regex=True)
pandas_df = pandas_df.replace({r'\n': ''}, regex=True)

You can replace any special character using the above code snippet.
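Both replacements can also be collapsed into a single pass (a sketch, using the same pandas_df as above):

# Single-pass variant: strip both control characters at once
pandas_df = pandas_df.replace({r'[\r\n]': ''}, regex=True)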

Later on, you can convert pandas_df to a Spark DataFrame as needed.

spark_df = sqlContext.createDataFrame(pandas_df)
data_addict
  • This only works if you can actually read the entire file on your local machine with pandas. – klucar Mar 06 '20 at 21:13