
I'm working with Spark 2.2.1, and using the Python code below I am able to escape special characters like @. Now I want to escape special characters like newline (\n) and carriage return (\r). I replaced the @ with \n, but it didn't work. Any suggestions, please?

Working:

spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="@")

Not Working:

spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="\n")
data_addict

2 Answers


If your goal is to read a CSV whose text fields contain multiple newlines, then the way to go is Spark's multiline option.

I recently posted some Scala code for this; here it is:

val df = spark.read
  .option("wholeFile", true)
  .option("multiline", true)
  .option("header", true)
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("test.csv")

The Python syntax will be slightly different but should work well.
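In PySpark, a rough equivalent would be the following (a sketch, assuming a SparkSession named spark and the same test.csv):

df = (spark.read
      # mirrors the Scala snippet above; option keys are case-insensitive
      .option("wholeFile", True)
      .option("multiline", True)
      .option("header", True)
      .option("inferSchema", "true")
      .option("dateFormat", "yyyy-MM-dd")
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
      .csv("test.csv"))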

parisni
  • Could you please point me to the right link where I can see the Scala code? – data_addict Feb 19 '18 at 04:26
  • I added it in the answer – parisni Feb 19 '18 at 08:00
  • Thanks. The requirement is to remove the newlines, and that is achieved using the above answer. I can't upvote your answer as I don't have enough reputation points. – data_addict Feb 19 '18 at 08:03
  • Thanks. However, the answer above uses pandas, not Spark. This won't scale for a huge CSV. – parisni Feb 19 '18 at 10:14
  • I have Spark 2.2, but the multiLine option still doesn't work. – Usman Azhar Apr 02 '18 at 00:00
  • Even with multiLine set to True, there is still an issue with the \r\n combination, where the \r can be retained. Using a read statement like the one above, I have seen the rightmost column header end in \r, and some of the rows in that last column also ended with \r. – demongolem Nov 18 '20 at 13:52
  • I'm seeing exactly what @demongolem describes. I get a \r on one of my columns when I use the Databricks Autoloader. I can't seem to find a way to fix this. – leeprevost Jun 26 '21 at 22:29
  • @leeprevost Spark 3 has a lineSep parameter for the CSV DataFrame reader (see the sketch below). – parisni Jun 27 '21 at 20:25
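For reference, a minimal sketch of the lineSep option mentioned in the comment above (assumptions: Spark 3.x and a SparkSession named spark; for CSV reads, lineSep is limited to a single character, so it cannot be set to "\r\n" directly):

# Sketch only: pin the record separator explicitly on a Spark 3 CSV read
df = (spark.read
      .option("header", True)
      .option("multiLine", True)
      .option("lineSep", "\n")
      .csv("file.csv"))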

You can achieve this using pandas.

Sample Code:

import pandas as pd

pandas_df = pd.read_csv("file.csv")
# r'\r' and r'\n' match the actual carriage-return and newline characters
pandas_df = pandas_df.replace({r'\r': ''}, regex=True)
pandas_df = pandas_df.replace({r'\n': ''}, regex=True)

You can replace any special character using the above code snippet.
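Both replacements can also be collapsed into a single pass (a sketch, using the same pandas_df as above):

# Single-pass variant: strip both control characters at once
pandas_df = pandas_df.replace({r'[\r\n]': ''}, regex=True)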

Later on, you can convert pandas_df to a Spark DataFrame as needed.

spark_df = sqlContext.createDataFrame(pandas_df)
data_addict
  • This only works if you can actually read the entire file on your local machine with pandas. – klucar Mar 06 '20 at 21:13