
I have a CSV file with around 15 columns. I would like to:

  1. Skip the first 2 lines and use a custom schema
  2. Remove the double quotes from the row values

The CSV looks like this:

Header1 blah blah
Header2 blah blah
Name1;"1,456";"City1";"3";"pet"
Name2;"3,450";"City2";"4";"not pet"


delimiter = ";"
salesDF = spark.read.format("csv") \
     .option("quote", "") \
     .option("sep", delimiter) \
     .load("sales_2018.csv")
salesDF = salesDF.replace("\"", "")

I tried the code above to remove the quotes from the CSV. The delimiter works, but the quotes are not removed.

The result is below. Extra quotes were added instead of the existing ones being removed:

Header1 blah blah
Header2 blah blah
"Name1;""1,456"";""City1"";""3"";""pet""
"Name2;""3,450"";""City2"";""4"";""not pet""

My idea is to remove the quotes and then drop the first 2 lines of the DataFrame so that I can apply my custom schema. Thanks.
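
For reference, here is a minimal sketch of one way this could be approached; it is not a confirmed solution. Two things seem relevant to the attempt above: setting `quote` to an empty string turns quote handling off rather than stripping quotes, and `DataFrame.replace` matches whole cell values, not substrings. The sketch reads the file as plain text, drops the first two lines by index, and lets Python's `csv` module strip the quotes. The names `parse_line`, `load_sales`, `HEADER_LINES`, and the schema in the usage note are hypothetical and chosen only for illustration.

```python
import csv

DELIMITER = ";"
HEADER_LINES = 2  # assumption: exactly two non-data lines at the top of the file


def parse_line(line, delimiter=DELIMITER):
    """Split one raw line on the delimiter and strip the enclosing double quotes."""
    return tuple(next(csv.reader([line], delimiter=delimiter, quotechar='"')))


def load_sales(spark, path, schema):
    """Read `path` as text, skip the first HEADER_LINES lines, then apply `schema`.

    `spark` is an existing SparkSession; `schema` is a StructType or a DDL string.
    """
    rows = (
        spark.sparkContext.textFile(path)
        .zipWithIndex()                                # (line, line_index) pairs
        .filter(lambda pair: pair[1] >= HEADER_LINES)  # drop the two header lines
        .map(lambda pair: parse_line(pair[0]))         # split fields, remove quotes
    )
    return spark.createDataFrame(rows, schema)
```

Assuming the five columns shown in the sample (the real file has around 15), this could be called as `salesDF = load_sales(spark, "sales_2018.csv", "name string, amount string, city string, qty string, category string")`. All columns are read as strings here; values such as `1,456` would still need to be cleaned and cast separately.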

Lilly
  • Similar thread [here](https://stackoverflow.com/questions/44077404/how-to-skip-lines-while-reading-a-csv-file-as-a-dataframe-using-pyspark) – SA2010 Apr 08 '21 at 15:14
