
Probably a silly issue, but I don't get it. I'm working in a Jupyter Notebook with Python 3.6 and Spark 2.4, hosted on IBM Watson Studio.

I have a simple csv file:

num,label
0,0
1,0
2,0
3,0

To read it I use the following command:

labels = spark.read.csv(url, sep=',', header=True)

But when I check whether `labels` is correct using `labels.head()`, I get `Row(PAR1Љ��L�Q�� ='\x08\x00]')`.

What am I missing?

Vincenzo Lavorini

1 Answer

This looks like it's due to an encoding issue.

Try providing an encoding via the reader option; also try UTF-8:

labels = spark.read.option("encoding", "ISO-8859-1").csv(url, sep=',', header=True)
dsk
  • Indeed, the ISO-8859-1 encoding did the job. I passed it as a keyword argument: `labels = spark.read.csv(url, sep=',', header=True, encoding="ISO-8859-1")` – Vincenzo Lavorini Jul 03 '20 at 07:59
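
For readers without a Spark session at hand, the same pitfall can be reproduced with plain Python's `csv` module (a minimal sketch; the sample data is hypothetical): bytes written in ISO-8859-1 fail to decode as UTF-8, but decode cleanly once the correct encoding is supplied.

```python
import csv
import io

# Bytes as they might appear in a file saved with ISO-8859-1 (Latin-1):
# the label column contains an accented character that is not valid UTF-8.
raw = "num,label\n0,caf\u00e9\n".encode("iso-8859-1")

# Decoding with the wrong encoding fails outright...
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# ...while decoding with the encoding the file was written in works.
rows = list(csv.reader(io.StringIO(raw.decode("iso-8859-1"))))
print(decoded_ok)  # False
print(rows[1])     # ['0', 'café']
```

Spark's `encoding` option plays the same role as the explicit `decode()` call here: it tells the CSV reader how to interpret the raw bytes before parsing.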