
I am trying to parse a CSV file that was produced on a Windows machine, using Apache Spark on a Linux machine, but accented characters are not recognized correctly...

Dataset<Row> df = spark
    .read()
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("file.csv");
Laura Webster

2 Answers


Looks like you're almost there. Try:

Dataset<Row> df = spark
    .read()
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "cp1252")
    .load("file.csv");

You can specify the encoding as an option. For files produced on Windows, it is typically cp1252 (Windows-1252).
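If you are unsure which encoding the file actually uses, a quick way to see the difference (a minimal sketch using only the JDK, no Spark required; `é` is just an example accented character) is to decode the same byte with both charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) {
        // In cp1252, "é" is stored as the single byte 0xE9
        byte[] windowsBytes = { (byte) 0xE9 };

        // Decoding with the wrong charset yields the replacement character
        String asUtf8 = new String(windowsBytes, StandardCharsets.UTF_8);

        // Decoding with cp1252 recovers the accented character
        String asCp1252 = new String(windowsBytes, Charset.forName("windows-1252"));

        System.out.println(asUtf8);    // mojibake
        System.out.println(asCp1252);  // é
    }
}
```

If the UTF-8 decoding shows garbage while the cp1252 decoding looks right, the `encoding` option above is the fix.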

jgp

Another way is to run the dos2unix command on the file from a terminal once it has been copied to the Linux machine.

dos2unix <file_name>

This removes the carriage return characters from the file and makes it Linux-friendly. Note that it only fixes line endings; it does not change the character encoding.
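If dos2unix is not available, the same line-ending normalization can be done in plain Java (a sketch; the file names and the cp1252 assumption are placeholders, not part of the original answer):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CrlfToLf {
    public static void main(String[] args) throws IOException {
        // Read the Windows file using its original encoding (assumed cp1252 here)
        Charset cp1252 = Charset.forName("windows-1252");
        String content = new String(Files.readAllBytes(Paths.get("file.csv")), cp1252);

        // Replace CRLF line endings with LF, as dos2unix does
        String unixContent = content.replace("\r\n", "\n");

        Files.write(Paths.get("file_unix.csv"), unixContent.getBytes(cp1252));
    }
}
```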

Sai Nikhil