
I am trying to parse a CSV file that was produced on a Windows machine, using Apache Spark on a Linux machine, but accented characters are not recognized correctly...

Dataset<Row> df = spark
    .read()
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("file.csv");
Laura Webster

2 Answers


Looks like you're almost there. Try:

Dataset<Row> df = spark
    .read()
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("encoding", "cp1252")
    .load("file.csv");

You can specify the encoding as an option. For files produced on Windows, it is typically cp1252 (Windows-1252).
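If you are unsure which encoding the file actually uses, a quick way to see the difference (a minimal sketch using only the JDK, no Spark required; `é` is just an example accented character) is to decode the same byte with both charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) {
        // In cp1252, "é" is stored as the single byte 0xE9
        byte[] windowsBytes = { (byte) 0xE9 };

        // Decoding with the wrong charset yields the replacement character
        String asUtf8 = new String(windowsBytes, StandardCharsets.UTF_8);

        // Decoding with cp1252 recovers the accented character
        String asCp1252 = new String(windowsBytes, Charset.forName("windows-1252"));

        System.out.println(asUtf8);    // mojibake
        System.out.println(asCp1252);  // é
    }
}
```

If the UTF-8 decoding shows garbage while the cp1252 decoding looks right, the `encoding` option above is the fix.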

jgp

Another way is to run the dos2unix command on the file from a terminal once it has been copied to the Linux machine.

dos2unix <file_name>

This removes the carriage return characters from the file and makes it Linux-friendly. Note that it only fixes line endings; it does not change the character encoding.
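If dos2unix is not available, the same line-ending normalization can be done in plain Java (a sketch; the file names and the cp1252 assumption are placeholders, not part of the original answer):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CrlfToLf {
    public static void main(String[] args) throws IOException {
        // Read the Windows file using its original encoding (assumed cp1252 here)
        Charset cp1252 = Charset.forName("windows-1252");
        String content = new String(Files.readAllBytes(Paths.get("file.csv")), cp1252);

        // Replace CRLF line endings with LF, as dos2unix does
        String unixContent = content.replace("\r\n", "\n");

        Files.write(Paths.get("file_unix.csv"), unixContent.getBytes(cp1252));
    }
}
```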

Sai Nikhil