-2

I am trying to read a UTF-16 file into a PySpark DataFrame. Wherever the file contains a space, the value is shown with a box character when I display it using df.display(). How can I read the file properly?

df = spark.read.option("delimiter", "|") \
        .option("header", "true") \
        .option("encoding", "UTF-16") \
        .option("multiLine", "true") \
        .csv("<<path>>")

Error Screenshot: Space in file while reading through dataframe

Rathesh
  • Does all of the remaining data show up as desired? If so, my guess is that the "space" isn't a plain U+0020 but some other Unicode space (or a character that isn't rendered visibly and isn't really a space at all). Have you inspected the input file with a hex editor to investigate? – Joachim Sauer Jul 26 '23 at 11:49
  • Let me check it with a hex editor. – Rathesh Jul 27 '23 at 02:01
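
If a hex editor isn't handy, the check suggested in the comments can also be sketched in plain Python. This is only a minimal sketch: the helper name `find_unusual_chars` and the sample data are hypothetical, and it assumes the file really decodes as UTF-16. It flags every character outside printable ASCII, so legitimate accented letters are reported as well as unusual spaces.

```python
import unicodedata


def find_unusual_chars(raw: bytes, encoding: str = "utf-16"):
    """Return (offset, code point, Unicode name) for every character that is
    not printable ASCII, a tab, or a line break.

    Offsets are character positions in the decoded text, not byte positions.
    """
    text = raw.decode(encoding)
    findings = []
    for offset, ch in enumerate(text):
        # Skip ordinary characters: tab, CR, LF, and U+0020 through U+007E.
        if ch in "\t\r\n" or " " <= ch <= "~":
            continue
        name = unicodedata.name(ch, "<unnamed>")
        findings.append((offset, f"U+{ord(ch):04X}", name))
    return findings


# Hypothetical sample: a no-break space (U+00A0) sits between "John" and "Smith".
sample = "name|city\nJohn\u00a0Smith|New York\n".encode("utf-16")
for finding in find_unusual_chars(sample):
    print(finding)
```

To run it against the real file, read the bytes with `open(path, "rb").read()` and pass them in. If the reported character turns out to be something like NO-BREAK SPACE rather than U+0020, the encoding option is probably doing its job and the box comes from the data itself.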

1 Answer

0

You have to set the encoding option when reading the file:

.option("encoding", "UTF-16")
shalnarkftw