Issue reading a UTF-16 text file using PySpark

Question

I am trying to read a UTF-16 file into a PySpark DataFrame. Wherever the file contains a space, it shows up as a box when I display the data with df.display(). How can I read the file properly?

df = spark.read.option("delimiter", "|") \
        .option("header", "True") \
        .option("encoding", "UTF-16") \
        .option("multiLine", "True") \
        .csv("<<path>>")

Error Screenshot: Space in file while reading through dataframe

Does all of the remaining data show up as desired? If so, my guess is that the "space" isn't a plain U+0020 but some other Unicode space (or a character that isn't rendered visibly and isn't really a space at all). Have you inspected the input file with a hex editor to investigate? – Joachim Sauer, Jul 26 '23 at 11:49
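If a hex editor is not at hand, the same inspection can be done in plain Python. The following is only a sketch: the helper name find_unusual_chars and the sample row are made up for illustration, and it assumes the file is small enough to read into memory on the driver. It decodes the raw bytes and lists every character that is not printable ASCII, a tab, or a line break, together with its code point and Unicode name.

```python
import unicodedata


def find_unusual_chars(raw: bytes, encoding: str = "utf-16"):
    """Return (offset, code point, Unicode name) for each character that is
    not printable ASCII, a tab, or a line break."""
    text = raw.decode(encoding)
    hits = []
    for index, ch in enumerate(text):
        # Skip tabs, line breaks, and the printable ASCII range (space to ~).
        if ch in "\t\r\n" or " " <= ch <= "~":
            continue
        hits.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits


# Illustrative data only: a header row and one row containing a no-break space.
sample = "id|name\n1|John\u00a0Smith\n".encode("utf-16")
print(find_unusual_chars(sample))  # [(14, 'U+00A0', 'NO-BREAK SPACE')]
```

For a real file, pass the bytes read in binary mode, for example open(path, "rb").read() with your own path. If the file has no byte order mark, passing "utf-16-le" as the encoding is probably the safer choice for a file that Notepad++ reports as UTF-16 LE.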

score 0 · Answer 1 · answered Jul 26 '23 at 13:02

Set the encoding option with this syntax:

.option("encoding", "UTF-16")

shalnarkftw


I did specify UTF-16 in my code, exactly as you wrote it. I had mistyped it in the question and have corrected that now. It still does not work. – Rathesh Jul 26 '23 at 15:12
Are you sure the input data is encoded in UTF-16? – shalnarkftw Jul 26 '23 at 15:40
I opened the file in Notepad++. The status bar at the bottom right shows UTF-16 LE. – Rathesh Jul 26 '23 at 16:23
Try saving the dataframe to a file and checking the output encoding. The problem may be related to the display method. – shalnarkftw Jul 26 '23 at 16:44
It worked. When I saved the dataframe, the text came out correctly. – Rathesh Jul 27 '23 at 02:56
Cool, please upvote and accept my answer if it helped you. – shalnarkftw Jul 27 '23 at 09:20
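Since the data came out correctly once saved, the box appears to be a rendering issue rather than a read error. A complementary check that does not depend on how a notebook renders characters is to print the code points of a few values. This is a sketch under assumptions: show_code_points is a hypothetical helper, and the column name "name" in the commented Spark usage is a placeholder for one of your own columns.

```python
def show_code_points(value: str) -> str:
    """Render every character as U+XXXX so the result does not depend on fonts."""
    return " ".join(f"U+{ord(ch):04X}" for ch in value)


# With Spark, assuming a DataFrame `df` and a string column named "name":
# for row in df.select("name").limit(5).collect():
#     print(repr(row["name"]), show_code_points(row["name"]))

print(show_code_points("a b"))  # U+0061 U+0020 U+0062
```

If the values show U+0020 where the box is displayed, the data was read correctly and only the display is at fault. Any other code point in that position points back to the content of the input file.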
