
I use the following method to read a Parquet file in Spark:

scala> val df = spark.read.parquet("hdfs:/ORDER_INFO")
scala> df.show()

When I show the contents of the DataFrame, the string columns are displayed as raw bytes, like below:

[49 4E 53 5F 32 33]
[49 4E 53 5F 32 30]

In the actual data these are strings (for example, the bytes `49 4E 53 5F 32 33` are ASCII for `INS_23`). Can anyone suggest a way to overcome this issue?
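
For reference, the schema Spark inferred can be checked like this (given the raw-byte display, the affected columns presumably come back as `binary` rather than `string`):

scala> df.printSchema()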

Shan

1 Answer


Is your input file encoded? Have you tried this to see if it works for you?

spark.read.option("encoding","UTF-8").parquet("hdfs:/ORDER_INFO")
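
If that does not change anything, the columns may have been written without UTF-8 string annotations (Impala does this by default), so Spark reads them as binary. You could also try telling Spark SQL to interpret Parquet binary columns as strings (a sketch, untested against your data; it assumes the bytes really are UTF-8):

// Interpret Parquet BINARY columns as strings (assumes UTF-8 bytes).
spark.conf.set("spark.sql.parquet.binaryAsString", "true")
val df = spark.read.parquet("hdfs:/ORDER_INFO")
df.show()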
  • I have tried it now, but it still gives the same result. No, it's not encoded. I have used Impala to read the same table, and it gives the output correctly. – Shan Oct 17 '22 at 11:58
  • Sorry, my mistake: the Parquet file is encoded. Also, this issue occurs only for strings, not for integer, long, or double values. – Shan Oct 17 '22 at 12:25
  • Could you please tell me which encoding is being used to encode the Parquet files? – Vikram Patel Oct 17 '22 at 15:40
  • The encoding happened through Impala, using its default encoding. The problem is that I have no idea about that encoding. – Shan Oct 18 '22 at 17:24
  • Got it. In that case you can use the Impala JDBC connector in Spark. It should solve the issue. – Vikram Patel Oct 19 '22 at 04:46
  • The Impala JDBC connector in Spark works fine with it, but the issue is we cannot use it: it causes a performance problem in our system. Do you have any other known method? – Shan Oct 19 '22 at 14:25
  • Solved this issue. It is related to Impala; running the command below before creating the table solves it: `SET PARQUET_ANNOTATE_STRINGS_UTF8=1;` – Shan Oct 19 '22 at 17:48
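
Note: `PARQUET_ANNOTATE_STRINGS_UTF8` only affects Parquet files that Impala writes after the option is set; files written earlier keep their unannotated binary columns. For those, a Spark-side cast should recover the strings (a sketch; the column name `ORDER_ID` is hypothetical, and the bytes are assumed to be UTF-8):

import org.apache.spark.sql.functions.col

// Cast the BINARY column back to STRING; Spark interprets the
// bytes as UTF-8. "ORDER_ID" is a hypothetical column name.
val fixed = df.withColumn("ORDER_ID", col("ORDER_ID").cast("string"))
fixed.show()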