
I use the following method to read a Parquet file in Spark:

scala> val df = spark.read.parquet("hdfs:/ORDER_INFO")
scala> df.show()

When I show the contents of the DataFrame, the string columns are displayed as raw bytes, like below:

[49 4E 53 5F 32 33]
[49 4E 53 5F 32 30]

In the actual data these are strings (for example, the bytes `49 4E 53 5F 32 33` are ASCII for `INS_23`). Can anyone suggest a way to overcome this issue?
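
For reference, the schema Spark inferred can be checked like this (given the raw-byte display, the affected columns presumably come back as `binary` rather than `string`):

scala> df.printSchema()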

Shan

1 Answer


Is your input file encoded? Have you tried this to see if it works for you?

spark.read.option("encoding","UTF-8").parquet("hdfs:/ORDER_INFO")
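
If that does not change anything, the columns may have been written without UTF-8 string annotations (Impala does this by default), so Spark reads them as binary. You could also try telling Spark SQL to interpret Parquet binary columns as strings (a sketch, untested against your data; it assumes the bytes really are UTF-8):

// Interpret Parquet BINARY columns as strings (assumes UTF-8 bytes).
spark.conf.set("spark.sql.parquet.binaryAsString", "true")
val df = spark.read.parquet("hdfs:/ORDER_INFO")
df.show()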
  • I have tried it now, but it still gives the same result. No, it's not encoded. I have used Impala to read the same table, and it gives the output correctly. – Shan Oct 17 '22 at 11:58
  • Sorry, my mistake: the Parquet file is encoded. Also, this issue occurs only for strings, not for integer, long, or double values. – Shan Oct 17 '22 at 12:25
  • Could you please tell me which encoding is being used to encode the Parquet files? – Vikram Patel Oct 17 '22 at 15:40
  • The encoding happened through Impala, using its default encoding. The problem is that I have no idea about that encoding. – Shan Oct 18 '22 at 17:24
  • Got it. In that case you can use the Impala JDBC connector in Spark. It should solve the issue. – Vikram Patel Oct 19 '22 at 04:46
  • The Impala JDBC connector in Spark works fine with it, but the issue is we cannot use it: it causes a performance problem in our system. Do you have any other known method? – Shan Oct 19 '22 at 14:25
  • Solved this issue. It is related to Impala; running the command below before creating the table solves it: `SET PARQUET_ANNOTATE_STRINGS_UTF8=1;` – Shan Oct 19 '22 at 17:48
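
Note: `PARQUET_ANNOTATE_STRINGS_UTF8` only affects Parquet files that Impala writes after the option is set; files written earlier keep their unannotated binary columns. For those, a Spark-side cast should recover the strings (a sketch; the column name `ORDER_ID` is hypothetical, and the bytes are assumed to be UTF-8):

import org.apache.spark.sql.functions.col

// Cast the BINARY column back to STRING; Spark interprets the
// bytes as UTF-8. "ORDER_ID" is a hypothetical column name.
val fixed = df.withColumn("ORDER_ID", col("ORDER_ID").cast("string"))
fixed.show()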