I am trying to query a table in a remote (on-prem) Hive database from my laptop using Spark SQL. I am able to connect to it and retrieve the latest partition.
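For context, the partition lookup that works is along these lines (a sketch of my code; db.tablename stands in for the real table):

// list partitions and take the lexicographically largest partition_epoch value
Dataset<Row> partitions = spark.sql("SHOW PARTITIONS db.tablename");
String latest = partitions.collectAsList().stream()
        .map(row -> row.getString(0)) // rows look like "partition_epoch=1570500000"
        .max(java.util.Comparator.naturalOrder())
        .orElseThrow(() -> new IllegalStateException("no partitions found"));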
However, when I try to retrieve a column (say, pid), it throws the error below:
19/10/08 15:01:19 ERROR Table: Unable to get field from serde: org.apache.hadoop.hive.serde2.avro.AvroSerDe
java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: maprfs:/user/<database_name>//<table_name>/schema/partition_epoch=<partition_id>/xyz.avsc)
Caused by: java.net.MalformedURLException: unknown protocol: maprfs
at java.net.URL.<init>(URL.java:600)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException
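From the stack trace it looks like AvroSerdeUtils builds a java.net.URL for the schema path, and my JVM has no URL stream handler registered for the maprfs scheme. I wondered whether registering Hadoop's filesystem URL handler before building the session would help; this is just a sketch of that idea, assuming hadoop-common and the MapR filesystem jars are on the classpath:

import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

// Teach java.net.URL about Hadoop filesystem schemes such as maprfs://.
// setURLStreamHandlerFactory can only be called once per JVM, so this must
// run before anything else installs a factory.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(new Configuration()));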
I also tried a DESCRIBE TABLE command and attempted to print its schema:
Dataset<Row> descTable = spark.sql("desc db.tablename");
descTable.printSchema();
The printed schema seemed off: it does not list any of the table's fields. Instead, it prints the headers that describe the fields:
root
|-- col_name: string (nullable = false)
|-- data_type: string (nullable = false)
|-- comment: string (nullable = true)
I am expecting something like this:
pid string from deserializer
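Rereading it, I think printSchema() on descTable describes the DESCRIBE result set itself (its three columns), not my table; the actual fields, if Hive returned them, would be rows of that result. A sketch of how I would display them instead:

descTable.show(100, false); // each row should be one field: col_name, data_type, comment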
The code ultimately fails when I query explicit fields:
19/10/08 15:01:25 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<partition_epoch:string>) is different from the schema when this table was created by Spark SQL(struct<all fields and their type>,partition_epoch:string>). We have to fall back to the table schema from Hive metastore which is not case preserving.
19/10/08 15:01:25 ERROR: exception: cannot resolve '`pid`' given input columns: [db.tablename.partition_epoch]; line 1 pos 7;
cannot resolve '`pid`' given input columns: [db.tablename.partition_epoch]; line 1 pos 7;
'Project ['pid]
+- Filter (cast(partition_epoch#9 as int) = 1570500000)
+- SubqueryAlias `db`.`tablename`
+- HiveTableRelation `db`.`tablename`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, [partition_epoch#9]
The following is the code I am using to create the SparkSession and query the table:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("myApp")
    // point the embedded metastore client straight at the metastore's MySQL backend
    .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://url:3306/metastore?createDatabaseIfNotExist=false")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "username")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "password")
    .config("hive.metastore.warehouse.dir", "hive")
    .enableHiveSupport()
    .getOrCreate();

Dataset<Row> productIds = spark.sql("select pid FROM db.tablename WHERE partition_epoch=" + partitionEpoch);
// collectAsList() instead of collect(): printing the raw array from collect()
// only shows the array reference, not the rows
System.out.println(productIds.collectAsList());
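Given the HiveExternalCatalog warning, I suspect Spark only resolves the partition column for this table. A quick check I would run to confirm (sketch):

spark.table("db.tablename").printSchema();
// based on the analyzer error above, I expect this currently prints only:
// root
//  |-- partition_epoch: string (nullable = true)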
I looked for hive.metastore.uris in hive-site.xml under etc/hive/conf, but it was not set there.
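If pointing the session at the thrift metastore service is what's missing, I assume it would be configured like this (the URI is a placeholder, since I could not find the real host and port):

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("myApp")
    // placeholder metastore URI; finding the real host:port is part of my question
    .config("hive.metastore.uris", "thrift://<metastore-host>:9083")
    .enableHiveSupport()
    .getOrCreate();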
How can I resolve the schema error and query the table?