
I am trying to query a table in a remote (on-prem) Hive database from my laptop using Spark SQL. I am able to connect to it and retrieve the latest partition.
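For illustration, the partition lookup is along these lines (a simplified sketch; the table name is a placeholder):

    // Hypothetical sketch: list the table's partitions and take the newest one.
    // SHOW PARTITIONS returns a single string column named "partition".
    Dataset<Row> partitions = spark.sql("SHOW PARTITIONS db.tablename");
    String latestPartition = partitions
            .orderBy(partitions.col("partition").desc())
            .first()
            .getString(0);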

However, when I try to retrieve a column (let's say pid), it throws the error below:

19/10/08 15:01:19 ERROR Table: Unable to get field from serde: org.apache.hadoop.hive.serde2.avro.AvroSerDe
java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: maprfs:/user/<database_name>//<table_name>/schema/partition_epoch=<partition_id>/xyz.avsc)

Caused by: java.net.MalformedURLException: unknown protocol: maprfs
    at java.net.URL.<init>(URL.java:600)
    at java.net.URL.<init>(URL.java:490)
    at java.net.URL.<init>(URL.java:439)
    at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException
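From what I understand, the MalformedURLException means java.net.URL has no stream handler registered for the maprfs scheme, so the Avro SerDe cannot open the schema file. One workaround I have seen suggested (I am not sure it is the right fix) is to register Hadoop's URL stream handler factory before touching the table:

    // Hypothetical workaround: let java.net.URL resolve Hadoop filesystem
    // schemes such as maprfs. This may be called at most once per JVM and
    // assumes the MapR/Hadoop client jars are on the classpath.
    java.net.URL.setURLStreamHandlerFactory(
            new org.apache.hadoop.fs.FsUrlStreamHandlerFactory());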

I tried using the describe table command and attempted to print the schema:

 Dataset<Row> descTable = spark.sql("desc db.tablename");
 descTable.printSchema();

The schema that was printed seemed off: it does not list any fields. Instead, it prints the header columns that describe the fields:

root
 |-- col_name: string (nullable = false)
 |-- data_type: string (nullable = false)
 |-- comment: string (nullable = true)

I am expecting something like this:

pid  string   from deserializer

The code ultimately fails when I query explicit fields:

19/10/08 15:01:25 WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<partition_epoch:string>) is different from the schema when this table was created by Spark SQL(struct<all fields and their type>,partition_epoch:string>). We have to fall back to the table schema from Hive metastore which is not case preserving.
19/10/08 15:01:25 ERROR: exception: cannot resolve '`pid`' given input columns: [db.tablename.partition_epoch]; line 1 pos 7;
cannot resolve '`pid`' given input columns: [db.tablename.partition_epoch]; line 1 pos 7;
'Project ['pid]
+- Filter (cast(partition_epoch#9 as int) = 1570500000)
   +- SubqueryAlias `db`.`tablename`
      +- HiveTableRelation `db`.`tablename`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, [partition_epoch#9]

The following is the code I am using to create the SparkSession and query the table:

SparkSession spark = SparkSession
        .builder()
        .master("local[*]")
        .appName("myApp")
        // Connect directly to the metastore's MySQL backend over JDBC
        .config("spark.hadoop.javax.jdo.option.ConnectionURL", "jdbc:mysql://url:3306/metastore?createDatabaseIfNotExist=false")
        .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
        .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "username")
        .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "password")
        .config("hive.metastore.warehouse.dir", "hive")
        .enableHiveSupport()
        .getOrCreate();

Dataset<Row> productIds = spark.sql("select pid FROM db.tablename WHERE partition_epoch=" + partitionEpoch);
System.out.println(productIds.collectAsList());

I looked for hive.metastore.uris in hive-site.xml under etc/hive/conf, but the file did not have this property.
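If the metastore's thrift endpoint were known, I believe it could be set directly on the session builder instead of the JDBC properties (a sketch; host and port are placeholders):

    SparkSession spark = SparkSession
            .builder()
            .master("local[*]")
            .appName("myApp")
            // Hypothetical: point Spark at the remote Hive metastore service.
            .config("hive.metastore.uris", "thrift://metastore-host:9083")
            .enableHiveSupport()
            .getOrCreate();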

How can I resolve the schema error and query the table?


1 Answer


descTable.printSchema() prints the schema of the DataFrame returned by the DESC query, not the schema of your table. Use descTable.show() to see your table's columns.
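For example (a sketch; numRows and truncate are chosen so long type names are not cut off):

    // Show the table's columns as rows of the DESC result.
    spark.sql("desc db.tablename").show(100, false);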

To validate the issue, check the table property spark.sql.sources.schema. The schema of the table must match the schema described by this property.
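A quick way to inspect it (a sketch; on some Spark versions the schema is split across spark.sql.sources.schema.part.N properties):

    // Dump all table properties, including the Spark-recorded schema.
    spark.sql("SHOW TBLPROPERTIES db.tablename").show(false);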