
I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. I followed the steps in the official AWS documentation (reference link below), but I am seeing a discrepancy in accessing the Glue Catalog DB/tables. Both the EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been granted.

AWS Documentation : https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
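Following those steps, the cluster was created with the `spark-hive-site` classification from that documentation page, which points Spark's Hive client at the Glue catalog:

```
[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```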


Observations:

- Using spark-shell (From EMR Master Node):

  • Works. Able to access Glue DB/tables using the commands below:
    spark.catalog.setCurrentDatabase("test_db")
    spark.catalog.listTables
    

- Using spark-submit (From EMR Step):

  • Does not work. I keep getting the error "Database 'test_db' does not exist"

Error trace:

INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO HiveMetaStore: 0: get_database: default
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: default
INFO HiveMetaStore: 0: get_database: global_temp
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: global_temp
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/6d0f6b2c-cccd-4e90-a524-93dcc5301e20_resources
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/yarn/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20/_tmp_space.db
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
INFO CodeGenerator: Code generated in 191.063411 ms
INFO CodeGenerator: Code generated in 10.27313 ms
INFO HiveMetaStore: 0: get_database: test_db
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: test_db
WARN ObjectStore: Failed to get database test_db, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Database 'test_db' does not exist.;
	at org.apache.spark.sql.internal.CatalogImpl.requireDatabaseExists(CatalogImpl.scala:44)
	at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:64)
	at org.griffin_test.GriffinTest.ingestGriffinRecords(GriffinTest.java:97)
	at org.griffin_test.GriffinTest.main(GriffinTest.java:65)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
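For context, the step runs the job in YARN cluster mode (consistent with the `ApplicationMaster` frames in the trace above). The invocation looks roughly like the following, with the jar path shown as a placeholder:

```shell
spark-submit \
  --deploy-mode cluster \
  --class org.griffin_test.GriffinTest \
  s3://<my-bucket>/jars/griffin-test.jar
```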


After a lot of research and going through many suggestions in blogs, I have tried the fixes below, but to no avail; we are still facing the discrepancy.

Fixes Tried:

- Enabling Hive support in spark-defaults.conf & SparkSession (code):

  • Hive classes are on the CLASSPATH, and the spark.sql.catalogImplementation internal configuration property is set to hive:

    spark.sql.catalogImplementation  hive
    
  • Adding Hive metastore config:

    .config("hive.metastore.connect.retries", 15)
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    

Code Snippet:

SparkSession spark = SparkSession.builder().appName("Test_Glue_Catalog")
                        .config("spark.sql.catalogImplementation", "hive")
                        .config("hive.metastore.connect.retries", 15) 
                        .config("hive.metastore.client.factory.class","com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
                        .enableHiveSupport()
                        .getOrCreate();
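One more data point: on a cluster created with the Glue catalog enabled, `/etc/hive/conf/hive-site.xml` on the master node should already contain the same factory class (worth verifying there as well):

```xml
<property>
  <name>hive.metastore.client.factory.class</name>
  <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
</property>
```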

Any suggestions in figuring out the root cause for this discrepancy would be really helpful.

Appreciate your help! Thank you!

Sridher
    Were you ever able to figure out this issue or develop a workaround? I'm experiencing what looks to be the exact same issue. – Bryan Lott Jan 29 '19 at 19:42
    @BryanLott I got in touch with the AWS support folks and figured out that this discrepancy is a known bug with EMR v5.11.1 even though they claim EMR + Glue combo works from v5.10.0. When I tried on the latest EMR v5.21.0, it worked flawlessly. – Sridher Mar 05 '19 at 18:34
    @Sridher I also contacted the AWS support and they could make it work with Spark 2.3.0 which implies the use of EMR 5.15.0, but I'm not able to make it work with 5.21.0. Is it possible that you can provide a sample Git project? When trying to run it in 5.23.0 I get `java.lang.NoSuchFieldError: INSTANCE;` errors. – Gonzalo Mar 05 '19 at 21:29
  • I got it to work on Spark 2.4.0 by setting my project dependencies to Spark 2.3.0 instead. Also be sure to remove the spark-hive dependency. The issue I mentioned in my previous message is being checked through an internal aws ticket. – Gonzalo Mar 08 '19 at 13:30
  • @Gonzalo Hi, how did you solve the error `org.apache.spark.sql.AnalysisException: java.lang.NoSuchFieldError: INSTANCE`? I am stuck on the same issue. – AbhiK Aug 17 '20 at 13:13
  • @AbhiK Try compiling with Oracle's java instead of OpenJdk. – Gonzalo Aug 18 '20 at 17:38
