
Is there any way to run local master Spark SQL queries against AWS Glue?

I run this code on my local PC:

SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate()
    .sql("show databases"); // this query isn't running against AWS Glue

EDIT: Based on some examples, it appears that the `hive.metastore.uris` configuration key should allow specifying a particular metastore URL. However, it's not clear how to get the relevant value for Glue.

SparkSession.builder()
    .master("local")
    .enableHiveSupport()
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .config("hive.metastore.uris", "thrift://???:9083")
    .getOrCreate()
    .sql("show databases"); // this query isn't running against AWS Glue
Ophir Yoktan
VB_
  • I think that it isn't possible, for two reasons: 1) You can run Glue code using the UI, boto3, or dev endpoints, and you can also use the AWS Glue Data Catalog in AWS EMR, but to my knowledge those are all the options. 2) The Glue service is based on technologies such as Hive and Spark, but it isn't a pure version of them: there are limitations, and the service uses its own libraries. – jbgorski Sep 15 '18 at 15:21
  • @j.b.gorski It looks like our Glue serves only as a metadata store and doesn't transform data. So instead of mocking data for integration tests, I can replace the Glue reader with an S3 reader and read the data directly from S3 (enforcing the same schema). The only error-prone point here is enforcing the schema on a CSV dataset read from S3 – VB_ Sep 15 '18 at 22:09
  • @j.b.gorski What's strange: `session.catalog().listDatabases()` returns the `default` database with Glue's description. Spark SQL also returns `default` when I run `show databases`. But it does not see Glue's other databases – VB_ Sep 15 '18 at 22:10
  • Did you manage to find a solution? – Ophir Yoktan Sep 15 '19 at 14:07
  • Did you find a way to do this? – Ashish Mishra Jan 13 '22 at 07:43

1 Answer


Amazon provides this client, which should solve the problem (I haven't tried it yet):

https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
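
As a rough, untested sketch of how the pieces might fit together: the helper below only assembles the Spark properties one would probably need and prints them as `spark-submit` flags. The factory class name comes from the question. The `spark.hadoop.` prefix (which Spark copies into the Hadoop configuration) and the class name `GlueCatalogConf` are assumptions made for illustration. It also assumes the client jars from the repository above, plus any patched Hive build its README calls for, are already on the classpath. Without them, stock Spark may well ignore the factory setting.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Spark or the Glue client): it only
// collects configuration keys and formats them; it does not start Spark.
class GlueCatalogConf {
    // Factory class name as given in the question.
    static final String GLUE_FACTORY =
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory";

    // Builds the Spark properties for a local session backed by a Hive-style catalog.
    static Map<String, String> glueConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.master", "local");
        // Same effect as calling enableHiveSupport() on the builder.
        conf.put("spark.sql.catalogImplementation", "hive");
        // Assumption: the "spark.hadoop." prefix forwards the key to the
        // Hadoop/Hive configuration that the metastore client reads.
        conf.put("spark.hadoop.hive.metastore.client.factory.class", GLUE_FACTORY);
        return conf;
    }

    // Formats the properties as "--conf key=value" flags for spark-submit.
    static String toSubmitArgs(Map<String, String> conf) {
        return conf.entrySet().stream()
            .map(e -> "--conf " + e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(toSubmitArgs(glueConf()));
    }
}
```

The same key/value pairs could instead be passed one by one to `SparkSession.builder().config(key, value)` before calling `getOrCreate()`. Note that no `hive.metastore.uris` value appears here: with this approach the factory class, not a Thrift URL, is what selects Glue.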

Ophir Yoktan