
I'm trying to figure out how to migrate a use case from EMR to AWS Glue involving Hive views.

In EMR today, I have Hive external tables backed by Parquet in S3, plus additional views defined like create view hive_view as select col from external_table where col = x.

Then in Spark on EMR, I can issue statements like df = spark.sql("select * from hive_view") to reference my Hive view.

I am aware I can use the Glue catalog as a drop-in replacement for the Hive metastore, but I'm trying to migrate the Spark job itself off of EMR to Glue. So in my end state, there is no longer a Hive endpoint, only Glue.

Questions:

  • How do I replace the create view ... statement if I no longer have an EMR cluster to issue Hive commands? What's the equivalent AWS Glue SDK call?

  • How do I reference those views from within a Glue job?

What I've tried so far: using boto3 to call glue.create_table like this

glue = boto3.client('glue')
glue.create_table(
    DatabaseName='glue_db_name',
    TableInput={
        'Name': 'hive_view',
        'TableType': 'VIRTUAL_VIEW',
        'ViewExpandedText': 'select .... from ...'
    })

I can see the object created in the Glue catalog, but its classification shows as "Unknown", and references to it from the job fail with a corresponding error:

py4j.protocol.Py4JJavaError: An error occurred while calling o56.getCatalogSource. : 
java.lang.Error: No classification or connection in bill_glue_poc.view_test at ...

I have validated that I can use Hive views with Spark in EMR with the Glue catalog as the metastore -- I see the view in the Glue catalog, and Spark SQL queries succeed, but I cannot reference the view from within a Glue job.

wrschneider
  • are you by any chance looking for this https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html ? – eliasah Jan 29 '19 at 15:33
  • I have already seen that doc and I am aware that I can point an EMR cluster's Hive metastore at Glue. But what I'm trying to do is reference Hive views from within a Glue job directly. – wrschneider Feb 02 '19 at 12:34
  • @wrschneider Did you ever find a solution to this? – Thomas Larsson Apr 16 '19 at 13:13
  • I haven't looked at this recently but I think there are some changes that allow Glue jobs to use objects in the Glue catalog directly from Spark as one would in EMR - see my other question https://stackoverflow.com/questions/54596569/tables-not-found-in-spark-sql-after-migrating-from-emr-to-aws-glue – wrschneider Apr 16 '19 at 17:43
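The approach hinted at in the last comment above can be sketched roughly as follows, assuming a Glue job created with the "Use Glue Data Catalog as the Hive metastore" option (the `--enable-glue-datacatalog` job parameter); the database and view names are placeholders, and this is untested against a live environment:

```python
# Sketch: querying a Glue catalog object via Spark SQL inside a Glue job.
# Assumes the job has the Glue Data Catalog enabled as the Hive metastore;
# glue_db_name and hive_view are placeholder names.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# With the catalog wired up as the metastore, databases and their objects
# are visible to Spark SQL directly, without getCatalogSource:
df = spark.sql("select * from glue_db_name.hive_view")
```

Whether a VIRTUAL_VIEW created via the SDK resolves this way depends on the catalog integration, per the linked question.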

1 Answer


You can create a temporary view in Spark and query it like a Hive table (Scala):

val dataDyf = glueContext.getSourceWithFormat(
      connectionType = "s3",
      format = "parquet",
      options = JsonOptions(Map(
        "paths" -> Array("s3://bucket/external/folder")
      ))).getDynamicFrame()

// Convert DynamicFrame to Spark's DataFrame and apply filtering
val dataViewDf = dataDyf.toDF().where(...)

dataViewDf.createOrReplaceTempView("hive_view")

val df = spark.sql("select * from hive_view")
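For a PySpark Glue job, an equivalent of the Scala snippet above might look like this; the S3 path, column name, and filter value are placeholders, and this is a sketch rather than something verified against a live Glue environment:

```python
# PySpark version of the temp-view workaround (a sketch; path, column,
# and filter value below are placeholders).
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

data_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/external/folder"]},
    format="parquet")

# Convert the DynamicFrame to a Spark DataFrame, apply the view's
# filter, and register the result as a temporary view:
data_view_df = data_dyf.toDF().where("col = 'x'")
data_view_df.createOrReplaceTempView("hive_view")

df = spark.sql("select * from hive_view")
```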
Yuriy Bondaruk
    I'm not asking about temporary views. I'm asking about persistent Hive views, which can be queried in Spark like Hive tables. – wrschneider Feb 07 '19 at 02:14
  • A temp view gives you the same result as using an external Hive table. So yes, it's a workaround for your problem but not an exact solution. – Yuriy Bondaruk Feb 08 '19 at 19:20