
I can read a table defined in the Glue Data Catalog from a Glue job using the GlueContext. However, if I try to read the exact same table with a HiveContext, I get an error saying that the table cannot be found.

It seems that the HiveContext cannot access the Glue Data Catalog.

Does anyone know what to add to the Glue job configuration (Edit job -> Job parameters -> "--conf xyz") so that the HiveContext can find and access tables in the Glue Data Catalog?

I'd like to execute the following code:

# import libs
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import HiveContext

# create SparkContext and HiveContext
sc = SparkContext()
hc = HiveContext(sc)

# read table from the Glue Data Catalog
df = hc.table('glue_db.glue_table').persist()

The code above returns the following error message:

pyspark.sql.utils.AnalysisException: u"Table or view not found: glue_db.glue_table;;\n'UnresolvedRelation glue_db.glue_table\n"

I have tried Spark versions 2.2 and 2.4.

Many thanks in advance!

I had trouble like this with the latest versions of EMR. I had to go back to 5.26.0. I'm not sure it's your case here, but give it a try. – eliasah Feb 10 '20 at 18:00

1 Answer


Try this:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# use the Spark session exposed by the GlueContext to query the catalog table
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

df = spark.sql("select * from glue_db.glue_table")

Or just create your Spark session directly and bypass the GlueContext completely, as long as you have checked the box that allows the Glue Data Catalog to be used as the Hive metastore (see the sketch below).
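A minimal sketch of that second approach, assuming the Glue Data Catalog integration is enabled for the job (for Glue ETL jobs that is typically the --enable-glue-datacatalog job parameter; on EMR it is the option to use the Glue catalog for Hive table metadata). The hive.metastore.client.factory.class value below is the Glue catalog client factory used on EMR, so treat it as an assumption for your environment:

from pyspark.sql import SparkSession

# Sketch: build a Hive-enabled Spark session directly, without a GlueContext.
# Assumes the Glue Data Catalog is allowed to act as the Hive metastore for this job.
spark = (
    SparkSession.builder
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

# the catalog table should now resolve like an ordinary Hive table
df = spark.table("glue_db.glue_table")
df.show(5)

In Spark 2.x, HiveContext is just a thin wrapper around a Hive-enabled SparkSession, so once the catalog is wired up this way the original hc.table('glue_db.glue_table') call from the question should resolve as well.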
