I'm trying to use an Athena View as a data source to my AWS Glue Job. The error message I'm getting while trying to run the Glue job is about the classification of the view. What can I define it as? Thank you
Asked
Active
Viewed 5,414 times
9

Akira Yamamoto
- 4,685
- 4
- 42
- 43

Nikitas Bompolias
- 113
- 1
- 7
-
The only workaround I could find so far is using the Filter class: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-filter.html In my case I am using the view to filter some data. – Akira Yamamoto Jan 21 '19 at 05:53
-
I found another way of using a SQL query in AWS Glue: https://stackoverflow.com/a/45824735/475876 – Akira Yamamoto Feb 07 '19 at 05:14
1 Answers
13
You can by using the Athena JDBC driver. This approach circumvents the catalog, as only Athena (and not Glue as of 25-Jan-2019) can directly access views.
- Download the driver and store the jar to an S3 bucket.
- Specify the S3 path to the driver as a dependent jar in your job definition.
- Load the data into a dynamic frame using the code below (using an IAM user with permission to run Athena queries).
from awsglue.dynamicframe import DynamicFrame
# ...
athena_view_dataframe = (
glueContext.read.format("jdbc")
.option("user", "[IAM user access key]")
.option("password", "[IAM user secret access key]")
.option("driver", "com.simba.athena.jdbc.Driver")
.option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443")
.option("dbtable", "my_database.my_athena_view")
.option("S3OutputLocation","s3://bucket/temp/folder") # CSVs/metadata dumped here on load
.load()
)
athena_view_datasource = DynamicFrame.fromDF(athena_view_dataframe, glueContext, "athena_view_source")
The driver docs (pdf) provide alternatives to IAM user auth (e.g. SAML, custom provider).
The main side effect to this approach is that loading causes the query results to be dumped in CSV format to the bucket specified with the S3OutputLocation
key.
I don't believe that you can create a Glue Connection to Athena via JDBC because you can't specify an S3 path to the driver location.
Attribution: AWS support totally helped me get this working.

Alejandro C De Baca
- 888
- 2
- 9
- 16
-
Note that you would substitute `us-east-1` for the region in which your Athena view lives. – Alejandro C De Baca Jan 28 '19 at 16:31
-
1
-
@user2768132: Were you able to get all the rows from the athena view? – Nabeel Khan Ghauri May 17 '22 at 13:33
-
@NabeelKhanGhauri I did, however it was not a particularly large data set. Ended up using a different solution in the end that did not depend on views. – Alejandro C De Baca May 19 '22 at 22:08
-
@AlejandroCDeBaca: Thanks for sharing. We are trying to use views but somehow the job appends "where 1=0" while reading from the view. This results in zero rows. We have opened a case(AWS). – Nabeel Khan Ghauri May 20 '22 at 06:37
-
1@NabeelKhanGhauri Could you please let me know if you got the reply from AWS regarding appending "where 1=0" issue while reading from the view? – suyash308 Mar 14 '23 at 10:17