Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework which extends PySpark with functionality for increased schema flexibility
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
4003 questions
11
votes
3 answers

Python logging.getLogger not working in AWS Glue Python shell job

I am trying to set up a logger for my AWS Glue job using Python's logging module. I have a Glue job with the type as "Python Shell" using Python version 3. Logging works fine if I instantiate the logger without any name, but if I give my logger a…
Steve
  • 2,401
  • 3
  • 24
  • 28
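
The excerpt above is cut off, so the exact symptom is not known. As a general Python point, a named logger with no level of its own inherits its effective level from the root logger (WARNING by default), so `info()` calls can be dropped silently. The following is a minimal sketch, not a confirmed fix for the asker's job, that gives a named logger its own level and a handler on stdout. The helper `build_logger` and the logger name `my_glue_job` are hypothetical.

```python
import logging
import sys


def build_logger(name, stream=None, level=logging.INFO):
    """Return a named logger that writes to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    # Without an explicit level, the logger inherits the root level (WARNING by default).
    logger.setLevel(level)
    # Do not pass records on to the root logger, to avoid duplicate lines.
    logger.propagate = False

    # Attach a handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        target = stream if stream is not None else sys.stdout
        handler = logging.StreamHandler(target)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# "my_glue_job" is a hypothetical logger name used for illustration.
log = build_logger("my_glue_job")
log.info("job started")
```

Setting the level and handler on the named logger itself keeps the example independent of however the root logger happens to be configured by the runtime.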
11
votes
3 answers

AWS Athena - duplicate columns due to partitioning

We have a Glue crawler that reads Avro files in S3 and creates a table in the Glue catalog accordingly. The thing is that we have a column named 'foo' that came from the Avro schema and we also have something like 'foo=XXXX' in the S3 bucket path, to have…
Yannick
  • 1,240
  • 2
  • 13
  • 25
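
To illustrate the collision described above, here is a small sketch that pulls Hive-style `key=value` segments out of an object key and reports any name that also appears as a column in the file schema. It only detects the clash; it does not claim to be how the crawler works internally. The function names and the example key are hypothetical.

```python
def partition_keys_from_path(s3_key):
    """Extract Hive-style `key=value` partition segments from an object key."""
    partitions = {}
    # Skip the last segment, which is the file name rather than a partition folder.
    for segment in s3_key.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


def colliding_columns(file_columns, s3_key):
    """Return names that appear both in the file schema and in the path."""
    path_keys = partition_keys_from_path(s3_key)
    return sorted(set(file_columns) & set(path_keys))


# Hypothetical layout: the file schema has a 'foo' column and the path has 'foo=XXXX'.
key = "events/foo=XXXX/part-0000.avro"
print(colliding_columns(["id", "foo", "ts"], key))  # prints ['foo']
```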
11
votes
4 answers

AWS Glue and update duplicating data

I'm using AWS Glue to move multiple files to an RDS instance from S3. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved with some updated values. If I run the job multiple times I will…
joshuahornby10
  • 4,222
  • 8
  • 36
  • 52
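
The general idea behind avoiding duplicates on re-runs is to make the load idempotent by keying each row on a primary key. The sketch below shows that idea with Python's built-in `sqlite3` as a stand-in database, since the question does not say which RDS engine is used. `INSERT OR REPLACE` is SQLite-specific; other engines have their own upsert syntax (for example `INSERT ... ON CONFLICT` in PostgreSQL or `INSERT ... ON DUPLICATE KEY UPDATE` in MySQL). The table and column names are hypothetical.

```python
import sqlite3


def load_rows(conn, rows):
    """Insert rows keyed by `id`; a row whose id already exists is replaced."""
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
        rows,
    )
    conn.commit()


# In-memory SQLite database used purely as a stand-in for the real target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

day_one = [(1, "Ada"), (2, "Grace")]
day_two = [(2, "Grace H."), (3, "Edsger")]  # id 2 arrives again with an updated value

load_rows(conn, day_one)
load_rows(conn, day_two)
load_rows(conn, day_two)  # re-running the same load does not add rows

print(conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall())
# prints [(1, 'Ada'), (2, 'Grace H.'), (3, 'Edsger')]
```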
11
votes
1 answer

Access AWS Glue from local Spark

Is there any way to run local master Spark SQL queries against AWS Glue? Launch this code on my local PC: SparkSession.builder() .master("local") .enableHiveSupport() .config("hive.metastore.client.factory.class",…
VB_
  • 45,112
  • 42
  • 145
  • 293
11
votes
2 answers

Is it required to run AWS Glue crawler to detect new data before executing an ETL job?

The AWS Glue docs clearly state that crawlers scrape metadata from the source (JDBC or S3) and populate the Data Catalog (creating/updating the DB and corresponding tables). However, it's not clear whether we need to run a crawler regularly to…
Yuriy Bondaruk
  • 4,512
  • 2
  • 33
  • 49
11
votes
4 answers

How to set up a local development environment for Scala Spark ETL to run in AWS Glue?

I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS. The aws-java-sdk-glue doesn't contain…
James
  • 1,095
  • 7
  • 20
11
votes
2 answers

Is there a temporary folder that I can access while using AWS Glue?

Is there a temporary folder that I can access to hold files temporarily while running processes within AWS Glue? For example, in Lambda we have access to a /tmp directory as long as the process is executing. Do we have something similar in AWS…
Leyth G
  • 1,103
  • 2
  • 15
  • 38
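
Whether a Glue job exposes writable local scratch space is exactly what this question asks, and the excerpt does not answer it. Assuming the runtime does provide some writable temporary location, a portable sketch is to let Python's `tempfile` module pick the directory instead of hard-coding `/tmp`. The function name and file name below are hypothetical.

```python
import os
import tempfile


def process_with_scratch_file(payload: bytes) -> int:
    """Write payload to a scratch file, read its size back, and clean up."""
    # TemporaryDirectory creates a directory under the platform's temp location.
    with tempfile.TemporaryDirectory() as scratch_dir:
        path = os.path.join(scratch_dir, "intermediate.bin")
        with open(path, "wb") as fh:
            fh.write(payload)
        size = os.path.getsize(path)
    # The directory and its contents are removed when the `with` block exits.
    return size


print(process_with_scratch_file(b"abc"))  # prints 3
```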
11
votes
3 answers

Is there a way to run an AWS Glue crawler after a job is finished?

For example, I run an ETL job and new fields or columns may be added to the target table. To detect table changes a crawler should be run, but it can only be run manually or on a schedule. Can a crawler be triggered after a job is finished?
Cherry
  • 31,309
  • 66
  • 224
  • 364
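
One way to approximate "run the crawler when the job finishes" from outside Glue is to poll the job run and then start the crawler through the API. The sketch below assumes a boto3 Glue client is passed in as `glue`; it uses the client's `get_job_run` and `start_crawler` calls. The function name, job name, run ID, and crawler name are hypothetical, and this is only one possible approach, not necessarily the one recommended in the answers.

```python
import time

# States in which the job run is still in progress; anything else is treated as final.
IN_PROGRESS = {"STARTING", "RUNNING", "STOPPING", "WAITING"}


def run_crawler_after_job(glue, job_name, run_id, crawler_name, poll_seconds=30):
    """Wait for a job run to finish, then start the crawler if it succeeded."""
    while True:
        response = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state not in IN_PROGRESS:
            break
        time.sleep(poll_seconds)

    if state == "SUCCEEDED":
        glue.start_crawler(Name=crawler_name)
        return True
    return False


# Usage sketch (requires boto3 and AWS credentials; all names are hypothetical):
#   import boto3
#   glue = boto3.client("glue")
#   run_crawler_after_job(glue, "my-etl-job", "jr_example", "my-crawler")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub instead of a live AWS account.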
11
votes
4 answers

AWS Glue - Truncate destination Postgres table prior to insert

I am trying to truncate a Postgres destination table prior to insert and, more generally, to fire external functions using the connections already created in Glue. Has anyone been able to do so?
Josh Hamann
  • 123
  • 1
  • 1
  • 8
11
votes
4 answers

AWS Glue jobs not writing to S3

I have just been playing around with Glue but have yet to get it to successfully create a new table in an existing S3 bucket. The job will execute without error but there is never any output in S3. Here's what the auto generated code…
billobo
  • 111
  • 1
  • 3
10
votes
7 answers

How to run arbitrary / DDL SQL statements or stored procedures using AWS Glue

Is it possible to execute arbitrary SQL commands like ALTER TABLE from an AWS Glue Python job? I know I can use it to read data from tables, but is there a way to execute other database-specific commands? I need to ingest data into a target database and…
mishkin
  • 5,932
  • 8
  • 45
  • 64
10
votes
2 answers

TypeError: 'JavaPackage' object is not callable AWS Glue Pyspark

I am trying to set up an AWS Glue environment on my Ubuntu VirtualBox by following the AWS documentation. I have completed the required steps, such as downloading the AWS Glue libs and the Spark package and setting up Spark home as suggested. After that, I am not able to initialize…
rpshgupta
  • 135
  • 1
  • 8
10
votes
6 answers

Can AWS Glue crawl Delta Lake table data?

According to the article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure if it is also possible to do so outside of the Databricks platform. Has someone done that? Also, is it possible to add Delta Lake…
gorros
  • 1,411
  • 1
  • 18
  • 29
10
votes
2 answers

Terraform AWS Athena to use Glue catalog as db

I'm confused as to how I should use Terraform to connect Athena to my Glue Catalog database. I use resource "aws_glue_catalog_database" "catalog_database" { name = "${var.glue_db_name}" } resource "aws_glue_crawler" "datalake_crawler" { …
10
votes
0 answers

AWS Glue disable sslmode for target connections

I am quite new to AWS Glue; we are building an ETL process that pulls data from an external source on a MySQL database into Redshift. After adding the connections and testing them, it would connect successfully to the instance (without…
Mo J. Mughrabi
  • 6,747
  • 16
  • 85
  • 143