Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

A data catalog (implementing functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS including Amazon RDS and Amazon Redshift
Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
A distributed data processing framework which extends PySpark with functionality for increased schema flexibility.
Code generation tools to template and bootstrap data processing scripts
Scheduling for crawlers and data processing scripts
Serverless development and execution of scripts in an Apache Spark (2.x) environment.

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

Amazon Redshift Spectrum
EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
Amazon Athena
AWS Glue scripts

4003 questions

3 answers

Python logging.getLogger not working in AWS Glue python shell job

I am trying to set up a logger for my AWS Glue job using Python's logging module. I have a Glue job with the type as "Python Shell" using Python version 3. Logging works fine if I instantiate the logger without any name, but if I give my logger a…

asked Jan 24 '20 at 22:46

Steve

3 answers

AWS Athena - duplicate columns due to partitioning

We have a Glue crawler that reads Avro files in S3 and creates a table in the Glue catalog accordingly. The thing is that we have a column named 'foo' that comes from the Avro schema and we also have something like 'foo=XXXX' in the S3 bucket path, to have…

amazon-web-services amazon-s3 avro aws-glue amazon-athena

asked Dec 10 '19 at 13:47

Yannick

4 answers

AWS Glue and update duplicating data

I'm using AWS Glue to move multiple files to an RDS instance from S3. Each day I get a new file in S3 which may contain new data, but can also contain a record I have already saved with some updated values. If I run the job multiple times I will…

python amazon-web-services pyspark etl aws-glue

asked Nov 22 '18 at 19:21

joshuahornby10

1 answer

Access AWS Glue from local Spark

Is there any way to run local master Spark SQL queries against AWS Glue? Launch this code on my local PC: SparkSession.builder() .master("local") .enableHiveSupport() .config("hive.metastore.client.factory.class",…

amazon-web-services apache-spark apache-spark-sql aws-glue

asked Sep 15 '18 at 12:49

VB_

2 answers

Is it required to run AWS Glue crawler to detect new data before executing an ETL job?

The AWS Glue docs clearly state that crawlers scrape metadata from the source (JDBC or S3) and populate the Data Catalog (creating or updating the database and corresponding tables). However, it's not clear whether we need to run a crawler regularly to…

amazon-web-services aws-glue

asked Apr 11 '18 at 13:35

Yuriy Bondaruk

4 answers

How to set up a local development environment for Scala Spark ETL to run in AWS Glue?

I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS. The aws-java-sdk-glue doesn't contain…

scala pyspark sbt aws-glue

asked Mar 13 '18 at 10:42

James

2 answers

Is there a temporary folder that I can access while using AWS Glue?

Is there a temporary folder that I can access to hold files temporarily while running processes within AWS Glue? For example, in Lambda we have access to a /tmp directory as long as the process is executing. Do we have something similar in AWS…

amazon-web-services pyspark aws-glue

asked Jan 12 '18 at 18:29

Leyth G

3 answers

Is there a way to run aws glue crawler after job is finished?

For example, I run an ETL job, and new fields or columns may be added to the target table. To detect table changes a crawler should be run, but it can only be run manually or on a schedule. Can a crawler be triggered after a job finishes?

amazon-web-services aws-glue

asked Jan 11 '18 at 05:46

Cherry

4 answers

AWS Glue - Truncate destination postgres table prior to insert

I am trying to truncate a Postgres destination table prior to insert and, more generally, to fire external functions using the connections already created in Glue. Has anyone been able to do so?

python postgresql pyspark aws-glue

asked Nov 02 '17 at 17:16

Josh Hamann

4 answers

AWS Glue jobs not writing to S3

I have just been playing around with Glue but have yet to get it to successfully create a new table in an existing S3 bucket. The job will execute without error but there is never any output in S3. Here's what the auto generated code…

amazon-s3 aws-glue

asked Sep 21 '17 at 05:59

billobo

7 answers

How to run arbitrary / DDL SQL statements or stored procedures using AWS Glue

Is it possible to execute arbitrary SQL commands like ALTER TABLE from an AWS Glue Python job? I know I can use it to read data from tables, but is there a way to execute other database-specific commands? I need to ingest data into a target database and…

pyspark aws-glue py4j

asked Nov 10 '20 at 19:46

mishkin

2 answers

TypeError: 'JavaPackage' object is not callable AWS Glue Pyspark

I am trying to set up an AWS Glue environment on my Ubuntu VirtualBox by following the AWS documentation. I have done the needful, like downloading the AWS Glue libs and the Spark package and setting up Spark home as suggested. After that, I am not able to initialize…

java pyspark aws-glue

asked Apr 12 '20 at 15:06

rpshgupta

6 answers

Can AWS Glue crawl Delta Lake table data?

According to the article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure whether it is also possible outside of the Databricks platform. Has someone done that? Also, is it possible to add Delta Lake…

apache-spark amazon-s3 aws-glue delta-lake

asked Oct 02 '19 at 06:00

gorros

2 answers

Terraform AWS Athena to use Glue catalog as db

I'm confused as to how I should use terraform to connect Athena to my Glue Catalog database. I use resource "aws_glue_catalog_database" "catalog_database" { name = "${var.glue_db_name}" } resource "aws_glue_crawler" "datalake_crawler" { …

amazon-web-services terraform aws-glue terraform-provider-aws aws-glue-data-catalog

asked Mar 12 '19 at 19:08

Steven

0 answers

AWS Glue disable sslmode for target connections

I am quite new to AWS Glue; we are building an ETL process that pulls data from an external source on a MySQL database into Redshift. After adding the connections and testing them, it would connect successfully to the instance (without…

mysql jdbc etl aws-glue

asked Aug 20 '18 at 15:23

Mo J. Mughrabi

