Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework that extends PySpark with functionality for increased schema flexibility
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment

Data registered in the AWS Glue Data Catalog is available to many AWS Services, including

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
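The "schema flexibility" mentioned in component 3 refers to Glue's DynamicFrames, which, unlike plain Spark DataFrames, can record more than one observed type per field (a "choice" type) and defer resolution until the script decides how to handle it. The idea can be sketched in plain Python, without the awsglue library; the function below is illustrative, not part of any Glue API:

```python
from collections import defaultdict

def infer_flexible_schema(records):
    """Collect every type observed for each field across records.

    Plain-Python illustration of the idea behind Glue's schema
    flexibility: rather than forcing one type per column, track the
    full set of observed types and resolve conflicts later.
    """
    schema = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

# A column whose type drifts between files, as a crawler might find in S3:
rows = [{"id": 1, "name": "a"}, {"id": "2", "name": "b"}]
print(infer_flexible_schema(rows))
# → {'id': ['int', 'str'], 'name': ['str']}
```

In real Glue scripts the analogous resolution step is `DynamicFrame.resolveChoice`, which casts or splits fields that carry more than one observed type.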
4003 questions
15 votes, 3 answers

AWS Glue takes a long time to finish

I just ran a very simple job, as follows: glueContext = GlueContext(SparkContext.getOrCreate()) l_table = glueContext.create_dynamic_frame.from_catalog( database="gluecatalog", table_name="fctable") l_table =…
Shawn
15 votes, 1 answer

Exception with Table identified via AWS Glue Crawler and stored in Data Catalog

I'm working to build the company's new data lake and am trying to find the best and most recent option to work with here. I found a pretty nice solution working with EMR + S3 + Athena + Glue. The process that I did was: 1 - Run Apache Spark…
14 votes, 3 answers

How to use extra files for AWS glue job

I have an ETL job written in Python, which consists of multiple scripts with the following directory structure:

    my_etl_job
    |-- services
    |   |-- __init__.py
    |   |-- dynamoDB_service.py
    |-- __init__.py
    |-- main.py
    |-- logger.py

main.py is…
Anum Sheraz
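The standard way to ship supporting modules to a Glue job is the `--extra-py-files` job parameter, which takes an S3 path to a .zip of the extra sources. The packaging half can be sketched in plain Python; `package_modules` and the bucket path below are our own illustrative names, not Glue APIs:

```python
import zipfile
from pathlib import Path

def package_modules(root: str, out: str) -> list[str]:
    """Zip every .py file under `root`, preserving relative paths so
    that e.g. `from services import dynamoDB_service` still works once
    Glue unpacks the archive alongside main.py."""
    root_path = Path(root)
    archived = []
    with zipfile.ZipFile(out, "w") as zf:
        for py in sorted(root_path.rglob("*.py")):
            arcname = py.relative_to(root_path)
            zf.write(py, arcname)
            archived.append(str(arcname))
    return archived

# Upload the resulting archive to S3, then pass it to the job, e.g.:
#   --extra-py-files s3://your-bucket/deps.zip   (bucket path is illustrative)
```

Glue adds the archive to the job's Python path, so package-style imports keep working as long as the zip preserves the directory layout.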
14 votes, 2 answers

AWS Athena - GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters

I'm querying a table in Athena that is giving the error: GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters. I was able to query it earlier, but added another partition (via an AWS Glue job) to try and optimize joins. I will…
Neil Galloway
14 votes, 0 answers

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. As per the guidelines provided in the official AWS documentation (reference link below), I have followed the steps, but I am facing some…
14 votes, 2 answers

AWS Glue output file name

I am using AWS to transform some JSON files. I have added the files to Glue from S3. The job I have set up reads the files in OK, the job runs successfully, and there is a file added to the correct S3 bucket. The issue I have is that I can't name the…
Ewan Peters
14 votes, 1 answer

using AWS Glue with Apache Avro on schema changes

I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but am struggling through the following use case: We have an S3 bucket with a number of Avro files. We have decided to use Avro due to having extensive support for data…
CharStar
13 votes, 1 answer

How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty folders that Spark generates, "$folder$"…
Lina
13 votes, 1 answer

AWS Athena partition fetch all paths

Recently, I've experienced an issue with AWS Athena when there is quite a high number of partitions. The old version had a database and tables with only one partition level, say id=x. Let's take one table, for example, where we store payment parameters…
null
13 votes, 2 answers

How do I set multiple --conf table parameters in AWS Glue?

Multiple answers on Stack Overflow for AWS Glue say to set the --conf table parameter. However, sometimes we'll need to set multiple --conf key-value pairs in one job. I've tried the following ways to have multiple --conf values set all…
Zambonilli
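A workaround commonly cited in answers to this question (unofficial — verify against current Glue behavior) is to pack every Spark setting into the single `--conf` job parameter, chaining additional `--conf` tokens inside its value, since Glue keeps only one value per parameter key. The specific Spark settings below are illustrative, not required:

```python
# Hypothetical job-parameter map, e.g. the DefaultArguments passed when
# creating a Glue job. Glue stores one value per key, so the extra
# "--conf" tokens are smuggled inside the first value and end up as
# separate --conf flags on the underlying spark-submit command line.
default_arguments = {
    "--conf": (
        "spark.driver.maxResultSize=2g"
        " --conf spark.executor.memory=4g"
        " --conf spark.sql.shuffle.partitions=200"
    ),
}
print(default_arguments["--conf"])
```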
13 votes, 1 answer

AWS Glue crawler - partition keys types

I am using Spark to write files to S3 in ORC format, and Athena to query this data. I am using the following partition keys: s3://bucket/company=1123/date=20190207. Once I run the Glue crawler on the bucket, everything works as…
Alex Stanovsky
13 votes, 5 answers

How to set the name for a crawled table?

The AWS crawler has a prefix property for adding new tables. So if I leave the prefix empty and start the crawler on s3://my-bucket/some-table-backup, it creates a table with the name some-table-backup. Is there a way to rename it to my-awesome-table and keep the crawler…
Cherry
13 votes, 3 answers

glue job for redshift connection: "Unable to find suitable security group"

I'm trying to set up an AWS Glue job and make a connection to Redshift. I'm getting an error when I set the connection type to Redshift: "Unable to find a suitable security group. Change connection type to JDBC and retry adding your…
user3871
12 votes, 0 answers

Glue Dynamic Frame is way slower than regular Spark

In the image below we have the same Glue job run with three different configurations in terms of how we write to S3: (1) we used a dynamic frame to write to S3; (2) we used a pure Spark frame to write to S3; (3) same as 1, but reducing the number of worker nodes…
justHelloWorld
12 votes, 1 answer

Using AWS glue schema registry with confluent SerDe clients

For supporting a schema registry on my MSK topic, I found two options: AWS Glue Schema Registry and Confluent Schema Registry. Since Glue SR is fully managed by AWS, I would prefer to use that. However, my producer and consumer clients are written…