Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework that extends PySpark with functionality for increased schema flexibility
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment

Data registered in the AWS Glue Data Catalog is available to many AWS Services, including

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
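The "schema flexibility" mentioned in component 3 refers to Glue's DynamicFrames, which, unlike plain Spark DataFrames, can record more than one observed type per field (a "choice" type) and defer resolution until the script decides how to handle it. The idea can be sketched in plain Python, without the awsglue library; the function below is illustrative, not part of any Glue API:

```python
from collections import defaultdict

def infer_flexible_schema(records):
    """Collect every type observed for each field across records.

    Plain-Python illustration of the idea behind Glue's schema
    flexibility: rather than forcing one type per column, track the
    full set of observed types and resolve conflicts later.
    """
    schema = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

# A column whose type drifts between files, as a crawler might find in S3:
rows = [{"id": 1, "name": "a"}, {"id": "2", "name": "b"}]
print(infer_flexible_schema(rows))
# → {'id': ['int', 'str'], 'name': ['str']}
```

In real Glue scripts the analogous resolution step is `DynamicFrame.resolveChoice`, which casts or splits fields that carry more than one observed type.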
4003 questions
15 votes, 3 answers

AWS Glue takes a long time to finish

I just ran a very simple job, as follows: glueContext = GlueContext(SparkContext.getOrCreate()) l_table = glueContext.create_dynamic_frame.from_catalog( database="gluecatalog", table_name="fctable") l_table =…
Shawn
15 votes, 1 answer

Exception with Table identified via AWS Glue Crawler and stored in Data Catalog

I'm working to build the company's new data lake and am trying to find the best and most recent option to work with here. I found a pretty nice solution working with EMR + S3 + Athena + Glue. The process that I did was: 1 - Run Apache Spark…
14 votes, 3 answers

How to use extra files for AWS glue job

I have an ETL job written in Python, which consists of multiple scripts with the following directory structure:

    my_etl_job
    |-- services
    |   |-- __init__.py
    |   |-- dynamoDB_service.py
    |-- __init__.py
    |-- main.py
    |-- logger.py

main.py is…
Anum Sheraz
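The standard way to ship supporting modules to a Glue job is the `--extra-py-files` job parameter, which takes an S3 path to a .zip of the extra sources. The packaging half can be sketched in plain Python; `package_modules` and the bucket path below are our own illustrative names, not Glue APIs:

```python
import zipfile
from pathlib import Path

def package_modules(root: str, out: str) -> list[str]:
    """Zip every .py file under `root`, preserving relative paths so
    that e.g. `from services import dynamoDB_service` still works once
    Glue unpacks the archive alongside main.py."""
    root_path = Path(root)
    archived = []
    with zipfile.ZipFile(out, "w") as zf:
        for py in sorted(root_path.rglob("*.py")):
            arcname = py.relative_to(root_path)
            zf.write(py, arcname)
            archived.append(str(arcname))
    return archived

# Upload the resulting archive to S3, then pass it to the job, e.g.:
#   --extra-py-files s3://your-bucket/deps.zip   (bucket path is illustrative)
```

Glue adds the archive to the job's Python path, so package-style imports keep working as long as the zip preserves the directory layout.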
14 votes, 2 answers

AWS Athena - GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters

I'm querying a table in Athena that is giving the error: GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters. I was able to query it earlier, but added another partition (via an AWS Glue job) to try and optimize joins. I will…
Neil Galloway
14 votes, 0 answers

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. As per the guidelines provided in the official AWS documentation (reference link below), I have followed the steps, but I am facing some…
14 votes, 2 answers

AWS Glue output file name

I am using AWS to transform some JSON files. I have added the files to Glue from S3. The job I have set up reads the files in OK, the job runs successfully, and there is a file added to the correct S3 bucket. The issue I have is that I can't name the…
Ewan Peters
14 votes, 1 answer

using AWS Glue with Apache Avro on schema changes

I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but am struggling through the following use case: We have an S3 bucket with a number of Avro files. We have decided to use Avro due to having extensive support for data…
CharStar
13 votes, 1 answer

How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty folders that Spark generates, "$folder$"…
Lina
13 votes, 1 answer

AWS Athena partition fetch all paths

Recently, I've experienced an issue with AWS Athena when there is quite a high number of partitions. The old version had a database and tables with only one partition level, say id=x. Let's take one table, for example, where we store payment parameters…
null
13 votes, 2 answers

How do I set multiple --conf table parameters in AWS Glue?

Multiple answers on Stack Overflow for AWS Glue say to set the --conf table parameter. However, sometimes we'll need to set multiple --conf key-value pairs in one job. I've tried the following ways to have multiple --conf values set all…
Zambonilli
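A workaround commonly cited in answers to this question (unofficial — verify against current Glue behavior) is to pack every Spark setting into the single `--conf` job parameter, chaining additional `--conf` tokens inside its value, since Glue keeps only one value per parameter key. The specific Spark settings below are illustrative, not required:

```python
# Hypothetical job-parameter map, e.g. the DefaultArguments passed when
# creating a Glue job. Glue stores one value per key, so the extra
# "--conf" tokens are smuggled inside the first value and end up as
# separate --conf flags on the underlying spark-submit command line.
default_arguments = {
    "--conf": (
        "spark.driver.maxResultSize=2g"
        " --conf spark.executor.memory=4g"
        " --conf spark.sql.shuffle.partitions=200"
    ),
}
print(default_arguments["--conf"])
```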
13 votes, 1 answer

AWS Glue crawler - partition keys types

I am using Spark to write files to S3 in ORC format, and Athena to query this data. I am using the following partition keys: s3://bucket/company=1123/date=20190207. Once I run the Glue crawler on the bucket, everything works as…
Alex Stanovsky
13 votes, 5 answers

How to set the name for a crawled table?

The AWS crawler has a prefix property for adding new tables. So if I leave the prefix empty and start the crawler on s3://my-bucket/some-table-backup, it creates a table with the name some-table-backup. Is there a way to rename it to my-awesome-table and keep the crawler…
Cherry
13 votes, 3 answers

glue job for redshift connection: "Unable to find suitable security group"

I'm trying to set up an AWS Glue job and make a connection to Redshift. I'm getting an error when I set the connection type to Redshift: "Unable to find a suitable security group. Change connection type to JDBC and retry adding your…
user3871
12 votes, 0 answers

Glue Dynamic Frame is way slower than regular Spark

In the image below we have the same Glue job run with three different configurations in terms of how we write to S3: (1) we used a dynamic frame to write to S3; (2) we used a pure Spark frame to write to S3; (3) same as 1, but reducing the number of worker nodes…
justHelloWorld
12 votes, 1 answer

Using AWS glue schema registry with confluent SerDe clients

For supporting a schema registry on my MSK topic, I found two options: AWS Glue Schema Registry and Confluent Schema Registry. Since Glue SR is fully managed by AWS, I would prefer to use that. However, my producer and consumer clients are written…