Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework which extends PySpark with functionality for increased schema flexibility.
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment.

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
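
Serverless execution of scripts (component 6 above) is usually driven through the `StartJobRun` API, where job parameters are passed as an `Arguments` map whose keys conventionally start with `--`. The following is only a minimal sketch of shaping such a request in Python. The helper `build_job_run_request`, the job name, and the S3 paths are hypothetical and used purely for illustration. The sketch builds the request without calling AWS.

```python
def build_job_run_request(job_name, params):
    """Shape keyword arguments for a StartJobRun call (hypothetical helper)."""
    arguments = {}
    for key, value in params.items():
        # Glue job arguments are conventionally named with a leading "--".
        name = key if key.startswith("--") else "--" + key
        # Argument values are passed as strings.
        arguments[name] = str(value)
    return {"JobName": job_name, "Arguments": arguments}


# Hypothetical job name and S3 locations, for illustration only.
request = build_job_run_request(
    "example-etl-job",
    {
        "source_path": "s3://example-bucket/input/",
        "target_path": "s3://example-bucket/output/",
    },
)
print(request)

# With boto3 installed and credentials configured, the request could be sent with:
#   import boto3
#   boto3.client("glue").start_job_run(**request)
```

Inside the job script, such arguments are typically read back with `getResolvedOptions` from `awsglue.utils`.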
4003 questions
1
vote
1 answer

How do I convert String to date in AWS Glue?

When I run a Glue crawler on a Parquet/CSV file in an S3 bucket, it reads Date as a string. I changed it to Date in Edit Schema, but neither the date nor any other data type changes there. When I run a query from Athena "select order_date…
1
vote
1 answer

How to get the resource link for a glue job in the StepFunction state machine execution event history?

I'm using Step Functions to start a Glue job. Oddly, the Execution event history gives me no resource link to this Glue job, whereas it does give me the corresponding Lambda and logs links. Am I missing permissions in the step function…
1
vote
1 answer

What's the difference between startjobrun and getjobrun- StepFunction with Glue?

This is the doc: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-runs.html#aws-glue-api-jobs-runs-StartJobRun I have a step function that I want to use to trigger and run an existing Glue job. Should I use startjobrun or getjobrun? My…
1
vote
1 answer

How to import RateLimiter in AWS Glue Python

I want to add a rate limiter for calls from my python script glue job to DDB, and mitigate its call volume spikes. I implemented something like the following, like what is suggested in https://pypi.org/project/ratelimiter/ : from ratelimiter import…
user8142520
  • 751
  • 1
  • 6
  • 20
1
vote
2 answers

Athena partition projection date formatting

I want to use Athena partition projection to write queries that filter on date partition columns. The issue is that I need the physical data format that will be projected (the S3 file prefixes) to differ from the date formats that users will query…
Golammott
  • 486
  • 2
  • 15
1
vote
1 answer

How to set up modules in the Glue job script

I'm following this tutorial: https://www.youtube.com/watch?v=EzQArFt_On4 In this tutorial it's only using one Python script. What if I need to import some functions from another Python script? For example: import script2 I wonder what's the correct…
wawawa
  • 2,835
  • 6
  • 44
  • 105
1
vote
1 answer

AWS Glue. How to create a compound key for Job bookmarks?

I have a JDBC source (PostgreSQL) with a table, which I want to fetch by Glue. My table has columns: id (bigint) name (string) updated_at (timestamp) I've set up the table in the Glue data catalog with a crawler, set up a job and…
Alex
  • 27
  • 8
1
vote
1 answer

add missing column to AWS Glue DataFrame

I am reading a DynamoDB table with Glue. Because of the dynamic schema, some columns may not exist. Adding them works fine with the following code, but I am not sure how to make the function dynamic if I need to add multiple…
Tobias Bruckert
  • 348
  • 2
  • 12
1
vote
1 answer

AWS Glue Workflow to trigger email on any ETL job failure

In AWS Glue, I am executing a couple of ETL jobs using a workflow. Now I want to inform the business via email when any of the ETL jobs fails. I need help getting the name of the failed job and passing it to a job that would trigger an email.
1
vote
1 answer

Using Deequ on AWS Glue

I am using Deequ on AWS Glue. Surprisingly, when I try to run hasMaxLength, which is listed under Checks for the verificationSuite, I get the following error. Can someone help? All other checks are passing/running. It says the check hasMaxLength…
user3476582
  • 75
  • 1
  • 10
1
vote
0 answers

AWS Glue Connection To RDS Mysql v8

I am connecting AWS Glue to an RDS instance in an external account. This is my code, and I have set up VPC peering, opened all TCP ports, and enabled public accessibility. (I have another RDS instance running MySQL v5 in the same VPC in the external account and the glue…
1
vote
0 answers

Running AWS Glue ETL Job (Spark) for large data

Currently, I have a Glue ETL script in Scala. These are my Glue script settings: Spark 2.4, Scala 2 (Glue Version 2.0) Worker type: G.1X (recommended for memory-intensive jobs) Number of workers: 10 I am reading 60 GB data in the database…
1
vote
0 answers

What is the relation between hashpartitions and the number of workers when using from_options in AWS Glue?

I have created glue job to read the data from oracle by using below code. WhereQuery="select * from test where dated>==CURRENT_DATE-4 connection_oracle11_options = { "url": URL, "dbtable": tableName, "user": USERNAME, "password":…
Sai
  • 1,075
  • 5
  • 31
  • 58
1
vote
1 answer

Step function to invoke glue job and lambda function with passed parameters

Scenario: I want to pass the S3 source file location and the S3 output file location as input parameters in my workflow. Workflow: AWS Step Functions calls a Lambda function, and the Lambda function calls the Glue job. I want to pass the parameters…
1
vote
1 answer

Why doesn't AWS Glue generate spark event logs

I have an AWS glue job with Spark UI enabled by following this instruction: Enabling the Spark UI for Jobs The glue job has s3:* access to arn:aws:s3:::my-spark-event-bucket/* resource. But for some reason, when I run the glue job (and it…