Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
A distributed data processing framework that extends PySpark with functionality for increased schema flexibility
Code generation tools to template and bootstrap data processing scripts
Scheduling for crawlers and data processing scripts
Serverless development and execution of scripts in an Apache Spark (2.x) environment

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

Amazon Redshift Spectrum
EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
Amazon Athena
AWS Glue scripts

4003 questions

1 answer

format AWS glue spark dataframe output

I am trying to print my DataFrame on the log: datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "sellout_data_cw01_15_csv", transformation_ctx = "datasource0") .... dataframe =…

asked May 03 '21 at 15:15

x89

1 answer

Py4JJavaError: An error occurred while calling o67.getDynamicFrame. java.lang.reflect.InvocationTargetException

While working on a nested JSON file using DynamicFrame for struct-type data, when I run the job I get this error: Py4JJavaError: An error occurred while calling o67.getDynamicFrame. java.lang.reflect.InvocationTargetException. Let me know where…

aws-glue

asked May 03 '21 at 14:51

Parag Shahade

1 answer

AWS Glue equivalent of EMRFS role mappings

In EMR, you can specify that operations that read/write S3 will assume a different IAM role depending on the S3 URL through EMRFS role mappings. This is especially useful for a use case of reading from a bucket in the same account, and writing to a…

amazon-web-services aws-glue

asked May 01 '21 at 18:06

wrschneider

0 answers

Script location tab is empty in Glue Job

I'm trying to run a CloudFormation template that creates a Glue job resource. The Glue job picks up the Python script from an S3 bucket in the Ohio region using ScriptLocation. The template and the Python scripts work perfectly fine in Ohio…

amazon-s3 aws-cloudformation aws-glue

asked May 01 '21 at 07:13

Abhishek Hc

1 answer

When i try to run Glue job using python to Relationalize array and struct data

When I try to run a Glue job using Python to Relationalize array and struct data, I get the error below: INFO Log4j appears to be running in a Servlet environment, but there's no log4j-web module available. If you want better web container support,…

aws-glue

asked Apr 29 '21 at 09:33

Parag Shahade

1 answer

Is there a way to use Apache Hudi on AWS glue?

I am trying to explore Apache Hudi for incremental loads, using S3 as a source and then saving the output to a different location in S3 through an AWS Glue job. Are there any blogs or articles that can help as a starting point?

apache-spark amazon-s3 aws-glue apache-hudi

asked Apr 28 '21 at 10:32

shikeb

0 answers

FileAlreadyExistsException: File already exists:s3

I have an AWS Glue job (PySpark) that needs to load data from a centralized data lake of size 350GB+, prepare it, and load it into an S3 bucket partitioned by two columns (date and geohash). Mind you that this is PROD data and the environment is PROD.…

amazon-web-services apache-spark amazon-s3 pyspark aws-glue

asked Apr 26 '21 at 13:53

sparkcoderlv

1 answer

How can I correct AWS Glue Crawler/Data Catalog inferring all fields in CSV as strings when they're clearly not?

I have a big CSV text file uploaded weekly to an S3 path partitioned by upload date (maybe not important). The schema of these files is always the same, the formatting is the same, and the naming conventions are the same. Each file contains ~100…

amazon-web-services amazon-s3 aws-glue amazon-athena

asked Apr 23 '21 at 20:00

rabbittas2739

0 answers

Glue Crawler creating a separate table for each csv file

I am copying CSV files from a source S3 bucket to a destination S3 bucket. Once the files are in the destination bucket, we have a Glue crawler that populates the data in Redshift. For some reason the crawler is creating a table for each CSV file. I…

python amazon-web-services amazon-s3 boto3 aws-glue

asked Apr 23 '21 at 15:08

anaz8

0 answers

Glue ETL job- Reading data from onpremise database- using catalog connection

I have a Glue ETL job that writes data to an on-premises PostgreSQL database. I'm unable to find an effective option within the Glue methods to read data from the same database using the JDBC connection. Below is the existing approach: Reads data from…

python-3.x amazon-web-services pyspark aws-glue aws-glue-spark

asked Apr 23 '21 at 14:31

srikanth A

0 answers

How does the AWS Glue job parameter 'MaxConcurrentRuns' relate to the concurrent executions of the StepFunction

I have a couple of Step Functions state machines that I want to run concurrently. Each of them will link to a Glue job. Currently the MaxConcurrentRuns parameter for the Glue job is set to the default value of 1; I understand this means we can only have…

amazon-web-services aws-glue state-machine aws-step-functions

asked Apr 23 '21 at 13:20

wawawa

0 answers

Using AWS glue schema registry with custom SerDe clients

To support a schema registry on my MSK topic, I found two options: AWS Glue Schema Registry and Confluent Schema Registry. Since Glue SR is fully managed by AWS, I would prefer to use that. However, my producer and consumer clients are written…

amazon-web-services aws-glue confluent-schema-registry

asked Apr 20 '21 at 17:30

user3082928

0 answers

How to stop AWS glue from creating 1 individual table instead of multiple tables

I have the following folder structure in S3: Data table1/output/table1.csv table2/output/table2.csv table3/output/table3.csv. My ideal goal is to have a Glue Crawler create 3 respective tables. Instead, what is created is 1 table called…

amazon-web-services aws-glue aws-glue-data-catalog

asked Apr 19 '21 at 15:49

Sakibul Alam

2 answers

How to load files to s3 from code commit using Cloudformation

I have a Glue job that takes its Glue script from an S3 bucket. The Glue script will be pushed to CodeCommit, and from CodeCommit it will be automatically copied to S3. My question is: How to load files to S3 from CodeCommit using…

amazon-web-services amazon-s3 aws-cloudformation aws-glue

asked Apr 19 '21 at 10:40

GIRIJA

2 answers

Spark wrongly casting integers as `struct`

In a Spark job, I am using .withColumn("year", year(to_timestamp(lit(col("timestamp"))))). This code used to work, but now I get the error: "cannot resolve 'CAST(`timestamp` AS TIMESTAMP)' due to data type mismatch: cannot cast…

apache-spark pyspark aws-glue

asked Apr 15 '21 at 09:08

Hugo
