Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework which extends PySpark with functionality for increased schema flexibility
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment

Data registered in the AWS Glue Data Catalog is available to many AWS Services, including

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
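The crawler component above boils down to schema inference over semi-structured data. As a toy illustration only (this is not Glue's actual classifier code, which also handles format detection, partitions, and schema merging), the core idea can be sketched in plain Python:

```python
# Toy sketch of crawler-style schema inference: scan sample records and
# build a column -> catalog-type mapping, widening on type conflicts.

def infer_type(value):
    """Map a Python value to a simple catalog type name."""
    if isinstance(value, bool):  # bool before int: True is also an int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"

def infer_schema(records):
    """Infer a schema from a list of dicts; conflicting types widen to string."""
    schema = {}
    for record in records:
        for column, value in record.items():
            inferred = infer_type(value)
            if schema.get(column, inferred) != inferred:
                inferred = "string"  # type conflict: fall back to string
            schema[column] = inferred
    return schema

sample = [
    {"id": 1, "price": 9.99, "active": True},
    {"id": 2, "price": 12.5, "active": False, "note": "clearance"},
]
print(infer_schema(sample))
# {'id': 'bigint', 'price': 'double', 'active': 'boolean', 'note': 'string'}
```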
4003 questions
10 votes · 2 answers

How to monitor and control DPU usage in AWS Glue Crawlers

In the docs it's said that AWS allocates 10 DPUs per ETL job and 5 DPUs per development endpoint by default, even though both can be configured with a minimum of 2 DPUs. It's also mentioned that crawling is priced in second increments…
villasv • 6,304
10 votes · 2 answers

Glue crawler exclude patterns

I have an S3 bucket that I'm trying to crawl and catalog. The format is something like this, where the SQL files are DDL queries (CREATE TABLE statements) that match the schema of the different data files (i.e. data1, data2, etc.)…
Kirk Broadhurst • 27,836
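The exclude patterns asked about above are essentially globs applied to object keys before cataloging. A rough Python approximation of the idea (fnmatch here is a stand-in; Glue's own exclude-pattern syntax, e.g. `**.sql`, is documented separately):

```python
# Toy illustration of crawler exclude patterns: drop object keys that match
# any glob before they are considered for cataloging.
from fnmatch import fnmatch

def filter_keys(keys, exclude_patterns):
    """Return only the keys that match none of the exclude patterns."""
    return [k for k in keys if not any(fnmatch(k, p) for p in exclude_patterns)]

keys = ["data/data1.csv", "data/data2.csv", "ddl/create_table.sql"]
print(filter_keys(keys, ["*.sql"]))
# ['data/data1.csv', 'data/data2.csv']
```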
10 votes · 1 answer

Use external table redshift spectrum defined in glue data catalog

I have a table defined in the Glue data catalog that I can query using Athena. As there is some data in the table that I want to use with other Redshift tables, can I access the table defined in the Glue data catalog? What will be the create external table…
10 votes · 5 answers

AWS Glue does not detect partitions and creates 1000+ tables in catalog

I am using AWS Glue to create metadata tables. AWS Glue Crawler data store path: s3://bucket-name/. The bucket structure in S3 is like:

    bucket-name
    ├── pt=2011-10-11-01
    │   ├── file1
    │   ├── file2
    …
iammehrabalam • 1,285
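For context on the partition problem above: crawlers recognize partitions from Hive-style `key=value` path segments, and path layouts that deviate from this convention are one common cause of the many-tables symptom. A toy parser showing the convention (an illustration, not Glue's actual logic):

```python
# Sketch: extract Hive-style partition key/value pairs from an S3 object key.
# A segment without "key=" (e.g. "2011-10-11-01" instead of
# "pt=2011-10-11-01") carries no partition key for a crawler to group on.

def parse_partitions(key):
    """Return partition name -> value for key=value path segments."""
    parts = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, _, value = segment.partition("=")
            parts[name] = value
    return parts

print(parse_partitions("bucket-name/pt=2011-10-11-01/file1"))
# {'pt': '2011-10-11-01'}
```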
10 votes · 3 answers

AWS Glue triggers are not working

I have tried to run an AWS Glue trigger with proper values but it does not run the job on which we have set up the trigger. For instance, I have Job1 and Job2. On completion of Job1 I want to run Job2. Job1 passes but it is unable to…
pavan yadav • 159
10 votes · 3 answers

AWS Glue ETL job from AWS Redshift to S3 fails

I am trying out the AWS Glue service to ETL some data from Redshift to S3. The crawler runs successfully and creates the meta table in the data catalog; however, when I run the ETL job (generated by AWS) it fails after around 20 minutes saying "Resource…
9 votes · 1 answer

How to view AWS Glue Spark UI

In my Glue job, I have enabled Spark UI and specified all the necessary details (s3 related etc.) needed for Spark UI to work. How can I view the DAG/Spark UI of my Glue job?
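For context on the question above: enabling the Spark UI for a Glue job is driven by two special job parameters (the log path below is a placeholder), after which the persisted event logs are viewed with a separately hosted Spark history server:

```
--enable-spark-ui true
--spark-event-logs-path s3://your-bucket/spark-logs/
```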
9 votes · 1 answer

How to Trigger Glue ETL Pyspark job through S3 Events or AWS Lambda?

I'm planning to write certain jobs in AWS Glue ETL using Pyspark, which I want to get triggered as and when a new file is dropped in an AWS S3 Location, just like we do for triggering AWS Lambda Functions using S3 Events. But, I see very narrowed…
Aakash Basu • 1,689
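Glue has no native S3-event trigger, so a common pattern for the question above is S3 event → Lambda → `glue.start_job_run`. A minimal sketch (the job name and argument keys are placeholders, not a definitive implementation):

```python
# Sketch: Lambda handler that starts a Glue job when an S3 object lands.

def job_args_from_event(event):
    """Pull bucket/key out of a standard S3 event record (pure helper)."""
    record = event["Records"][0]["s3"]
    return {
        "--source_bucket": record["bucket"]["name"],
        "--source_key": record["object"]["key"],
    }

def handler(event, context):
    import boto3  # imported here so the helper above stays dependency-free
    glue = boto3.client("glue")
    # Forward the triggering object's location as Glue job arguments.
    return glue.start_job_run(
        JobName="my-etl-job",  # placeholder job name
        Arguments=job_args_from_event(event),
    )
```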
9 votes · 3 answers

Specify a SerDe serialization lib with AWS Glue Crawler

Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify correctly (e.g. for quoted fields containing commas). I then need to manually edit the table details in the Glue Catalog…
9 votes · 0 answers

AWS Glue Job: SchemaColumnConvertNotSupportedException when trying to write parquet file to S3

I have a table in the AWS Glue catalog that has datatypes of all strings and the files are stored as parquet files in S3. I want to create a Glue job that will simply read the data in from that catalog, partition the files by the date, then write…
Joey Donovan • 105
9 votes · 2 answers

I have an error "java.io.FileNotFoundException: No such file or directory" while trying to create a dynamic frame using a notebook in AWS Glue

I'm setting up a new Jupyter Notebook in AWS Glue as a dev endpoint in order to test out some code for running an ETL script. So far I created a basic ETL script using AWS Glue but, for some reason, when trying to run the code on the Jupyter…
hgpestana • 409
9 votes · 1 answer

Why are new columns added to parquet tables not available from glue pyspark ETL jobs?

We've been exploring using Glue to transform some JSON data to parquet. One scenario we tried was adding a column to the parquet table. So partition 1 has columns [A] and partition 2 has columns [A,B]. Then we wanted to write further Glue ETL jobs…
roby • 3,103
9 votes · 2 answers

AWS Glue: ETL to read S3 CSV files

I want to use ETL to read data from S3, since with ETL jobs I can set the DPU to hopefully speed things up. But how do I do it? I tried:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context…
Jiew Meng • 84,767
9 votes · 1 answer

Can I use Athena View as a source for a AWS Glue Job?

I'm trying to use an Athena view as a data source for my AWS Glue job. The error message I'm getting while trying to run the Glue job is about the classification of the view. What can I define it as? Thank you.
9 votes · 2 answers

AWS Glue predicate push down condition has no effect

I have a MySQL source from which I am creating a Glue DynamicFrame with a predicate push down condition, as follows:

    datasource = glueContext.create_dynamic_frame_from_catalog(
        database = source_catalog_db,
        table_name = source_catalog_tbl,
        …
Anas Ismail • 93
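One point worth noting for the question above: `push_down_predicate` prunes *catalog partitions* of partitioned, S3-backed tables; against a JDBC source such as MySQL it is generally ignored, which is one reason the option can appear to have no effect. A hedged sketch (database/table names are placeholders):

```python
# Pure helper: build a predicate string over Hive-style partition columns.
# Assumes hypothetical partition columns year/month/day for illustration.

def partition_predicate(year, month, day):
    """Return a partition-pruning expression for push_down_predicate."""
    return f"year='{year}' and month='{month:02d}' and day='{day:02d}'"

print(partition_predicate(2020, 3, 7))
# year='2020' and month='03' and day='07'

# In a Glue job against a partitioned S3 table it would be passed as:
# datasource = glueContext.create_dynamic_frame_from_catalog(
#     database="source_catalog_db",      # placeholder
#     table_name="source_catalog_tbl",   # placeholder
#     push_down_predicate=partition_predicate(2020, 3, 7),
# )
```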