Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework which extends PySpark with functionality for increased schema flexibility
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment

Data registered in the AWS Glue Data Catalog is available to many AWS Services, including

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
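The crawler component above boils down to schema inference over semi-structured data. As a toy illustration only (this is not Glue's actual classifier code, which also handles format detection, partitions, and schema merging), the core idea can be sketched in plain Python:

```python
# Toy sketch of crawler-style schema inference: scan sample records and
# build a column -> catalog-type mapping, widening on type conflicts.

def infer_type(value):
    """Map a Python value to a simple catalog type name."""
    if isinstance(value, bool):  # bool before int: True is also an int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"

def infer_schema(records):
    """Infer a schema from a list of dicts; conflicting types widen to string."""
    schema = {}
    for record in records:
        for column, value in record.items():
            inferred = infer_type(value)
            if schema.get(column, inferred) != inferred:
                inferred = "string"  # type conflict: fall back to string
            schema[column] = inferred
    return schema

sample = [
    {"id": 1, "price": 9.99, "active": True},
    {"id": 2, "price": 12.5, "active": False, "note": "clearance"},
]
print(infer_schema(sample))
# {'id': 'bigint', 'price': 'double', 'active': 'boolean', 'note': 'string'}
```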
4003 questions
10 votes · 2 answers

How to monitor and control DPU usage in AWS Glue Crawlers

In the docs it's said that AWS allocates 10 DPUs per ETL job and 5 DPUs per development endpoint by default, even though both can be configured with a minimum of 2 DPUs. It's also mentioned that crawling is priced in second increments…
villasv • 6,304
10 votes · 2 answers

Glue crawler exclude patterns

I have an S3 bucket that I'm trying to crawl and catalog. The format is something like this, where the SQL files are DDL queries (CREATE TABLE statements) that match the schema of the different data files (i.e. data1, data2, etc.)…
Kirk Broadhurst • 27,836
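The exclude patterns asked about above are essentially globs applied to object keys before cataloging. A rough Python approximation of the idea (fnmatch here is a stand-in; Glue's own exclude-pattern syntax, e.g. `**.sql`, is documented separately):

```python
# Toy illustration of crawler exclude patterns: drop object keys that match
# any glob before they are considered for cataloging.
from fnmatch import fnmatch

def filter_keys(keys, exclude_patterns):
    """Return only the keys that match none of the exclude patterns."""
    return [k for k in keys if not any(fnmatch(k, p) for p in exclude_patterns)]

keys = ["data/data1.csv", "data/data2.csv", "ddl/create_table.sql"]
print(filter_keys(keys, ["*.sql"]))
# ['data/data1.csv', 'data/data2.csv']
```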
10 votes · 1 answer

Use external table redshift spectrum defined in glue data catalog

I have a table defined in the Glue data catalog that I can query using Athena. As there is some data in the table that I want to use with other Redshift tables, can I access the table defined in the Glue data catalog? What will be the create external table…
10 votes · 5 answers

AWS Glue does not detect partitions and creates 1000+ tables in catalog

I am using AWS Glue to create metadata tables. AWS Glue Crawler data store path: s3://bucket-name/. The bucket structure in S3 is like:

    bucket-name
    ├── pt=2011-10-11-01
    │   ├── file1
    │   ├── file2
    …
iammehrabalam • 1,285
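For context on the partition problem above: crawlers recognize partitions from Hive-style `key=value` path segments, and path layouts that deviate from this convention are one common cause of the many-tables symptom. A toy parser showing the convention (an illustration, not Glue's actual logic):

```python
# Sketch: extract Hive-style partition key/value pairs from an S3 object key.
# A segment without "key=" (e.g. "2011-10-11-01" instead of
# "pt=2011-10-11-01") carries no partition key for a crawler to group on.

def parse_partitions(key):
    """Return partition name -> value for key=value path segments."""
    parts = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, _, value = segment.partition("=")
            parts[name] = value
    return parts

print(parse_partitions("bucket-name/pt=2011-10-11-01/file1"))
# {'pt': '2011-10-11-01'}
```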
10 votes · 3 answers

AWS Glue triggers are not working

I have tried to run an AWS Glue trigger with proper values but it does not run the job on which we have set up the trigger. For instance, I have Job1 and Job2. On completion of Job1 I want to run Job2. Job1 passes but it is unable to…
pavan yadav • 159
10 votes · 3 answers

AWS Glue ETL job from AWS Redshift to S3 fails

I am trying out the AWS Glue service to ETL some data from Redshift to S3. The crawler runs successfully and creates the meta table in the data catalog; however, when I run the ETL job (generated by AWS) it fails after around 20 minutes saying "Resource…
9 votes · 1 answer

How to view AWS Glue Spark UI

In my Glue job, I have enabled Spark UI and specified all the necessary details (s3 related etc.) needed for Spark UI to work. How can I view the DAG/Spark UI of my Glue job?
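For context on the question above: enabling the Spark UI for a Glue job is driven by two special job parameters (the log path below is a placeholder), after which the persisted event logs are viewed with a separately hosted Spark history server:

```
--enable-spark-ui true
--spark-event-logs-path s3://your-bucket/spark-logs/
```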
9 votes · 1 answer

How to Trigger Glue ETL Pyspark job through S3 Events or AWS Lambda?

I'm planning to write certain jobs in AWS Glue ETL using Pyspark, which I want to get triggered as and when a new file is dropped in an AWS S3 Location, just like we do for triggering AWS Lambda Functions using S3 Events. But, I see very narrowed…
Aakash Basu • 1,689
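Glue has no native S3-event trigger, so a common pattern for the question above is S3 event → Lambda → `glue.start_job_run`. A minimal sketch (the job name and argument keys are placeholders, not a definitive implementation):

```python
# Sketch: Lambda handler that starts a Glue job when an S3 object lands.

def job_args_from_event(event):
    """Pull bucket/key out of a standard S3 event record (pure helper)."""
    record = event["Records"][0]["s3"]
    return {
        "--source_bucket": record["bucket"]["name"],
        "--source_key": record["object"]["key"],
    }

def handler(event, context):
    import boto3  # imported here so the helper above stays dependency-free
    glue = boto3.client("glue")
    # Forward the triggering object's location as Glue job arguments.
    return glue.start_job_run(
        JobName="my-etl-job",  # placeholder job name
        Arguments=job_args_from_event(event),
    )
```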
9 votes · 3 answers

Specify a SerDe serialization lib with AWS Glue Crawler

Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify correctly (e.g. for quoted fields containing commas). I then need to manually edit the table details in the Glue Catalog…
9 votes · 0 answers

AWS Glue Job: SchemaColumnConvertNotSupportedException when trying to write parquet file to S3

I have a table in the AWS Glue catalog that has datatypes of all strings and the files are stored as parquet files in S3. I want to create a Glue job that will simply read the data in from that catalog, partition the files by the date, then write…
Joey Donovan • 105
9 votes · 2 answers

I have an error "java.io.FileNotFoundException: No such file or directory" while trying to create a dynamic frame using a notebook in AWS Glue

I'm setting up a new Jupyter Notebook in AWS Glue as a dev endpoint in order to test out some code for running an ETL script. So far I created a basic ETL script using AWS Glue but, for some reason, when trying to run the code on the Jupyter…
hgpestana • 409
9 votes · 1 answer

Why are new columns added to parquet tables not available from glue pyspark ETL jobs?

We've been exploring using Glue to transform some JSON data to parquet. One scenario we tried was adding a column to the parquet table. So partition 1 has columns [A] and partition 2 has columns [A,B]. Then we wanted to write further Glue ETL jobs…
roby • 3,103
9 votes · 2 answers

AWS Glue: ETL to read S3 CSV files

I want to use ETL to read data from S3, since with ETL jobs I can set the DPU to hopefully speed things up. But how do I do it? I tried:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context…
Jiew Meng • 84,767
9 votes · 1 answer

Can I use Athena View as a source for a AWS Glue Job?

I'm trying to use an Athena view as a data source for my AWS Glue job. The error message I'm getting while trying to run the Glue job is about the classification of the view. What can I define it as? Thank you.
9 votes · 2 answers

AWS Glue predicate push down condition has no effect

I have a MySQL source from which I am creating a Glue DynamicFrame with a predicate push down condition, as follows:

    datasource = glueContext.create_dynamic_frame_from_catalog(
        database = source_catalog_db,
        table_name = source_catalog_tbl,
        …
Anas Ismail • 93
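One point worth noting for the question above: `push_down_predicate` prunes *catalog partitions* of partitioned, S3-backed tables; against a JDBC source such as MySQL it is generally ignored, which is one reason the option can appear to have no effect. A hedged sketch (database/table names are placeholders):

```python
# Pure helper: build a predicate string over Hive-style partition columns.
# Assumes hypothetical partition columns year/month/day for illustration.

def partition_predicate(year, month, day):
    """Return a partition-pruning expression for push_down_predicate."""
    return f"year='{year}' and month='{month:02d}' and day='{day:02d}'"

print(partition_predicate(2020, 3, 7))
# year='2020' and month='03' and day='07'

# In a Glue job against a partitioned S3 table it would be passed as:
# datasource = glueContext.create_dynamic_frame_from_catalog(
#     database="source_catalog_db",      # placeholder
#     table_name="source_catalog_tbl",   # placeholder
#     push_down_predicate=partition_predicate(2020, 3, 7),
# )
```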