Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework which extends PySpark with functionality for increased schema flexibility (a minimal job sketch follows this list)
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment
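For orientation, here is a minimal sketch of the boilerplate a Glue ETL job typically starts from; the database and table names (my_db, my_table) are hypothetical placeholders:

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    # Resolve the job name passed in by the Glue job runner
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Read a table registered in the Data Catalog as a DynamicFrame
    # (my_db / my_table are placeholder names)
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db", table_name="my_table")

    job.commit()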

Data registered in the AWS Glue Data Catalog is available to many AWS Services, including

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
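All of these services read the same table definitions. As a sketch, the catalog can also be inspected directly with boto3; the database and table names below are placeholders:

    import boto3

    glue = boto3.client('glue')

    # Fetch the schema for a catalog table (placeholder names)
    table = glue.get_table(DatabaseName='my_db', Name='my_table')
    for col in table['Table']['StorageDescriptor']['Columns']:
        print(col['Name'], col['Type'])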
4003 questions
1 vote, 1 answer

Writing a DataFrame in XML format, compressed as ZIP, to a target store

I am trying to write an AWS Glue ETL script that writes a DataFrame in XML format, compressed as ZIP, to be loaded into an S3 folder. I have been able to write the code for JSON, Parquet, and ORC but am unable to find anything for XML. The main error was: DataFrameWriter…
Arijit Roy
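A sketch of one possible approach, assuming the spark-xml package (com.databricks.spark.xml) is attached to the job; spark-xml exposes a compression option for Hadoop codecs such as gzip, so ZIP itself would likely need a post-processing step:

    # Assumes the spark-xml JAR is supplied via the job's dependent JARs
    # path; df is an existing Spark DataFrame. Tag names and the output
    # path are placeholders.
    (df.write
       .format("com.databricks.spark.xml")
       .option("rootTag", "records")   # hypothetical root element name
       .option("rowTag", "record")     # hypothetical row element name
       .option("compression", "gzip")  # Hadoop codec; ZIP is not a standard codec
       .save("s3://my-bucket/xml-output/"))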
1 vote, 0 answers

AWS Glue delete batch partitions

The batch delete partition API is a little tricky to use, since we have to iterate over 25 partitions at a time. Need help with the same.
Jash
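BatchDeletePartition accepts at most 25 partitions per call, so the usual pattern is to chunk the list. A minimal sketch with placeholder database, table, and partition values:

    import boto3

    glue = boto3.client('glue')

    # Partition value lists to delete (placeholder values)
    partitions = [{'Values': [f'2021-01-{d:02d}']} for d in range(1, 31)]

    # BatchDeletePartition accepts at most 25 partitions per request
    for i in range(0, len(partitions), 25):
        glue.batch_delete_partition(
            DatabaseName='my_db',
            TableName='my_table',
            PartitionsToDelete=partitions[i:i + 25])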
1 vote, 0 answers

How to Throttle AWS Glue/PySpark Writes to Elasticsearch

I am using the following code to write a PySpark DataFrame to Elasticsearch via AWS Glue: df.write.format("org.elasticsearch.spark.sql").\ mode("overwrite").\ option("es.resource", "{}/_doc".format(es_index_name)).\ option("es.nodes",…
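One way to throttle the elasticsearch-hadoop connector is through its bulk-write settings and by reducing write parallelism; a sketch, with the host, index, and tuning values chosen arbitrarily here:

    # Fewer partitions means fewer concurrent bulk writers
    throttled = df.coalesce(4)  # partition count is an arbitrary example

    (throttled.write
        .format("org.elasticsearch.spark.sql")
        .mode("overwrite")
        .option("es.nodes", "my-es-host")            # placeholder host
        .option("es.resource", "my_index/_doc")      # placeholder index
        .option("es.batch.size.entries", "500")      # docs per bulk request
        .option("es.batch.size.bytes", "1mb")        # bytes per bulk request
        .option("es.batch.write.retry.wait", "30s")  # back off on rejections
        .save())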
1 vote, 0 answers

AWS Glue ETL MongoDB Connection String Error

Issue using MongoDB with AWS Glue - I've created a connection to the database (using the MongoDB connection option) and run a crawler against it, and it all worked fine, but when I try to use this as a data source in a basic ETL job (script- Glue…
user598241
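For reference, Glue can also read MongoDB directly with connection options rather than a catalog connection; a sketch in which the URI, database, collection, and credentials are all placeholders:

    # glueContext is assumed to be an existing GlueContext;
    # all connection values below are placeholders.
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options={
            "uri": "mongodb://my-host:27017",
            "database": "my_db",
            "collection": "my_collection",
            "username": "my_user",
            "password": "my_password",
        })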
1 vote, 1 answer

AWS Glue job failed with error 'ERROR Client: Application diagnostics message: User application exited with status 1'

I've recently been using an AWS Glue job to test running some Spark Python code. I kicked off a run yesterday and it succeeded; this morning, without any changes, I kicked it off three times and it failed every time. The logs are weird and I don't understand them: This…
wawawa
1 vote, 0 answers

AWS Crawler stuck in STOPPING state

I have a crawler that was working fine defining the schema of a Parquet file in S3, but I came across a CrawlerRunningException when I ran it again. I checked its status and it has been stuck in STOPPING for some reason I don't get. I can't even…
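As a diagnostic sketch, the crawler's state can be polled with boto3 while waiting for it to leave STOPPING; the crawler name is a placeholder:

    import time
    import boto3

    glue = boto3.client('glue')

    # Poll the crawler state until it returns to READY (placeholder name)
    while True:
        state = glue.get_crawler(Name='my_crawler')['Crawler']['State']
        print(state)
        if state == 'READY':
            break
        time.sleep(30)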
1 vote, 1 answer

Passing IBucket to bucket property in Table props results in missing property error

s3Bucket = s3.Bucket.fromBucketName(this, bucketName, bucketName); let glueTable = new glue.Table(this, tableName, { database: glueDb, tableName: tableName, bucket: s3Bucket }) The IDE throws this…
1 vote, 1 answer

AWS Glue - DynamicFrame with varying schema in json files

Sample: I have a partitioned table with DDL below in Glue catalog: CREATE EXTERNAL TABLE `test`( `id` int, `data` struct) PARTITIONED BY ( `partition_0` string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'…
Leonid
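When JSON files under one table vary in schema, one common approach is to read a DynamicFrame and resolve the ambiguous column explicitly; a sketch, where the database name is a placeholder and casting the struct to a string is just one example choice:

    # glueContext is assumed to be an existing GlueContext
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db",    # placeholder database name
        table_name="test")

    # Collapse conflicting types for the 'data' column; casting the
    # whole struct to a JSON string is one example resolution
    resolved = dyf.resolveChoice(specs=[("data", "cast:string")])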
1 vote, 2 answers

AWS Glue - Replacing field names containing "." with "_"

I am trying to replace all the fields which have "." within the field name with "_". This is what I have: def apply_renaming_mapping(df): """Given a dynamic frame, if a field name contains ., replace it with _""" # construct renaming mapping…
molly_567
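A sketch of one way to do this on the underlying Spark DataFrame, assuming df is a DynamicFrame; toDF()/fromDF() round-trips through Spark:

    from awsglue.dynamicframe import DynamicFrame

    # Convert to a Spark DataFrame, rename top-level columns, convert back.
    # glueContext is assumed to be an existing GlueContext.
    spark_df = df.toDF()
    renamed = spark_df.toDF(*[c.replace(".", "_") for c in spark_df.columns])
    result = DynamicFrame.fromDF(renamed, glueContext, "renamed")

Note that this only renames top-level columns; fields nested inside structs would need a schema rewrite instead.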
1 vote, 1 answer

How to get the last accessed partitions from AWS Glue using Boto3

I'm using the below function to get all partitions from an AWS Glue catalog table. There are some tables in the database that have more than 50K partitions. Is it possible to get only the partitions based on the 'LastAccessTime' attribute? I know I can…
Lisa Mathew
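GetPartitions' Expression filter works on partition keys rather than metadata, so one sketch is to paginate and filter client-side on LastAccessTime; the names and cutoff date below are placeholders:

    from datetime import datetime, timezone
    import boto3

    glue = boto3.client('glue')
    cutoff = datetime(2021, 1, 1, tzinfo=timezone.utc)  # example cutoff

    recent = []
    paginator = glue.get_paginator('get_partitions')
    for page in paginator.paginate(DatabaseName='my_db', TableName='my_table'):
        for p in page['Partitions']:
            # LastAccessTime may be absent on some partitions
            if p.get('LastAccessTime') and p['LastAccessTime'] >= cutoff:
                recent.append(p['Values'])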
1 vote, 1 answer

Error when trying to copy to AWS Glue tmp folder in Python shell

I'm trying to copy some files over to the tmp folder using boto3 in a Glue job. Here's my code: import pandas as pd import numpy as np import boto3 bucketname = "" s3 = boto3.resource('s3') my_bucket = s3.Bucket(bucketname) print('line…
Ravmcgav
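In a Python shell job, /tmp is the writable local scratch space; a sketch of copying an object there with boto3, where the bucket and key are placeholders:

    import boto3

    s3 = boto3.resource('s3')

    # Download an object to local scratch space (placeholder bucket/key)
    s3.Bucket('my-bucket').download_file('path/to/input.csv', '/tmp/input.csv')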
1 vote, 0 answers

Missing executor CloudWatch logs for AWS Glue version 2.0 ETL job

I am running a glueetl (Glue version 2.0) job using Python with the below configuration for logging. I have continuous logging enabled. I get INFO entries from the driver in the /aws-glue/jobs/output log group; however, there are no INFO entries…
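For reference, continuous logging is controlled by job parameters; a sketch of setting them when creating a job with boto3, where the job name, role, and script path are placeholders:

    import boto3

    glue = boto3.client('glue')

    glue.create_job(
        Name='my_job',         # placeholder job name
        Role='my-glue-role',   # placeholder IAM role
        GlueVersion='2.0',
        Command={'Name': 'glueetl',
                 'ScriptLocation': 's3://my-bucket/script.py'},
        DefaultArguments={
            '--enable-continuous-cloudwatch-log': 'true',
            '--enable-continuous-log-filter': 'true',  # drop non-useful messages
        })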
1 vote, 2 answers

AWS Glue ETL Spark - string to timestamp

I am trying to convert my CSVs to Parquet via an AWS Glue ETL job. At the same time, I want to convert my datetime column (string) to a timestamp format that Athena can recognize (Athena recognizes yyyy-MM-dd HH:mm:ss). I skimmed and applied…
Omur
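A sketch of the conversion on the Spark DataFrame side, assuming a string column named dt (a hypothetical column name):

    from pyspark.sql.functions import to_timestamp

    # Parse the string column into a timestamp Athena can recognize
    # ('dt' is a hypothetical column name)
    df = df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd HH:mm:ss"))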
1 vote, 2 answers

Athena Best Practice to store query result

I am creating a Data Lake and have some tables in the Glue Catalog that I need to query in Athena. As a prerequisite, Athena requires us to store the query results in an S3 bucket. I have "Temp" and "Logs" S3 buckets. But since this is client-sensitive…
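The results location can be set per query rather than relying on a shared bucket; a sketch with boto3, where the bucket, prefix, database, and query are placeholders:

    import boto3

    athena = boto3.client('athena')

    athena.start_query_execution(
        QueryString='SELECT * FROM my_table LIMIT 10',  # placeholder query
        QueryExecutionContext={'Database': 'my_db'},    # placeholder database
        ResultConfiguration={
            # A dedicated, access-controlled results bucket (placeholder)
            'OutputLocation': 's3://my-athena-results/prefix/'
        })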
1 vote, 2 answers

AWS Glue Outputting Empty Files on Sequential Runs

I am trying to automate an ETL pipeline that outputs data from AWS RDS MySQL to AWS S3. I am currently using AWS Glue to do the job. When I do an initial load from RDS to S3, it captures all the data in the file, which is exactly what I want.…
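Sequential runs producing empty output often come down to job bookmarks tracking already-processed data; a sketch of a bookmark-aware read, with placeholder names (the bookmark behavior itself is set on the job, e.g. --job-bookmark-option job-bookmark-enable):

    # With job bookmarks enabled on the job, transformation_ctx lets Glue
    # remember what this source has already processed between runs.
    # glueContext is assumed to be an existing GlueContext.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db",           # placeholder database name
        table_name="my_rds_table",  # placeholder table name
        transformation_ctx="datasource0")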