Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework which extends PySpark with functionality for increased schema flexibility (a minimal job sketch follows this list)
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment
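For orientation, here is a minimal sketch of the boilerplate a Glue ETL job typically starts from; the database and table names (my_db, my_table) are hypothetical placeholders:

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    # Resolve the job name passed in by the Glue job runner
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Read a table registered in the Data Catalog as a DynamicFrame
    # (my_db / my_table are placeholder names)
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db", table_name="my_table")

    job.commit()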

Data registered in the AWS Glue Data Catalog is available to many AWS Services, including

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
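All of these services read the same table definitions. As a sketch, the catalog can also be inspected directly with boto3; the database and table names below are placeholders:

    import boto3

    glue = boto3.client('glue')

    # Fetch the schema for a catalog table (placeholder names)
    table = glue.get_table(DatabaseName='my_db', Name='my_table')
    for col in table['Table']['StorageDescriptor']['Columns']:
        print(col['Name'], col['Type'])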
4003 questions
1 vote, 1 answer

Writing a DataFrame in XML format, compressed as ZIP, to a target store

I am trying to write an AWS Glue ETL script that writes a DataFrame in XML format, compressed as ZIP, to be loaded into an S3 folder. I have been able to write the code for JSON, Parquet, and ORC but am unable to find anything for XML. The main error was: DataFrameWriter…
Arijit Roy
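A sketch of one possible approach, assuming the spark-xml package (com.databricks.spark.xml) is attached to the job; spark-xml exposes a compression option for Hadoop codecs such as gzip, so ZIP itself would likely need a post-processing step:

    # Assumes the spark-xml JAR is supplied via the job's dependent JARs
    # path; df is an existing Spark DataFrame. Tag names and the output
    # path are placeholders.
    (df.write
       .format("com.databricks.spark.xml")
       .option("rootTag", "records")   # hypothetical root element name
       .option("rowTag", "record")     # hypothetical row element name
       .option("compression", "gzip")  # Hadoop codec; ZIP is not a standard codec
       .save("s3://my-bucket/xml-output/"))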
1 vote, 0 answers

AWS Glue delete batch partitions

The batch delete partition API is a little tricky to use, since we have to iterate over 25 partitions at a time. Need help with the same.
Jash
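BatchDeletePartition accepts at most 25 partitions per call, so the usual pattern is to chunk the list. A minimal sketch with placeholder database, table, and partition values:

    import boto3

    glue = boto3.client('glue')

    # Partition value lists to delete (placeholder values)
    partitions = [{'Values': [f'2021-01-{d:02d}']} for d in range(1, 31)]

    # BatchDeletePartition accepts at most 25 partitions per request
    for i in range(0, len(partitions), 25):
        glue.batch_delete_partition(
            DatabaseName='my_db',
            TableName='my_table',
            PartitionsToDelete=partitions[i:i + 25])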
1 vote, 0 answers

How to Throttle AWS Glue/PySpark Writes to Elasticsearch

I am using the following code to write a PySpark DataFrame to Elasticsearch via AWS Glue: df.write.format("org.elasticsearch.spark.sql").\ mode("overwrite").\ option("es.resource", "{}/_doc".format(es_index_name)).\ option("es.nodes",…
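One way to throttle the elasticsearch-hadoop connector is through its bulk-write settings and by reducing write parallelism; a sketch, with the host, index, and tuning values chosen arbitrarily here:

    # Fewer partitions means fewer concurrent bulk writers
    throttled = df.coalesce(4)  # partition count is an arbitrary example

    (throttled.write
        .format("org.elasticsearch.spark.sql")
        .mode("overwrite")
        .option("es.nodes", "my-es-host")            # placeholder host
        .option("es.resource", "my_index/_doc")      # placeholder index
        .option("es.batch.size.entries", "500")      # docs per bulk request
        .option("es.batch.size.bytes", "1mb")        # bytes per bulk request
        .option("es.batch.write.retry.wait", "30s")  # back off on rejections
        .save())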
1 vote, 0 answers

AWS Glue ETL MongoDB Connection String Error

Issue using MongoDB with AWS Glue - I've created a connection to the database (using the MongoDB connection option) and run a crawler against it, and it all worked fine, but when I try to use this as a data source in a basic ETL job (script- Glue…
user598241
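For reference, Glue can also read MongoDB directly with connection options rather than a catalog connection; a sketch in which the URI, database, collection, and credentials are all placeholders:

    # glueContext is assumed to be an existing GlueContext;
    # all connection values below are placeholders.
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options={
            "uri": "mongodb://my-host:27017",
            "database": "my_db",
            "collection": "my_collection",
            "username": "my_user",
            "password": "my_password",
        })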
1 vote, 1 answer

AWS Glue job failed with error 'ERROR Client: Application diagnostics message: User application exited with status 1'

I've recently been using an AWS Glue job to test running some Spark Python code. I kicked off a run yesterday and it succeeded; this morning, without any changes, I kicked it off three times and it failed every time. The logs are weird and I don't understand them: This…
wawawa
1 vote, 0 answers

AWS Crawler stuck in STOPPING state

I have a crawler that was working fine defining the schema of a Parquet file in S3, but I came across a CrawlerRunningException when I ran it again. I checked its status and it has been stuck in STOPPING for some reason I don't get. I can't even…
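As a diagnostic sketch, the crawler's state can be polled with boto3 while waiting for it to leave STOPPING; the crawler name is a placeholder:

    import time
    import boto3

    glue = boto3.client('glue')

    # Poll the crawler state until it returns to READY (placeholder name)
    while True:
        state = glue.get_crawler(Name='my_crawler')['Crawler']['State']
        print(state)
        if state == 'READY':
            break
        time.sleep(30)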
1 vote, 1 answer

Passing IBucket to bucket property in Table props results in missing property error

s3Bucket = s3.Bucket.fromBucketName(this, bucketName, bucketName); let glueTable = new glue.Table(this, tableName, { database: glueDb, tableName: tableName, bucket: s3Bucket }) The IDE throws this…
1 vote, 1 answer

AWS Glue - DynamicFrame with varying schema in json files

Sample: I have a partitioned table with DDL below in Glue catalog: CREATE EXTERNAL TABLE `test`( `id` int, `data` struct) PARTITIONED BY ( `partition_0` string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'…
Leonid
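When JSON files under one table vary in schema, one common approach is to read a DynamicFrame and resolve the ambiguous column explicitly; a sketch, where the database name is a placeholder and casting the struct to a string is just one example choice:

    # glueContext is assumed to be an existing GlueContext
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db",    # placeholder database name
        table_name="test")

    # Collapse conflicting types for the 'data' column; casting the
    # whole struct to a JSON string is one example resolution
    resolved = dyf.resolveChoice(specs=[("data", "cast:string")])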
1 vote, 2 answers

AWS Glue - Replacing field names containing "." with "_"

I am trying to replace all the fields which have "." within the field name with "_". This is what I have: def apply_renaming_mapping(df): """Given a dynamic frame, if a field name contains ., replace it with _""" # construct renaming mapping…
molly_567
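A sketch of one way to do this on the underlying Spark DataFrame, assuming df is a DynamicFrame; toDF()/fromDF() round-trips through Spark:

    from awsglue.dynamicframe import DynamicFrame

    # Convert to a Spark DataFrame, rename top-level columns, convert back.
    # glueContext is assumed to be an existing GlueContext.
    spark_df = df.toDF()
    renamed = spark_df.toDF(*[c.replace(".", "_") for c in spark_df.columns])
    result = DynamicFrame.fromDF(renamed, glueContext, "renamed")

Note that this only renames top-level columns; fields nested inside structs would need a schema rewrite instead.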
1 vote, 1 answer

How to get the last accessed partitions from AWS Glue using Boto3

I'm using the below function to get all partitions from an AWS Glue catalog table. There are some tables in the database that have more than 50K partitions. Is it possible to get only the partitions based on the 'LastAccessTime' attribute? I know I can…
Lisa Mathew
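GetPartitions' Expression filter works on partition keys rather than metadata, so one sketch is to paginate and filter client-side on LastAccessTime; the names and cutoff date below are placeholders:

    from datetime import datetime, timezone
    import boto3

    glue = boto3.client('glue')
    cutoff = datetime(2021, 1, 1, tzinfo=timezone.utc)  # example cutoff

    recent = []
    paginator = glue.get_paginator('get_partitions')
    for page in paginator.paginate(DatabaseName='my_db', TableName='my_table'):
        for p in page['Partitions']:
            # LastAccessTime may be absent on some partitions
            if p.get('LastAccessTime') and p['LastAccessTime'] >= cutoff:
                recent.append(p['Values'])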
1 vote, 1 answer

Error when trying to copy to AWS Glue tmp folder in Python shell

I'm trying to copy some files over to the tmp folder using boto3 in a Glue job. Here's my code: import pandas as pd import numpy as np import boto3 bucketname = "" s3 = boto3.resource('s3') my_bucket = s3.Bucket(bucketname) print('line…
Ravmcgav
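In a Python shell job, /tmp is the writable local scratch space; a sketch of copying an object there with boto3, where the bucket and key are placeholders:

    import boto3

    s3 = boto3.resource('s3')

    # Download an object to local scratch space (placeholder bucket/key)
    s3.Bucket('my-bucket').download_file('path/to/input.csv', '/tmp/input.csv')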
1 vote, 0 answers

Missing executor CloudWatch logs for AWS Glue version 2.0 ETL job

I am running a glueetl (Glue version 2.0) job using Python with the below configuration for logging. I have continuous logging enabled. I get INFO entries from the driver in the /aws-glue/jobs/output log group; however, there are no INFO entries…
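For reference, continuous logging is controlled by job parameters; a sketch of setting them when creating a job with boto3, where the job name, role, and script path are placeholders:

    import boto3

    glue = boto3.client('glue')

    glue.create_job(
        Name='my_job',         # placeholder job name
        Role='my-glue-role',   # placeholder IAM role
        GlueVersion='2.0',
        Command={'Name': 'glueetl',
                 'ScriptLocation': 's3://my-bucket/script.py'},
        DefaultArguments={
            '--enable-continuous-cloudwatch-log': 'true',
            '--enable-continuous-log-filter': 'true',  # drop non-useful messages
        })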
1 vote, 2 answers

AWS Glue ETL Spark - string to timestamp

I am trying to convert my CSVs to Parquet via an AWS Glue ETL job. At the same time, I want to convert my datetime column (string) to a timestamp format that Athena can recognize (Athena recognizes yyyy-MM-dd HH:mm:ss). I skimmed and applied…
Omur
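A sketch of the conversion on the Spark DataFrame side, assuming a string column named dt (a hypothetical column name):

    from pyspark.sql.functions import to_timestamp

    # Parse the string column into a timestamp Athena can recognize
    # ('dt' is a hypothetical column name)
    df = df.withColumn("dt", to_timestamp("dt", "yyyy-MM-dd HH:mm:ss"))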
1 vote, 2 answers

Athena Best Practice to store query result

I am creating a Data Lake and have some tables in the Glue Catalog that I need to query in Athena. As a prerequisite, Athena requires us to store the query results in an S3 bucket. I have "Temp" and "Logs" S3 buckets. But since this is client-sensitive…
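The results location can be set per query rather than relying on a shared bucket; a sketch with boto3, where the bucket, prefix, database, and query are placeholders:

    import boto3

    athena = boto3.client('athena')

    athena.start_query_execution(
        QueryString='SELECT * FROM my_table LIMIT 10',  # placeholder query
        QueryExecutionContext={'Database': 'my_db'},    # placeholder database
        ResultConfiguration={
            # A dedicated, access-controlled results bucket (placeholder)
            'OutputLocation': 's3://my-athena-results/prefix/'
        })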
1 vote, 2 answers

AWS Glue Outputting Empty Files on Sequential Runs

I am trying to automate an ETL pipeline that outputs data from AWS RDS MySQL to AWS S3. I am currently using AWS Glue to do the job. When I do an initial load from RDS to S3, it captures all the data in the file, which is exactly what I want.…
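Sequential runs producing empty output often come down to job bookmarks tracking already-processed data; a sketch of a bookmark-aware read, with placeholder names (the bookmark behavior itself is set on the job, e.g. --job-bookmark-option job-bookmark-enable):

    # With job bookmarks enabled on the job, transformation_ctx lets Glue
    # remember what this source has already processed between runs.
    # glueContext is assumed to be an existing GlueContext.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db",           # placeholder database name
        table_name="my_rds_table",  # placeholder table name
        transformation_ctx="datasource0")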