Questions tagged [aws-glue-spark]
244 questions
2
votes
1 answer
Non-Partitioned Table Schema not updated with Glue ETL Job
We have an ETL job that uses the code snippet below to update the catalog table:
sink = glueContext.getSink(connection_type='s3', path=config['glue_s3_path_bc'], enableUpdateCatalog=True,…

Krunal Patel
- 85
- 1
- 8
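A sketch of the documented getSink/enableUpdateCatalog pattern that the snippet above is using, for reference. The database and table names here are placeholders, and Glue 2.0+ is assumed:

```python
def write_with_catalog_update(glue_context, dyf, s3_path, database, table):
    """Write a DynamicFrame to S3 and update the Data Catalog schema in place.

    A hedged sketch of the enableUpdateCatalog pattern; names are illustrative.
    """
    sink = glue_context.getSink(
        connection_type="s3",
        path=s3_path,
        enableUpdateCatalog=True,
        updateBehavior="UPDATE_IN_DATABASE",
    )
    # setCatalogInfo must run before writeFrame, or the catalog is never touched.
    sink.setCatalogInfo(catalogDatabase=database, catalogTableName=table)
    sink.setFormat("glueparquet")
    return sink.writeFrame(dyf)
```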
2
votes
0 answers
Glue Secrets Manager integration: secretId is not provided
I am running the Glue PySpark script from my local machine using the GlueETL library.
When creating a DynamicFrame from the Glue catalog,
dyf_user_book_reading_stat = glueContext.create_dynamic_frame.from_catalog(
database="xxx-db",
…

sheetal_158
- 7,391
- 6
- 27
- 44
2
votes
1 answer
AWS Glue - Job Monitoring: Job Execution, Active Executors and Maximum Needed Executors not showing
I have set up an ETL job in AWS Glue with the following settings:
Glue 3.0, Python 3, Spark 3.1, and worker type G.1X with 10 workers and job metrics enabled.
When I'm looking at the job metrics after the job is finished, I see in the Job…

Qwaz
- 199
- 9
2
votes
1 answer
Unsupported case of DataType: com.amazonaws.services.glue.schema.types.StringType@e7b95c9 and DynamicNode: longnode
I am trying to extract 27 DynamoDB tables from a single database using the Visual editor in AWS Glue. I have successfully crawled the database, and my workflow for the job is:
Extract from Source table (DynamoDB).
Apply Transform (usually 1:1 and…

Ross Alxndr
- 73
- 2
- 4
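Errors like the one in the title above often come from a column whose catalog type (string) disagrees with what DynamoDB actually returns (numbers). One hedged way out is to collapse the ambiguous column to a single type with DynamicFrame.resolveChoice before the 1:1 mapping; the column name below is a placeholder, not taken from the question:

```python
def force_long(dyf, column):
    """Collapse an ambiguous column to a single long type.

    A sketch using DynamicFrame.resolveChoice with a cast spec;
    the column name is illustrative.
    """
    return dyf.resolveChoice(specs=[(column, "cast:long")])
```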
2
votes
0 answers
AWS Glue script missing a Parquet file
An AWS Glue script written in PySpark usually works well and creates Parquet files, but occasionally a Parquet file is missing. How can I prevent or mitigate the missing data?
The pertinent code is:
FinalDF.write.partitionBy("Year",…

Judy K
- 31
- 2
2
votes
1 answer
Unable to add/import additional Python library datacompy in AWS Glue
I am trying to import an additional Python library, datacompy, into a Glue version 2 job with the following steps:
Open the AWS Glue console.
Under Job parameters, added the following:
For Key, added --additional-python-modules.
For Value, added…

cloud_hari
- 147
- 1
- 8
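Spelled out, the job parameter from the steps above looks like this; a version pin such as `datacompy==0.8.4` can also be used as the value (the exact version here is illustrative):

```
Key:   --additional-python-modules
Value: datacompy
```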
2
votes
2 answers
GlueJobRunnerSession is not authorized to perform: lakeformation:GetDataAccess on resource
I am trying to use the glueContext.purge_table function in my AWS Glue job. Whenever the job is executed, it throws the following error:
An error occurred while calling o82.purgeTable.
: java.lang.RuntimeException: class…

Nabeel Khan Ghauri
- 125
- 1
- 4
- 15
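In accounts where Lake Formation manages the data location, purge_table goes through Lake Formation, so the job role typically needs both an IAM statement like the hedged sketch below and a grant on the table in Lake Formation itself (this is a common first step, not a guaranteed fix):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lakeformation:GetDataAccess",
      "Resource": "*"
    }
  ]
}
```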
2
votes
1 answer
PySpark dataframe: remove duplicates in AWS Glue script
I have a script in an AWS Glue ETL job that reads an S3 bucket containing many Parquet files and sorts them by key1, key2, and a timestamp field. After that, the script deletes the duplicates and saves a single Parquet file to another S3 bucket.
Look at the data…

Murillo Mamud
- 119
- 1
- 8
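The usual Spark shape for "keep the newest row per key" is a window (`row_number()` over `partitionBy(key1, key2).orderBy(ts.desc())`, keep `rn == 1`). A plain-Python mirror of that logic, with placeholder field names, looks like this:

```python
def dedup_latest(rows, keys, ts_field):
    """Keep one row per key tuple: the one with the greatest timestamp.

    Plain-Python mirror of the Spark window trick; field names are placeholders.
    """
    best = {}
    for row in rows:
        k = tuple(row[key] for key in keys)
        if k not in best or row[ts_field] > best[k][ts_field]:
            best[k] = row
    return list(best.values())
```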
2
votes
0 answers
Overwrite mode in Spark causing issues
I am running an AWS PySpark Glue job where I read the raw S3 path into which data has been loaded from Redshift, and I do some transformations on top of it. Below is my code:
data = spark.read.parquet(rawPath) # complete dataset.…

whatsinthename
- 1,828
- 20
- 59
2
votes
1 answer
Is it possible to read a fixed-length file in AWS Glue directly without using a crawler?
Is it possible to read a fixed-length file in AWS Glue using DynamicFrameReader from_options without using crawlers?
I found the solution below using Spark, but is there a way to do this in Glue directly?
pyspark parse fixed width text file

Aji C S
- 71
- 7
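The Spark approach the question links to slices a single text column with substr. The same width spec can be kept in one place and reused; below is a pure-Python mirror of the slicing, with the pyspark version sketched in comments (column names and widths are illustrative):

```python
def fixed_width_slices(widths):
    """Turn an ordered (name, width) spec into (name, start, length) triples."""
    spec, pos = [], 0
    for name, width in widths:
        spec.append((name, pos, width))
        pos += width
    return spec

def parse_line(line, widths):
    """Slice one fixed-width record into a dict (pure-Python mirror of substr)."""
    return {name: line[start:start + length].strip()
            for name, start, length in fixed_width_slices(widths)}

# In the Glue/Spark job the same spec drives the dataframe version:
#   df = spark.read.text(path)
#   for name, start, length in fixed_width_slices(widths):
#       df = df.withColumn(name, df["value"].substr(start + 1, length))  # substr is 1-based
```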
2
votes
2 answers
End/exit a Glue job programmatically
I am using Glue bookmarking to process data. My job is scheduled daily, but it can also be launched "manually". Since I use bookmarks, the Glue job can sometimes start without any new data to process, and the read dataframe is then empty. In this…

Jérémy
- 1,790
- 1
- 24
- 40
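One commonly reported pattern for the situation above is a hedged sketch like this: commit the job first so the bookmark state is preserved, then exit with code zero, which is generally reported to let Glue mark the run as succeeded (names are illustrative):

```python
import sys

def stop_if_empty(row_count, job):
    """End a bookmarked Glue run early when there is nothing to process.

    Hedged sketch: commit first to preserve bookmark state, then exit 0.
    """
    if row_count == 0:
        job.commit()
        sys.exit(0)
```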
2
votes
1 answer
AWS Glue - Convert the JSON response from a GET (REST API) request to a DataFrame/DynamicFrame and store it in an S3 bucket
headersAPI = {
'Content-Type': 'application/json'
, 'accept': 'application/json'
,'Authorization': 'Bearer…

Chandar
- 31
- 4
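One hedged shape for the task in the title above: normalize the response body into a list of dicts, then hand that to Spark. The `record_key` and the Glue write shown in comments are assumptions about the payload and destination, not taken from the question:

```python
import json

def payload_to_rows(body, record_key=None):
    """Normalize a JSON API response body into a list of dicts that
    spark.createDataFrame (or DynamicFrame.fromDF) can ingest.

    record_key is an assumption about where the record array lives.
    """
    doc = json.loads(body)
    if record_key is not None:
        doc = doc[record_key]
    if isinstance(doc, dict):
        doc = [doc]
    return [dict(r) for r in doc]

# In the Glue job (sketch, not runnable locally; names illustrative):
#   resp = requests.get(url, headers=headersAPI)
#   df = spark.createDataFrame(payload_to_rows(resp.text, "results"))
#   glueContext.write_dynamic_frame.from_options(
#       frame=DynamicFrame.fromDF(df, glueContext, "api"),
#       connection_type="s3",
#       connection_options={"path": "s3://bucket/prefix"},
#       format="json")
```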
2
votes
1 answer
How to join / concatenate / merge all rows of an RDD in PySpark / AWS Glue into one single long line?
I have a protocol that needs to take in many (read: millions of) records. The protocol requires that all of the data be sent as a single line feed (InfluxDB / QuestDB). Using the InfluxDB client isn't currently an option, so I need to do this via a socket.
I am at…

the1dv
- 893
- 7
- 14
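The join itself is trivial; the scaling concern in the question above is collecting millions of rows to the driver to build one string. A hedged sketch, with the per-partition Spark variant in comments (the `send` function is a placeholder for the socket write):

```python
def to_line_protocol_payload(records):
    """Join record strings into one newline-delimited payload, as the
    InfluxDB/QuestDB line protocol expects (one record per line)."""
    return "\n".join(records) + "\n"

# Building one giant string on the driver rarely scales; sending one
# payload per partition over the socket is the usual compromise:
#   rdd.foreachPartition(lambda rows: send(to_line_protocol_payload(list(rows))))
```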
2
votes
0 answers
NullPointerException on processing Glue job
I am facing a problem with AWS Glue. The code imports two dataframes from hundreds of small Parquet files, using:
context.create_dynamic_frame_from_options(...)
The process completes successfully and the data is cleaned with null/duplicate values…

Jaco Van Niekerk
- 4,180
- 2
- 21
- 48
2
votes
1 answer
Find or recover a deleted AWS Glue job
I have accidentally deleted an AWS Glue job, but I don't remember which one. Can I check in some logs which job I deleted, and can I recover it?
user13067694
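Glue has no recycle bin, but the deletion is an API call, so CloudTrail can at least name the job. A hedged CLI sketch (the region is a placeholder); note that the job's script file usually survives at its S3 ScriptLocation even after the job definition is gone:

```shell
# Find recent DeleteJob calls in CloudTrail; the event details include the job name.
aws cloudtrail lookup-events \
  --region eu-west-1 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteJob \
  --query 'Events[].CloudTrailEvent'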