Questions tagged [aws-glue-spark]

244 questions
2
votes
1 answer

Non-Partitioned Table Schema not updated with Glue ETL Job

We have an ETL job that uses the below code snippet to update the catalog table: sink = glueContext.getSink(connection_type='s3', path=config['glue_s3_path_bc'], enableUpdateCatalog=True,…
2
votes
0 answers

Glue secret manager integration: secretId is not provided

I am running the glue pyspark script from my local machine using the GlueETL library. When creating a dataframe from glue catalog, dyf_user_book_reading_stat = glueContext.create_dynamic_frame.from_catalog( database="xxx-db", …
2
votes
1 answer

AWS Glue - Job Monitoring: Job Execution, Active Executors and Maximum Needed Executors not showing

I have set up an ETL job in AWS Glue with the following settings: Glue v.3.0, Python v.3, Spark v.3.1 and Worker type G.1X with 10 Workers and Job metrics enabled. When I'm looking at the job metrics after the job is finished, I see in the Job…
Qwaz
  • 199
  • 9
2
votes
1 answer

Unsupported case of DataType: com.amazonaws.services.glue.schema.types.StringType@e7b95c9 and DynamicNode: longnode

I am trying to extract 27 DynamoDB tables from a single database using the visual editor in AWS Glue. I have successfully crawled the database, and my workflow for the job is: extract from the source table (DynamoDB), apply a transform (usually 1:1 and…
2
votes
0 answers

AWS Glue script missing a Parquet file

An AWS Glue script written in PySpark usually works great and creates Parquet files, but occasionally a Parquet file is missing. How can I prevent or mitigate missing data? The pertinent code is: FinalDF.write.partitionBy("Year",…
Judy K
  • 31
  • 2
2
votes
1 answer

Unable to add/import additional Python library datacompy in AWS Glue

I am trying to import an additional Python library, datacompy, into a Glue job that uses version 2, with the steps below: Open the AWS Glue console. Under Job parameters, added the following: For Key, added --additional-python-modules. For Value, added…
cloud_hari
  • 147
  • 1
  • 8
2
votes
2 answers

GlueJobRunnerSession is not authorized to perform: lakeformation:GetDataAccess on resource

I am trying to use the glueContext.purge_table function in my AWS Glue job. Whenever the job is executed, it throws the following error: An error occurred while calling o82.purgeTable. : java.lang.RuntimeException: class…
2
votes
1 answer

PySpark DataFrame: remove duplicates in an AWS Glue script

I have a script in an AWS Glue ETL job that reads an S3 bucket with a lot of Parquet files and sorts by key1, key2 and a timestamp field. After that, the script deletes the duplicates and saves a single Parquet file in another S3 bucket. Look at the data…
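The deduplication step described in this excerpt can be illustrated with a minimal plain-Python sketch. It assumes that the row with the most recent timestamp should survive for each (key1, key2) pair; the excerpt is truncated before it says which duplicate to keep, so that rule, the function name and the sample field names are assumptions for illustration only.

```python
from typing import Any, Dict, Iterable, List, Sequence


def drop_duplicates_keep_latest(
    rows: Iterable[Dict[str, Any]],
    key_fields: Sequence[str] = ("key1", "key2"),  # assumed key columns
    ts_field: str = "timestamp",                   # assumed timestamp column
) -> List[Dict[str, Any]]:
    """Keep one row per key, choosing the row with the greatest timestamp."""
    latest: Dict[tuple, Dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        kept = latest.get(key)
        # Replace the stored row only when this one is strictly newer.
        if kept is None or row[ts_field] > kept[ts_field]:
            latest[key] = row
    # Return the survivors ordered by key so the output is deterministic.
    return sorted(latest.values(), key=lambda r: tuple(r[f] for f in key_fields))
```

On a Spark DataFrame the same rule is commonly expressed with a window partitioned by the keys and ordered by timestamp descending, keeping the first row of each partition; whether that fits this job depends on details the excerpt does not show.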
2
votes
0 answers

Overwrite mode in Spark causing issues

I am running an AWS Glue PySpark job in which I read the raw S3 path where the data has been loaded from Redshift, and I apply some transformations on top of it. Below is my code: data = spark.read.parquet(rawPath) # complete dataset.…
2
votes
1 answer

Is it possible to read a fixed-length file in AWS Glue directly without using a crawler?

Is it possible to read a fixed-length file in AWS Glue using DynamicFrameReader from_options without using crawlers? I found the solution below using Spark, but is there a way to do this in Glue directly? pyspark parse fixed width text file
Aji C S
  • 71
  • 7
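For the fixed-length file question above, the parsing itself comes down to slicing each line at known offsets. The following is a minimal plain-Python sketch of that idea, assuming each record arrives as one text line; the column layout, names and widths are hypothetical and not taken from the question.

```python
from typing import Dict, List, Tuple

# Hypothetical layout for illustration: (column name, start offset, width).
LAYOUT: List[Tuple[str, int, int]] = [
    ("id", 0, 5),
    ("name", 5, 10),
    ("amount", 15, 8),
]


def parse_fixed_width(line: str, layout: List[Tuple[str, int, int]] = LAYOUT) -> Dict[str, str]:
    """Slice one fixed-width line into named fields, trimming padding spaces."""
    return {
        name: line[start:start + width].strip()
        for name, start, width in layout
    }
```

The same slicing can be applied per column once the file has been read as single-column text rows; this sketch does not settle whether from_options can do the split on its own.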
2
votes
2 answers

End/exit a glue job programmatically

I am using Glue bookmarking to process data. My job is scheduled every day, but can also be launched "manually". Since I use bookmarks, the Glue job can sometimes start without any new data to process, and the DataFrame that is read is then empty. In this…
Jérémy
  • 1,790
  • 1
  • 24
  • 40
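One way to picture the empty-input case from the question above is to put the job body in a function and return before any transform or write when nothing was read. This is only a sketch under assumptions: run_job, read_new_records and process are hypothetical names, and plain lists stand in for the DataFrame.

```python
from typing import Any, Callable, Iterable, List


def run_job(
    read_new_records: Callable[[], Iterable[Any]],  # hypothetical reader
    process: Callable[[List[Any]], None],           # hypothetical transform + write
) -> str:
    """Run the job body, ending early when there is nothing new to process."""
    records = list(read_new_records())
    if not records:
        # Nothing new since the last run: stop before any transform or write.
        return "skipped"
    process(records)
    return "processed"
```

Returning from a function keeps the exit path ordinary, so any cleanup the script performs after run_job still executes; how the real job should finalize its bookmark state is not covered here.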
2
votes
1 answer

AWS Glue - Convert the JSON response from a GET (REST API) request to a DataFrame/DynamicFrame and store it in an S3 bucket

headersAPI = { 'Content-Type': 'application/json' , 'accept': 'application/json' ,'Authorization': 'Bearer…
2
votes
1 answer

How to join / concatenate / merge all rows of an RDD in PySpark / AWS Glue into one single long line?

I have a protocol that needs to take in many (read: millions of) records. The protocol requires that all of the data be a single line feed (InfluxDB / QuestDB). Using the InfluxDB client isn't currently an option, so I need to do this via a socket. I am at…
the1dv
  • 893
  • 7
  • 14
2
votes
0 answers

NullPointerException when processing a Glue job

I am facing a problem with AWS Glue. The code imports two dataframes from 100s of small parquet files, using: context.create_dynamic_frame_from_options(...) The process completes successfully and the data is cleaned with null/duplicate values…
Jaco Van Niekerk
  • 4,180
  • 2
  • 21
  • 48
2
votes
1 answer

Find or recover a deleted AWS Glue job

I have accidentally deleted an AWS Glue job, but I don't remember which one. Can I check from some logs which job I deleted, and recover it?
user13067694