Questions tagged [aws-glue-spark]
244 questions
0
votes
2 answers
Loop through multiple tables from a source to S3 using Glue (Python/PySpark) via a configuration file?
I am looking to ingest multiple tables from a relational database into S3 using Glue. The table details are present in a configuration file, which is a JSON file. It would be helpful to have code that can loop through multiple table…
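
A minimal sketch of one way to drive such a loop from a JSON config stored in S3; the bucket, key, field names, and connection details below are assumptions for illustration, not from the question:

import json
import boto3
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical config location and layout:
# [{"jdbc_url": "...", "schema": "...", "table": "...", "user": "...", "password": "...", "s3_prefix": "..."}, ...]
config_body = boto3.client("s3").get_object(
    Bucket="my-config-bucket", Key="configs/tables.json"
)["Body"].read()
tables = json.loads(config_body)

for entry in tables:
    # Read one table over JDBC using the details from the config
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options={
            "url": entry["jdbc_url"],
            "dbtable": f'{entry["schema"]}.{entry["table"]}',
            "user": entry["user"],
            "password": entry["password"],
        },
    )
    # Write it out as Parquet under a per-table prefix
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": f's3://my-target-bucket/{entry["s3_prefix"]}'},
        format="parquet",
    )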

RJ7
- 9
- 5
0
votes
1 answer
How to dynamically specify an S3 path using Glue?
I am writing some files from a relational database source to S3 using Glue. I would like the S3 path to be in the format bucket_name/database/schema/table/year/month/day.
I am reading the bucket_name, database, schema, table name from a…
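
A minimal sketch of building such a path from configuration values plus the current date; all of the names below are placeholders:

from datetime import datetime, timezone

# Hypothetical values that would come from the configuration file
bucket_name = "my-bucket"
database = "sales_db"
schema = "public"
table = "orders"

now = datetime.now(timezone.utc)
s3_path = (
    f"s3://{bucket_name}/{database}/{schema}/{table}/"
    f"{now:%Y}/{now:%m}/{now:%d}"
)

# s3_path can then be handed to the Glue writer, e.g.:
# glueContext.write_dynamic_frame.from_options(
#     frame=dyf, connection_type="s3",
#     connection_options={"path": s3_path}, format="parquet")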

RJ7
- 9
- 5
0
votes
0 answers
Job aborted while writing Parquet to S3 via Glue jobs
My code, which consists of transformations, looks like this:
dictionaryDf = spark.read.option("header", "true").csv(
"s3://...../.csv")
web_notif_data = fullLoad.cache()
…
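
Without the full stack trace it is hard to say why the job aborted; for reference, a minimal sketch of the write the question appears to be attempting, with hypothetical paths and an explicit repartition so no single task holds an oversized partition:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input locations standing in for the truncated paths in the question
dictionary_df = spark.read.option("header", "true").csv("s3://example-bucket/lookups/dictionary.csv")
full_load = spark.read.parquet("s3://example-bucket/input/")

web_notif_data = full_load.cache()

# Repartition before writing Parquet to S3
(web_notif_data
    .repartition(50)
    .write
    .mode("overwrite")
    .parquet("s3://example-bucket/output/web_notif_data/"))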

whatsinthename
- 1,828
- 20
- 59
0
votes
1 answer
How to set a specific compression value in AWS Glue? If possible, can the compression level and partitions be set manually in AWS Glue?
I am looking to ingest data from a source into S3 using AWS Glue.
Is it possible to compress the ingested data in Glue to a specified size? For example: compress the data to 500 MB, and also be able to partition the data based on the compression value provided?…
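
As far as I know, Spark/Glue does not expose a "target output size in MB" setting directly; a common workaround is to pick a compression codec and approximate the file size by controlling the number of output files. A rough sketch with made-up numbers and paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://example-bucket/staging/")  # hypothetical input

# Choose a compression codec for the Parquet output (snappy, gzip, zstd, ...)
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

# Approximate a target file size by controlling the number of output files:
# e.g. if the dataset is ~5 GB compressed and ~500 MB files are wanted, use ~10 partitions.
target_files = 10
(df.repartition(target_files)
   .write
   .mode("overwrite")
   .parquet("s3://example-bucket/curated/"))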

RJ7
- 9
- 5
0
votes
1 answer
How to rename output files written by an AWS Glue script to an S3 location using PySpark?
I am looking to rename the output files written to S3 by AWS Glue in PySpark.
If there is code I can refer to for renaming files in S3 after the Glue job run, that would be really helpful.
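
Spark/Glue writers do not let you name the part files directly; one common pattern is to rename the objects (copy + delete) with boto3 after the write finishes. A sketch, with hypothetical bucket and prefix names:

import boto3

s3 = boto3.client("s3")
bucket = "example-bucket"          # hypothetical
prefix = "output/run_2021_01_01/"  # hypothetical prefix the Glue job wrote to

# List the part files the job produced and copy them to friendlier names
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for i, obj in enumerate(response.get("Contents", [])):
    key = obj["Key"]
    if not key.endswith(".parquet"):
        continue
    new_key = f"{prefix}my_table_part_{i:04d}.parquet"
    s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=new_key)
    s3.delete_object(Bucket=bucket, Key=key)  # remove the original part-xxxx file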

RJ7
- 9
- 5
0
votes
1 answer
Partition the data frame using column X and write the data without column X
How can I partition by column X and write the data without the column X values?
I have a data frame with two columns, with values as shown below.
pkey string, output_value string
Values:
pkey ===== output_value
100 =====…
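
Note that Spark's partitionBy already behaves this way: the partition column's values move into the directory path (…/pkey=100/…) and are not written inside the data files. A small sketch using the two-column layout from the question and a hypothetical output path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("100", "a"), ("100", "b"), ("200", "c")],
    ["pkey", "output_value"],
)

# partitionBy moves pkey into the path (.../pkey=100/part-*.parquet);
# the files themselves contain only output_value.
df.write.mode("overwrite").partitionBy("pkey").parquet("s3://example-bucket/out/")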

user3404493
- 31
- 6
0
votes
0 answers
transformation_ctx value is not stored for incremental loads in the Glue job temp dir
I am trying to load incremental data from Redshift to S3. I have set up redshift_temp_dir and a temp dir for the Glue job (using the Glue console).
Below is my code:
my_conn_options = {
"url": "",
"dbtable": "",
"user":…

whatsinthename
- 1,828
- 20
- 59
0
votes
1 answer
How to filter bad records while writing to an RDS (Postgres) table via a Glue ETL job
I am doing Glue ETL processing which basically does the following:
Read a file from S3 (via the Glue Catalog)
Transform the data (add/delete columns)
Write the data to an RDS Postgres table (also via the Glue Catalog)
args = getResolvedOptions(sys.argv,…
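
One way to drop bad rows before the JDBC write is Glue's Filter transform (a plain DataFrame filter works too); the catalog names, columns, and validation rules below are made up:

from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog source
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="etl_db", table_name="s3_input_table"
)

# Keep only rows that satisfy the target table's constraints
# (e.g. non-null id and an email short enough for the column)
good_records = Filter.apply(
    frame=dyf,
    f=lambda row: row["id"] is not None
    and row["email"] is not None
    and len(row["email"]) <= 255,
)

# Write the filtered rows to the Postgres table registered in the catalog
glue_context.write_dynamic_frame.from_catalog(
    frame=good_records,
    database="etl_db",
    table_name="rds_target_table",
)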

Vishal Mishra
- 3
- 3
0
votes
0 answers
I am reading two Parquet files in the same folder; both have the same columns, but a few columns' datatypes do not match
Unable to load the same columns from two files under one folder. A few of the columns are bigint in one file and double in the other, due to which I am facing an error when reading the files at folder level in Glue using PySpark.
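
One workaround is to read each file separately, cast the conflicting columns to the wider type (e.g. double), and then union; the file paths and column name below are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical files whose "amount" column is bigint in one and double in the other
df_a = spark.read.parquet("s3://example-bucket/data/file_a.parquet")
df_b = spark.read.parquet("s3://example-bucket/data/file_b.parquet")

# Cast the mismatched column(s) to the wider type in both frames
df_a = df_a.withColumn("amount", F.col("amount").cast("double"))
df_b = df_b.withColumn("amount", F.col("amount").cast("double"))

# Now the schemas agree and the frames can be combined
df = df_a.unionByName(df_b)
df.printSchema()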

Iram
- 21
- 6
0
votes
1 answer
Incremental data load from Redshift to S3 using PySpark and Glue jobs
I have created a pipeline where the data ingestion takes place between Redshift and S3. I was able to do the complete load using the below method:
def readFromRedShift(spark: SparkSession, schema, tablename):
table = str(schema) + str(".") +…
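
One common approach (separate from job bookmarks) is to push a watermark filter into the Redshift query itself, so each run reads only rows newer than the last load. A sketch over plain Spark JDBC; the URL, credentials, watermark column, and driver are assumptions, and the Redshift JDBC driver jar must be available to the job:

from pyspark.sql import SparkSession

def read_from_redshift_incremental(spark, schema, tablename, last_loaded_ts):
    # Hypothetical watermark column "updated_at"
    query = (
        f"SELECT * FROM {schema}.{tablename} "
        f"WHERE updated_at > '{last_loaded_ts}'"
    )
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:redshift://example-cluster:5439/dev")  # hypothetical
        .option("query", query)
        .option("user", "example_user")
        .option("password", "example_password")
        .option("driver", "com.amazon.redshift.jdbc42.Driver")
        .load()
    )

spark = SparkSession.builder.getOrCreate()
df = read_from_redshift_incremental(spark, "public", "orders", "2021-01-01 00:00:00")
df.write.mode("append").parquet("s3://example-bucket/incremental/orders/")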

whatsinthename
- 1,828
- 20
- 59
0
votes
1 answer
PySpark: turn a list into a dictionary in a specific column
I have a Spark dataframe that looks like this in JSON:
{
"site_id": "ABC",
"region": "Texas",
"areas": [
{
"Carbon": [
"ABB",
"ABD",
"ABE"
]
}
],
"site_name": "ABC"
}
and I need to turn "areas"…
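
If the goal is for "areas" to be a proper map (dictionary) rather than a struct, one option is to re-read the JSON with an explicit schema that models each area as a MapType. The field names follow the sample document; the input path is a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, MapType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Model each element of "areas" as a map: {"Carbon": ["ABB", "ABD", "ABE"]}
schema = StructType([
    StructField("site_id", StringType()),
    StructField("region", StringType()),
    StructField("areas", ArrayType(MapType(StringType(), ArrayType(StringType())))),
    StructField("site_name", StringType()),
])

df = spark.read.schema(schema).json("s3://example-bucket/sites/")  # hypothetical path
df.printSchema()
# areas: array<map<string, array<string>>> -- each entry is now a real dictionary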

akyayik
- 664
- 1
- 8
- 25
0
votes
1 answer
Data truncation error in AWS Glue job while transferring data from S3 to Aurora
I am trying to transfer my data from an S3 bucket (address.csv) to AWS Aurora (MySQL) using AWS Glue. When I use the following script for the transfer, one of the columns, named "po_box_number", which is a varchar of length 10, gives me an error saying "An…
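
If widening the Aurora column is not an option, one workaround is to trim the offending column to the target length before the write; the input path is a placeholder and the length comes from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true").csv("s3://example-bucket/address.csv")  # hypothetical path

# The Aurora column is varchar(10); keep only the first 10 characters so the insert no longer truncates
df = df.withColumn("po_box_number", F.substring(F.col("po_box_number"), 1, 10))

Alternatively, altering the target column to a larger varchar in Aurora avoids losing data.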

parth222
- 1
0
votes
0 answers
How to transfer data from DocumentDB to Redshift using AWS Glue (ETL jobs)
I am currently transferring data from DocumentDB to S3, but I am not able to transfer it to Redshift. Maybe I missed something, so can you please specify the steps to follow so that I can transfer it to Redshift?
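
A rough sketch of the missing leg: once the data is in a DynamicFrame, it can be written to Redshift through a Glue connection with a Redshift temp dir, so Glue stages the rows in S3 and COPYs them in. The connection names, URIs, and table names below are placeholders, and credentials/SSL options are omitted (they would normally come from the Glue connection):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical read from the DocumentDB collection
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="documentdb",
    connection_options={
        "uri": "mongodb://example-docdb-cluster:27017",  # hypothetical
        "database": "appdb",
        "collection": "events",
    },
)

# Write to Redshift via a Glue connection; redshift_tmp_dir is the S3 staging area for COPY
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",  # name of a Glue connection (hypothetical)
    connection_options={"dbtable": "public.events", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/redshift-temp/",
)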

aditya Damera
- 100
- 1
- 5
0
votes
1 answer
How to pass Glue annotations for multiple inputs in AWS for multiple input sources
The Glue diagram is generated as per the annotations passed to it, and edges are created as per the @inputs frame value passed. I want to generate a diagram that takes multiple inputs, as there should be multiple edges coming into the vertex for each source…
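
In Glue's auto-generated scripts, the diagram and its edges are driven by the ## comment annotations, and for a transform that takes more than one frame (such as Join) the @inputs line lists every input frame. A sketch following the pattern of those generated scripts; the database, table, and key names are placeholders:

from awsglue.context import GlueContext
from awsglue.transforms import Join
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

## @type: DataSource
## @args: [database = "analytics_db", table_name = "source_a", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="source_a", transformation_ctx="datasource0")

## @type: DataSource
## @args: [database = "analytics_db", table_name = "source_b", transformation_ctx = "datasource1"]
## @return: datasource1
## @inputs: []
datasource1 = glueContext.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="source_b", transformation_ctx="datasource1")

## @type: Join
## @args: [keys1 = ["id"], keys2 = ["id"]]
## @return: join1
## @inputs: [frame1 = datasource0, frame2 = datasource1]
join1 = Join.apply(frame1=datasource0, frame2=datasource1, keys1=["id"], keys2=["id"],
                   transformation_ctx="join1")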

gunish jha
- 11
- 5
0
votes
0 answers
Glue data write to Redshift too slow
I am running a PySpark Glue job with 10 DPUs; the data in S3 is around 45 GB, split into 6 .csv files.
First question:
It is taking a lot of time to write data to Redshift from Glue, even though I am running 10 DPUs.
Second:
How can I make it more…
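
For reference, a sketch of a write path that lets Glue stage the data in S3 and load Redshift with COPY, which is generally much faster than row-by-row JDBC inserts; the connection name, paths, and partition count are guesses for illustration:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical 45 GB CSV input; more partitions means more parallel work across executors
df = spark.read.option("header", "true").csv("s3://example-bucket/input/")
df = df.repartition(60)

dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# Writing through a Glue Redshift connection with redshift_tmp_dir makes Glue
# unload to S3 and COPY into Redshift instead of doing slow per-row inserts.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",  # hypothetical Glue connection name
    connection_options={"dbtable": "public.target_table", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/redshift-temp/",
)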

Aditya Verma
- 201
- 4
- 14