Questions tagged [aws-glue-spark]
244 questions
0
votes
2 answers
Loop through multiple tables from a source to S3 using Glue (Python/PySpark) via a configuration file?
I am looking to ingest multiple tables from a relational database into S3 using Glue. The table details are present in a configuration file, which is a JSON file. It would be helpful to have code that can loop through multiple table…
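
A minimal sketch of one way to drive such a loop from a JSON config stored in S3; the bucket, key, field names, and connection details below are assumptions for illustration, not from the question:

import json
import boto3
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical config location and layout:
# [{"jdbc_url": "...", "schema": "...", "table": "...", "user": "...", "password": "...", "s3_prefix": "..."}, ...]
config_body = boto3.client("s3").get_object(
    Bucket="my-config-bucket", Key="configs/tables.json"
)["Body"].read()
tables = json.loads(config_body)

for entry in tables:
    # Read one table over JDBC using the details from the config
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options={
            "url": entry["jdbc_url"],
            "dbtable": f'{entry["schema"]}.{entry["table"]}',
            "user": entry["user"],
            "password": entry["password"],
        },
    )
    # Write it out as Parquet under a per-table prefix
    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": f's3://my-target-bucket/{entry["s3_prefix"]}'},
        format="parquet",
    )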

RJ7
- 9
- 5
0
votes
1 answer
How to dynamically specify an S3 path using Glue?
I am writing some files from a relational database source to S3 using Glue. I would like the S3 path to be in the format bucket_name/database/schema/table/year/month/day.
I am reading the bucket_name, database, schema, table name from a…
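
A minimal sketch of building such a path from configuration values plus the current date; all of the names below are placeholders:

from datetime import datetime, timezone

# Hypothetical values that would come from the configuration file
bucket_name = "my-bucket"
database = "sales_db"
schema = "public"
table = "orders"

now = datetime.now(timezone.utc)
s3_path = (
    f"s3://{bucket_name}/{database}/{schema}/{table}/"
    f"{now:%Y}/{now:%m}/{now:%d}"
)

# s3_path can then be handed to the Glue writer, e.g.:
# glueContext.write_dynamic_frame.from_options(
#     frame=dyf, connection_type="s3",
#     connection_options={"path": s3_path}, format="parquet")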

RJ7
- 9
- 5
0
votes
0 answers
Job aborted while writing Parquet to S3 via Glue jobs
My code, which consists of transformations, looks like this:
dictionaryDf = spark.read.option("header", "true").csv(
"s3://...../.csv")
web_notif_data = fullLoad.cache()
…
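
Without the full stack trace it is hard to say why the job aborted; for reference, a minimal sketch of the write the question appears to be attempting, with hypothetical paths and an explicit repartition so no single task holds an oversized partition:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input locations standing in for the truncated paths in the question
dictionary_df = spark.read.option("header", "true").csv("s3://example-bucket/lookups/dictionary.csv")
full_load = spark.read.parquet("s3://example-bucket/input/")

web_notif_data = full_load.cache()

# Repartition before writing Parquet to S3
(web_notif_data
    .repartition(50)
    .write
    .mode("overwrite")
    .parquet("s3://example-bucket/output/web_notif_data/"))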

whatsinthename
- 1,828
- 20
- 59
0
votes
1 answer
How to set a specific compression value in AWS Glue? If possible, can the compression level and partitions be set manually in AWS Glue?
I am looking to ingest data from a source into S3 using AWS Glue.
Is it possible to compress the ingested data in Glue to a specified size? For example: compress the data to 500 MB, and also be able to partition the data based on the compression value provided?…
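
As far as I know, Spark/Glue does not expose a "target output size in MB" setting directly; a common workaround is to pick a compression codec and approximate the file size by controlling the number of output files. A rough sketch with made-up numbers and paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://example-bucket/staging/")  # hypothetical input

# Choose a compression codec for the Parquet output (snappy, gzip, zstd, ...)
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

# Approximate a target file size by controlling the number of output files:
# e.g. if the dataset is ~5 GB compressed and ~500 MB files are wanted, use ~10 partitions.
target_files = 10
(df.repartition(target_files)
   .write
   .mode("overwrite")
   .parquet("s3://example-bucket/curated/"))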

RJ7
- 9
- 5
0
votes
1 answer
How to rename output files written by an AWS Glue script to an S3 location using PySpark?
I am looking to rename the output files written to S3 by AWS Glue in PySpark.
If there is code I can refer to for renaming files in S3 after the Glue job run, that would be really helpful.
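
Spark/Glue writers do not let you name the part files directly; one common pattern is to rename the objects (copy + delete) with boto3 after the write finishes. A sketch, with hypothetical bucket and prefix names:

import boto3

s3 = boto3.client("s3")
bucket = "example-bucket"          # hypothetical
prefix = "output/run_2021_01_01/"  # hypothetical prefix the Glue job wrote to

# List the part files the job produced and copy them to friendlier names
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for i, obj in enumerate(response.get("Contents", [])):
    key = obj["Key"]
    if not key.endswith(".parquet"):
        continue
    new_key = f"{prefix}my_table_part_{i:04d}.parquet"
    s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=new_key)
    s3.delete_object(Bucket=bucket, Key=key)  # remove the original part-xxxx file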

RJ7
- 9
- 5
0
votes
1 answer
Partition the data frame using column X and write the data without column X
How can I partition by column X and write the data without the column X values?
I have a data frame with two columns, with values as shown below.
pkey string, output_value string
Values:
pkey ===== output_value
100 =====…
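
Note that Spark's partitionBy already behaves this way: the partition column's values move into the directory path (…/pkey=100/…) and are not written inside the data files. A small sketch using the two-column layout from the question and a hypothetical output path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("100", "a"), ("100", "b"), ("200", "c")],
    ["pkey", "output_value"],
)

# partitionBy moves pkey into the path (.../pkey=100/part-*.parquet);
# the files themselves contain only output_value.
df.write.mode("overwrite").partitionBy("pkey").parquet("s3://example-bucket/out/")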

user3404493
- 31
- 6
0
votes
0 answers
transformation_ctx value is not stored for incremental loads in the Glue job temp dir
I am trying to load incremental data from Redshift to S3. I have set up redshift_temp_dir and a temp dir for the Glue job (using the Glue console).
Below is my code:
my_conn_options = {
"url": "",
"dbtable": "",
"user":…

whatsinthename
- 1,828
- 20
- 59
0
votes
1 answer
How to filter bad records while writing to an RDS (Postgres) table via a Glue ETL job
I am doing Glue ETL processing which basically does the following:
Read a file from S3 (via the Glue Catalog)
Transform the data (add/delete columns)
Write the data to an RDS Postgres table (also via the Glue Catalog)
args = getResolvedOptions(sys.argv,…
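
One way to drop bad rows before the JDBC write is Glue's Filter transform (a plain DataFrame filter works too); the catalog names, columns, and validation rules below are made up:

from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog source
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="etl_db", table_name="s3_input_table"
)

# Keep only rows that satisfy the target table's constraints
# (e.g. non-null id and an email short enough for the column)
good_records = Filter.apply(
    frame=dyf,
    f=lambda row: row["id"] is not None
    and row["email"] is not None
    and len(row["email"]) <= 255,
)

# Write the filtered rows to the Postgres table registered in the catalog
glue_context.write_dynamic_frame.from_catalog(
    frame=good_records,
    database="etl_db",
    table_name="rds_target_table",
)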

Vishal Mishra
- 3
- 3
0
votes
0 answers
I am reading two Parquet files in the same folder; both have the same columns, but a few columns' datatypes do not match
Unable to load the same columns from two files under one folder. A few of the columns are bigint in one file and double in the other, due to which I am facing an error when reading the files at folder level in Glue using PySpark.
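
One workaround is to read each file separately, cast the conflicting columns to the wider type (e.g. double), and then union; the file paths and column name below are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical files whose "amount" column is bigint in one and double in the other
df_a = spark.read.parquet("s3://example-bucket/data/file_a.parquet")
df_b = spark.read.parquet("s3://example-bucket/data/file_b.parquet")

# Cast the mismatched column(s) to the wider type in both frames
df_a = df_a.withColumn("amount", F.col("amount").cast("double"))
df_b = df_b.withColumn("amount", F.col("amount").cast("double"))

# Now the schemas agree and the frames can be combined
df = df_a.unionByName(df_b)
df.printSchema()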

Iram
- 21
- 6
0
votes
1 answer
Incremental data load from Redshift to S3 using PySpark and Glue jobs
I have created a pipeline where the data ingestion takes place between Redshift and S3. I was able to do the complete load using the below method:
def readFromRedShift(spark: SparkSession, schema, tablename):
table = str(schema) + str(".") +…
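
One common approach (separate from job bookmarks) is to push a watermark filter into the Redshift query itself, so each run reads only rows newer than the last load. A sketch over plain Spark JDBC; the URL, credentials, watermark column, and driver are assumptions, and the Redshift JDBC driver jar must be available to the job:

from pyspark.sql import SparkSession

def read_from_redshift_incremental(spark, schema, tablename, last_loaded_ts):
    # Hypothetical watermark column "updated_at"
    query = (
        f"SELECT * FROM {schema}.{tablename} "
        f"WHERE updated_at > '{last_loaded_ts}'"
    )
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:redshift://example-cluster:5439/dev")  # hypothetical
        .option("query", query)
        .option("user", "example_user")
        .option("password", "example_password")
        .option("driver", "com.amazon.redshift.jdbc42.Driver")
        .load()
    )

spark = SparkSession.builder.getOrCreate()
df = read_from_redshift_incremental(spark, "public", "orders", "2021-01-01 00:00:00")
df.write.mode("append").parquet("s3://example-bucket/incremental/orders/")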

whatsinthename
- 1,828
- 20
- 59
0
votes
1 answer
PySpark: turn a list into a dictionary in a specific column
I have a Spark dataframe that looks like this in JSON:
{
"site_id": "ABC",
"region": "Texas",
"areas": [
{
"Carbon": [
"ABB",
"ABD",
"ABE"
]
}
],
"site_name": "ABC"
}
and I need to turn "areas"…
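
If the goal is for "areas" to be a proper map (dictionary) rather than a struct, one option is to re-read the JSON with an explicit schema that models each area as a MapType. The field names follow the sample document; the input path is a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, MapType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Model each element of "areas" as a map: {"Carbon": ["ABB", "ABD", "ABE"]}
schema = StructType([
    StructField("site_id", StringType()),
    StructField("region", StringType()),
    StructField("areas", ArrayType(MapType(StringType(), ArrayType(StringType())))),
    StructField("site_name", StringType()),
])

df = spark.read.schema(schema).json("s3://example-bucket/sites/")  # hypothetical path
df.printSchema()
# areas: array<map<string, array<string>>> -- each entry is now a real dictionary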

akyayik
- 664
- 1
- 8
- 25
0
votes
1 answer
Data truncation error in AWS Glue job while transferring data from S3 to Aurora
I am trying to transfer my data from an S3 bucket (address.csv) to AWS Aurora (MySQL) using AWS Glue. When I use the following script for the transfer, one of the columns, named "po_box_number", which is a varchar of length 10, gives me an error saying "An…
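
If widening the Aurora column is not an option, one workaround is to trim the offending column to the target length before the write; the input path is a placeholder and the length comes from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", "true").csv("s3://example-bucket/address.csv")  # hypothetical path

# The Aurora column is varchar(10); keep only the first 10 characters so the insert no longer truncates
df = df.withColumn("po_box_number", F.substring(F.col("po_box_number"), 1, 10))

Alternatively, altering the target column to a larger varchar in Aurora avoids losing data.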

parth222
- 1
0
votes
0 answers
How to transfer data from DocumentDB to Redshift using AWS Glue (ETL jobs)
I am currently transferring data from DocumentDB to S3, but I am not able to transfer it to Redshift. Maybe I missed something, so can you please specify the steps to follow so that I can transfer it to Redshift?
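
A rough sketch of the missing leg: once the data is in a DynamicFrame, it can be written to Redshift through a Glue connection with a Redshift temp dir, so Glue stages the rows in S3 and COPYs them in. The connection names, URIs, and table names below are placeholders, and credentials/SSL options are omitted (they would normally come from the Glue connection):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical read from the DocumentDB collection
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="documentdb",
    connection_options={
        "uri": "mongodb://example-docdb-cluster:27017",  # hypothetical
        "database": "appdb",
        "collection": "events",
    },
)

# Write to Redshift via a Glue connection; redshift_tmp_dir is the S3 staging area for COPY
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",  # name of a Glue connection (hypothetical)
    connection_options={"dbtable": "public.events", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/redshift-temp/",
)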

aditya Damera
- 100
- 1
- 5
0
votes
1 answer
How to pass Glue annotations for multiple inputs in AWS for multiple input sources
The Glue diagram is generated as per the annotations passed to it, and edges are created as per the @inputs frame value passed. I want to generate a diagram that takes multiple inputs, as there should be multiple edges coming into the vertex for each source…
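
In Glue's auto-generated scripts, the diagram and its edges are driven by the ## comment annotations, and for a transform that takes more than one frame (such as Join) the @inputs line lists every input frame. A sketch following the pattern of those generated scripts; the database, table, and key names are placeholders:

from awsglue.context import GlueContext
from awsglue.transforms import Join
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

## @type: DataSource
## @args: [database = "analytics_db", table_name = "source_a", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="source_a", transformation_ctx="datasource0")

## @type: DataSource
## @args: [database = "analytics_db", table_name = "source_b", transformation_ctx = "datasource1"]
## @return: datasource1
## @inputs: []
datasource1 = glueContext.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="source_b", transformation_ctx="datasource1")

## @type: Join
## @args: [keys1 = ["id"], keys2 = ["id"]]
## @return: join1
## @inputs: [frame1 = datasource0, frame2 = datasource1]
join1 = Join.apply(frame1=datasource0, frame2=datasource1, keys1=["id"], keys2=["id"],
                   transformation_ctx="join1")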

gunish jha
- 11
- 5
0
votes
0 answers
Glue data write to Redshift too slow
I am running a PySpark Glue job with 10 DPUs; the data in S3 is around 45 GB, split into 6 .csv files.
First question:
It is taking a lot of time to write data to Redshift from Glue, even though I am running 10 DPUs.
Second:
How can I make it more…
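
For reference, a sketch of a write path that lets Glue stage the data in S3 and load Redshift with COPY, which is generally much faster than row-by-row JDBC inserts; the connection name, paths, and partition count are guesses for illustration:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical 45 GB CSV input; more partitions means more parallel work across executors
df = spark.read.option("header", "true").csv("s3://example-bucket/input/")
df = df.repartition(60)

dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# Writing through a Glue Redshift connection with redshift_tmp_dir makes Glue
# unload to S3 and COPY into Redshift instead of doing slow per-row inserts.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",  # hypothetical Glue connection name
    connection_options={"dbtable": "public.target_table", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/redshift-temp/",
)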

Aditya Verma
- 201
- 4
- 14