Questions tagged [aws-glue-spark]
244 questions
0
votes
1 answer
AWS Glue E2E implementation and architecture
We are currently migrating from a legacy on-premises data warehousing solution (based on IBM DataStage) to a cloud-based solution.
We have files with incremental information from heterogeneous sources that we need to load into our…

dashsidd1
0
votes
1 answer
AWS Glue ETL to Redshift: DATE
I am using AWS Glue to ETL data to Redshift. I have been encountering an issue where my date is loading as null in Redshift.
What I have set up:
Upload CSV into S3, see sample data:
item | color | price | date
shirt| brown | 25.05 |…

TechNewbie
0
votes
1 answer
NoClassDefFoundError: scala/Product$class when calling AWS Glue libraries in Scala code locally; the same JAR works as a Glue job on AWS
I am using Spark with Scala, along with the AWS Glue libraries for a Glue script.
When I use Scala version 2.12, I get this error.
Error with version 2.12:
import com.amazonaws.services.glue.{DataSource, DynamicFrame,…

Abhishek Kumar
0
votes
0 answers
AWS Glue - facing memory issue in join and storing it in S3
I have around 200 compressed JSON Lines files, each 500 MB, stored in S3. Each JSON line has empid, empname, and an array column that holds raw resume text content. Each file has 2500 JSON lines. I'm creating a dynamic frame,
f1 =…

Sathishkumar Jayaraj
0
votes
1 answer
Using arguments with Glue pyspark
Intro
I have a Docker container configured with the Glue ETL PySpark environment, thanks to this AWS Glue tutorial.
I used the "hellowrold.py":
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import…

Jérémy
0
votes
1 answer
PySpark SQL - Nested array conditional select into a new column
I have the following schema:
root
|-- event_params: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: struct (nullable = true)
| | | |-- string_value: string (nullable =…

akpatch
0
votes
1 answer
AWS Glue: drop partition using Spark SQL
Dropping a partition using Spark SQL from Glue metadata throws errors, while the same code works in the Hive shell.
**Hive shell**
hive> alter table prc_db.detl_stg drop IF EXISTS partition(prc_name="dq") ;
OK
Time taken: 1.013 seconds
**spark…

user3858193
0
votes
1 answer
How to subtract a week from rundate in AWS Glue PySpark
I have a scenario where a rundate value is passed to an AWS Glue job in 'YYYY-MM-DD' format.
Let's say 2021-04-19.
Now, I am reading this rundate as 'datetime.strptime(rundate, "%y-%m-%d")'
But now I want to create 2 variables out of it…

Beginner
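For the rundate question above, here is a minimal sketch using only the standard library. Note that `%y` matches a two-digit year, so a value like 2021-04-19 needs `%Y` instead. The helper name `week_window` and the choice of the two derived values (the run date itself and the date seven days earlier) are assumptions, since the excerpt is cut off before it says what the two variables should hold.

```python
from datetime import datetime, timedelta


def week_window(rundate: str):
    """Return (run_date, week_before) as date objects for a 'YYYY-MM-DD' string."""
    # %Y is the four-digit year; %y would reject a value such as '2021-04-19'.
    run_date = datetime.strptime(rundate, "%Y-%m-%d").date()
    # Subtracting a timedelta handles month and year boundaries correctly.
    week_before = run_date - timedelta(days=7)
    return run_date, week_before


run_date, week_before = week_window("2021-04-19")
print(run_date.isoformat(), week_before.isoformat())  # 2021-04-19 2021-04-12
```

In a Glue job the input string would typically come from the job arguments rather than a literal.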
0
votes
1 answer
Glue - Bookmark doesn't recognize files in newer partitions
I have a Glue job that reads from an S3 bucket, does transformations, and uploads the result to another S3 bucket.
Here's what my aws glue get-job-bookmark --job-name xx returns
JobBookmark":…

user3726933
0
votes
1 answer
Remove last delimiter from a .TXT file in PySpark
I have an S3 file generated by a different system, as shown below:
A1|~|B1|~|C1|~|D1|~|
A2|~|B2|~|C2|~|D2|~|
A3|~|B3|~|C3|~|D3|~|
A4|~|B4|~|C4|~|D4|~|
Now, while reading this file in an AWS Glue PySpark script, I want to remove the last…

Beginner
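For the trailing-delimiter question above, one possible approach is an anchored regular expression that removes a single `|~|` only when it ends the line. This is a plain-Python sketch, assuming the delimiter shown in the sample rows; the helper name `strip_trailing_delimiter` is hypothetical.

```python
import re

DELIM = "|~|"  # delimiter taken from the sample rows in the question

# Escape the delimiter (| is a regex metacharacter) and anchor it to the line end.
_TRAILING = re.compile(re.escape(DELIM) + r"$")


def strip_trailing_delimiter(line: str) -> str:
    """Remove one trailing delimiter, leaving the rest of the line untouched."""
    return _TRAILING.sub("", line)


rows = ["A1|~|B1|~|C1|~|D1|~|", "A2|~|B2|~|C2|~|D2|~|"]
cleaned = [strip_trailing_delimiter(r).split(DELIM) for r in rows]
print(cleaned)  # [['A1', 'B1', 'C1', 'D1'], ['A2', 'B2', 'C2', 'D2']]
```

In a PySpark job the same anchored pattern could probably be applied to a raw text column with `regexp_replace` from `pyspark.sql.functions` before splitting, though that variant is not shown or tested here.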
0
votes
2 answers
Matching up arrays in PySpark
I am trying to manipulate two dataframes using PySpark as part of an AWS Glue job.
df1:
item tag
1 AB
2 CD
3 EF
4 QQ
df2:
key1 key2 tags
A1 B1 [AB]
A1 B2 [AB, CD, EF]
A2 B1 [CD, EF]
A2…

Jaco Van Niekerk
0
votes
1 answer
Consolidating Many Data Files into One Using Glue - Job Succeeds But Without Output Files
TL;DR
I'm trying to consolidate many S3 data files into fewer files using a Glue [Studio] job
Input data is Catalogued in Glue and queryable via Athena
Glue Job runs with "Succeeded" output status, but no output files are created
Details
Input…

Matt
0
votes
1 answer
Is it possible to stream AWS CloudWatch logs
I know it is possible to stream CloudWatch Logs data to Amazon Elasticsearch Service; it is documented here. But is it possible to stream the log data to a custom AWS Glue job, or to an EMR job?

Jimson James
0
votes
1 answer
How do I save a machine learning model (KMeans) to S3 from a Glue ETL job written in PySpark?
I tried model.save(sc, path), but it gives me the error: TypeError: save() takes 2 positional arguments but 3 were given. Here sc is the SparkContext [sc = SparkContext()].
I tried without sc in the signature but got this error: An error occurred while…

Manaal Soni
0
votes
1 answer
Run Glue job only when data is updated
I have a Glue job that transfers data from S3 to Redshift. I want to schedule it so that it runs every time the data in S3 is re-uploaded or updated. How can I do so? I tried the code solution here and made a Lambda function: How to Trigger Glue…
user13067694
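For the S3-triggered run described above, a common pattern is an S3 event notification on object creation that invokes a Lambda function, which in turn starts the Glue job. The sketch below assumes that setup. The job name and the argument names `--s3_bucket` and `--s3_key` are hypothetical, and the optional `glue_client` parameter exists only so the handler can be exercised without AWS credentials.

```python
import urllib.parse

GLUE_JOB_NAME = "my-s3-to-redshift-job"  # hypothetical job name


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run for each S3 object named in the event."""
    if glue_client is None:
        import boto3  # imported lazily so the handler can be tested with a stub
        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Glue job arguments are passed with a leading "--".
            Arguments={"--s3_bucket": bucket, "--s3_key": key},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Because re-uploading an object also raises an object-created event, this would start a run on both new and overwritten files. If many files land at once, each one starts its own run, so the job's concurrency limit may need attention.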