Questions tagged [aws-glue-spark]
244 questions
1 vote, 1 answer
Threading in AWS Glue
I have a piece of code that creates several threads on a Glue job like this:
threads = []
for data_chunk in data_chunks:
    json_data = get_bulk_upload_json(data_chunk)
…
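The rest of the loop is cut off above, so here is only a minimal sketch of the start-then-join pattern the excerpt seems to be heading toward. The body of `get_bulk_upload_json` and the `upload` worker are hypothetical stand-ins (neither is shown in the question), and the threads here run inside the single Python process that executes the script, not across Spark executors.

```python
import threading


def get_bulk_upload_json(data_chunk):
    # Hypothetical stand-in: the real function is not shown in the question.
    return {"rows": list(data_chunk)}


def upload(json_data, results, index):
    # Hypothetical worker: records how many rows it was handed.
    results[index] = len(json_data["rows"])


def run_in_threads(data_chunks):
    results = [None] * len(data_chunks)
    threads = []
    for index, data_chunk in enumerate(data_chunks):
        json_data = get_bulk_upload_json(data_chunk)
        thread = threading.Thread(target=upload, args=(json_data, results, index))
        thread.start()
        threads.append(thread)
    # Wait for every thread before reading the results.
    for thread in threads:
        thread.join()
    return results


results = run_in_threads([[1, 2], [3], [4, 5, 6]])
```

Each worker writes to its own slot in `results`, so no lock is needed in this sketch.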

rodrigocf
1 vote, 0 answers
Record larger than the Split size in AWS GLUE?
I'm a newbie in AWS Glue and Spark.
I built my ETL in it.
When I connect it to my S3 bucket with files of approximately 200 MB, it does not read them.
The error is:
An error was encountered:
An error occurred while calling o99.toDF.
: org.apache.spark.SparkException:…

Vitualizz Uzumaki
1 vote, 1 answer
Cast Issue with AWS Glue 3.0 - Pyspark
I'm using Glue 3.0
from pyspark.sql import functions as f
data = [("Java", "6241499.16943521594684385382059800664452")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF()
df.show()
df.select(f.col("_2").cast("decimal(15,2)")).show()
I get the following…

Smaillns
1 vote, 2 answers
AWS Glue NoClassDefFoundError on job.init()
I am trying to debug AWS Glue scripts locally using the Glue ETL library.
I have installed aws-glue-libs and spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz.
When I run job.init(), I get the following error trace:
py4j.protocol.Py4JJavaError: An error occurred while…

sheetal_158
1 vote, 0 answers
Exception: "SparkContext should only be created and accessed on the driver" while trying foreach()
Being new to Spark, I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read.
AFAIK, unfortunately, there's no way to do an update with DataFrameWriter, so I want to try querying the DB directly…

fracsinus
1 vote, 0 answers
Trying to run PySpark code on the aws_glue Docker image on a Mac
I get the following error.
The code failed because of a fatal error. Some things to try: a) Make sure Spark has enough available resources for Jupyter to create a Spark context. b) Contact your Jupyter administrator to make sure the Spark magics…

sheetal_158
1 vote, 3 answers
How to capture data changes in AWS Glue?
We have source data in an on-premises SQL Server. We are using AWS Glue to fetch data from SQL Server and place it in S3. Could anyone please explain how we can implement change data capture in AWS Glue?
Note: We don't want to use AWS DMS.

gourav vijayvargiya
1 vote, 1 answer
Using custom connector in AWS Glue ETL script
I am working on an AWS Glue ETL script using Glue's DynamicFrame abstraction and writing the code in Python.
I created a JDBC connection resource named sap-lpr-connection in the glue data catalog and would like to use it to retrieve the connection…

LazyEval
1 vote, 0 answers
Read schema from Glue Schema Registry with Pyspark and validate records
I am trying to read a schema from the AWS Glue Schema Registry and then validate incoming data from a Kafka topic. How can this be done in a Glue script?

user3082928
1 vote, 0 answers
Data import from MongoDB: duplicate columns
I'm trying to import data from MongoDB into an AWS Glue job and then into Redshift, but when loading from MongoDB I get this strange exception. Is there a way to fix this issue?
AnalysisException: Found duplicate column(s) in the data schema:…

Miroslav Petrovic
1 vote, 1 answer
Adding column to dataFrame
I need to add a new column to a DataFrame (DynamicFrame) based on JSON data from another column. What's the best way to do it?
schema:
'id' 'name' 'customJson'
--------------------------
1 ,John, {'key':'lastName','value':'Smith'}
after:
'id' 'name'…
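The "after" schema is cut off, so the following is only a sketch under an assumption: that the goal is to turn the `key` inside `customJson` into a new column holding `value`. It works on one record as a plain Python dict. The function names are hypothetical, and the fallback to `ast.literal_eval` is there because the sample value uses single quotes, which is not valid JSON.

```python
import ast
import json


def parse_custom_json(raw):
    # Try strict JSON first; fall back to a Python literal for single-quoted text.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return ast.literal_eval(raw)


def add_column_from_json(record, json_field="customJson"):
    # Assumption: the JSON has the shape {'key': <column name>, 'value': <cell value>}.
    parsed = parse_custom_json(record[json_field])
    out = dict(record)
    out[parsed["key"]] = parsed["value"]
    return out


row = {"id": 1, "name": "John", "customJson": "{'key':'lastName','value':'Smith'}"}
result = add_column_from_json(row)
```

A per-record function of this shape might be applied through a row-wise transform in the job, though that part is untested here.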

Miroslav Petrovic
1 vote, 0 answers
AWS Glue ETL Job - Connection Refused error (Catalog Table as input)
I am trying to run a Glue ETL job that takes as input a Glue Catalog table whose data is in S3.
I am getting the following error when running the job. The error seems to say that it is unable to connect to the Spark instance, but I am not sure…

Van
1 vote, 1 answer
AWS Glue null values are inserted on RDS as string
I created an AWS Glue job that loads data from a CSV file into a MySQL RDS database.
The data is loaded successfully, but all NULL values were inserted into the MySQL table as strings, not as NULL.
So if I query my table like select * from myTable where…

adaso
1 vote, 0 answers
How to prevent spark query against CSV glue catalog source from including headers?
I am attempting to build a Glue job that will execute a SQL query against an existing glue catalog, and store the results in another glue catalog (in the example below, only return the record with the highest cost for each value of sn.) When…

Brandon
1 vote, 0 answers
How to convert a string to a date when the year has two digits in PySpark on AWS Glue
I tried to convert a string in ddMMyy format to yyyyMMdd using the to_date function,
but Spark casts the date to the 1900s.
For example:
I tried to cast 150545 to 20450515 but got 19450515
#my_date = '150545'
df = df.withColumn('sorce_format', lit('ddMMyy'))
…
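One way to make the century choice explicit, sketched in plain Python rather than with `to_date`: split the ddMMyy string and pick the century from a pivot year. The function name and the default pivot of 50 are assumptions for illustration; the right pivot depends on the data.

```python
from datetime import date


def ddmmyy_to_yyyymmdd(value, pivot=50):
    # Split the six-character ddMMyy string into its parts.
    day, month, yy = int(value[0:2]), int(value[2:4]), int(value[4:6])
    # Assumption: two-digit years below the pivot belong to the 2000s, the rest to the 1900s.
    century = 2000 if yy < pivot else 1900
    # date() also validates the day and month.
    return date(century + yy, month, day).strftime("%Y%m%d")


converted = ddmmyy_to_yyyymmdd("150545")  # "20450515" with the default pivot
```

The same logic could probably be wrapped in a UDF or rewritten with column expressions, but that is not shown or tested here.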

Eriton Silva