Questions tagged [aws-glue-spark]

244 questions
1
vote
1 answer

Threading in AWS Glue

I have a piece of code that creates several threads on a Glue job like this: threads = [] for data_chunk in data_chunks: json_data = get_bulk_upload_json(data_chunk) …
rodrigocf • 1,951 • 13 • 39 • 62
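
For the threading question above, one common way to fan work out over chunks is a thread pool from Python's standard library. The sketch below is a minimal illustration and not the asker's code: `get_bulk_upload_json` is a stand-in stub for the function named in the excerpt, and `upload_chunk`, `run_uploads`, and the `max_workers` value are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def get_bulk_upload_json(data_chunk):
    # Stand-in for the function named in the question; its real body is not shown there.
    return json.dumps({"records": list(data_chunk)})


def upload_chunk(data_chunk):
    # Hypothetical worker: build the payload and report its record count.
    # A real job would send json_data to its target service here.
    json_data = get_bulk_upload_json(data_chunk)
    return len(json.loads(json_data)["records"])


def run_uploads(data_chunks, max_workers=4):
    # The executor joins its threads on exit. map() returns results in input
    # order and re-raises a worker's exception when its result is consumed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_chunk, data_chunks))


if __name__ == "__main__":
    print(run_uploads([[1, 2, 3], [4, 5], [6]]))
```

Threads started this way run inside the driver's Python process only; they do not distribute work to Spark executors.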
1
vote
0 answers

Record larger than the Split size in AWS Glue?

I'm new to AWS Glue and Spark, and I'm building my ETL with them. When I connect to my S3 bucket, files of approximately 200 MB are not read. The error is: An error was encountered: An error occurred while calling o99.toDF. : org.apache.spark.SparkException:…
1
vote
1 answer

Cast Issue with AWS Glue 3.0 - Pyspark

I'm using Glue 3.0 data = [("Java", "6241499.16943521594684385382059800664452")] rdd = spark.sparkContext.parallelize(data) df = rdd.toDF() df.show() df.select(f.col("_2").cast("decimal(15,2)")).show() I get the following…
Smaillns • 2,540 • 1 • 28 • 40
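
As a point of comparison for the cast question above, the same string can be rounded to two decimal places with Python's standard `decimal` module. This is a plain-Python sketch, not Spark: the helper name `to_decimal_15_2` and the choice of half-up rounding are assumptions, and it says nothing about what Glue 3.0 actually returned.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_decimal_15_2(text):
    # Hypothetical helper: parse the string exactly, then round to scale 2.
    value = Decimal(text).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # Enforce precision 15: at most 15 digits in total, 2 of them after the point.
    if len(value.as_tuple().digits) > 15:
        raise ValueError(f"{text!r} does not fit decimal(15,2)")
    return value


if __name__ == "__main__":
    print(to_decimal_15_2("6241499.16943521594684385382059800664452"))
```

Comparing this value with what `df.select(...).show()` prints can help separate a rounding question from a parsing one.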
1
vote
2 answers

AWS Glue NoClassDefFoundError on job.init()

Trying to debug AWS Glue scripts locally using Glue ETL library. I have installed aws-glue-libs and spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz. When I run job.init(), I get the following error trace: py4j.protocol.Py4JJavaError: An error occurred while…
sheetal_158 • 7,391 • 6 • 27 • 44
1
vote
0 answers

Exception: "SparkContext should only be created and accessed on the driver" while trying foreach()

Being new to Spark, I need to read data from a MySQL DB and then update (or upsert) rows in another table based on what I've read. AFAIK, unfortunately, there's no way to do an update with DataFrameWriter, so I want to try querying the DB directly…
fracsinus • 11 • 1
1
vote
0 answers

Trying to run PySpark code on the aws_glue Docker image on a Mac

I get the following error: The code failed because of a fatal error. Some things to try: a) Make sure Spark has enough available resources for Jupyter to create a Spark context. b) Contact your Jupyter administrator to make sure the Spark magics…
1
vote
3 answers

How to capture data changes in AWS Glue?

We have source data in an on-premises SQL Server. We are using AWS Glue to fetch data from SQL Server and place it in S3. Could anyone please help with how we can implement change data capture in AWS Glue? Note: we don't want to use AWS DMS.
1
vote
1 answer

Using custom connector in AWS Glue ETL script

I am working on an AWS Glue ETL script using Glue's DynamicFrame abstraction and writing the code in Python. I created a JDBC connection resource named sap-lpr-connection in the Glue Data Catalog and would like to use it to retrieve the connection…
1
vote
0 answers

Read schema from Glue Schema Registry with Pyspark and validate records

I am trying to read a schema from the AWS Glue Schema Registry and then validate data coming in from a Kafka topic. How can this be done with a Glue script?
1
vote
0 answers

Data import from MongoDB: duplicate columns

I'm trying to import data from MongoDB into an AWS Glue job and then into Redshift, but when loading from MongoDB I get this strange exception. Is there a way to fix this issue? AnalysisException: Found duplicate column(s) in the data schema:…
1
vote
1 answer

Adding a column to a DataFrame

I need to add a new column to a DataFrame (DynamicFrame) based on JSON data from another column. What's the best way to do it? schema: 'id' 'name' 'customJson' -------------------------- 1 ,John, {'key':'lastName','value':'Smith'} after: 'id' 'name'…
1
vote
0 answers

AWS Glue ETL Job - Connection Refused error (Catalog Table as input)

I am trying to run a Glue ETL job whose input is a Glue Catalog table with its data in S3. I am getting the following error when running the job. The error seems to say that it is unable to connect to the Spark instance but I am not sure…
1
vote
1 answer

AWS Glue null values are inserted on RDS as string

I created an AWS Glue job that loads data from a CSV file into a MySQL RDS database. The data is loaded successfully, but all NULL values are inserted into the MySQL table as strings, not as NULL. So if I query my table like select * from myTable where…
1
vote
0 answers

How to prevent a Spark query against a CSV Glue catalog source from including headers?

I am attempting to build a Glue job that will execute a SQL query against an existing Glue catalog and store the results in another Glue catalog (in the example below, only return the record with the highest cost for each value of sn). When…
1
vote
0 answers

How to convert a string to a date when the year has two digits in PySpark on AWS Glue

I tried to convert a string in ddMMyy format to yyyyMMdd using the to_date function, but Spark casts the date to a 1900s year. For example, I tried to cast 150545 to 20450515 but got 19450515. #my_date = '150545' df = df.withColumn('sorce_format', lit('ddMMyy')) …
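
For the two-digit-year question above, the century can be made explicit instead of being left to a parser's default pivot. The sketch below is plain Python, not PySpark: `parse_ddmmyy` and its `century` parameter are hypothetical names, and mapping every two-digit year into 2000-2099 is an assumption taken from the example in the question (150545 should become 20450515).

```python
from datetime import date


def parse_ddmmyy(text, century=2000):
    # Hypothetical helper: split ddMMyy by position and add the century explicitly.
    if len(text) != 6 or not text.isdigit():
        raise ValueError(f"expected six digits in ddMMyy order, got {text!r}")
    day, month, year = int(text[0:2]), int(text[2:4]), int(text[4:6])
    # date() validates the calendar date and raises ValueError for e.g. 310245.
    return date(century + year, month, day)


if __name__ == "__main__":
    print(parse_ddmmyy("150545").strftime("%Y%m%d"))
```

If some inputs really belong to the 1900s, a fixed century is the wrong rule, and a cutoff year would have to be chosen from knowledge of the data.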