Questions tagged [aws-glue-spark]
244 questions
1
vote
1 answer
How to convert GUID into integer in pyspark
Hi Stack Overflow fam:
I am new to PySpark and trying to learn as much as I can. For now, I want to convert GUIDs into integers in PySpark. I can currently run the following statement in SQL to convert GUIDs into an…

Sisay
- 29
- 5
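A minimal sketch for the GUID question above, in plain Python. The SQL statement in the excerpt is truncated, so the intended conversion is unknown; this assumes the GUIDs are standard UUID strings. The function names `guid_to_int` and `guid_to_int64` are hypothetical, and the 64-bit folding is one possible choice, not the asker's method.

```python
import uuid


def guid_to_int(guid: str) -> int:
    """Return the full 128-bit integer value of a GUID string."""
    return uuid.UUID(guid).int


def guid_to_int64(guid: str) -> int:
    """Fold a GUID into a signed 64-bit integer (lossy: collisions are possible)."""
    value = uuid.UUID(guid).int
    mask = (1 << 64) - 1
    # XOR the high and low 64-bit halves together (an assumed folding scheme).
    folded = (value >> 64) ^ (value & mask)
    # Reinterpret the unsigned result as a signed 64-bit value.
    return folded - (1 << 64) if folded >= (1 << 63) else folded
```

The full 128-bit value does not fit in a 64-bit Spark `LongType`, which is why the folded variant is shown. Either function could probably be wrapped in a PySpark UDF, though that part is not shown or tested here.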
1
vote
1 answer
NullPointerException in AWS Glue on dataframe_obj.count()
Good day
I am writing a Glue job on AWS to transform data. After doing a join on two sets of data (resulting in a dataframe of around 100MB in size), I get a NullPointerException when retrieving the count on the dataframe. What makes this bug…

Jaco Van Niekerk
- 4,180
- 2
- 21
- 48
1
vote
0 answers
Glue ETL job - Reading data from an on-premises database using a catalog connection
I have a Glue ETL job that writes data to an on-premises PostgreSQL database. I'm unable to find an effective option within the Glue methods to read data from the same database using the JDBC connection.
Below is the existing approach:
Reads data from…

srikanth A
- 11
- 1
1
vote
1 answer
AWS Glue with PySpark - DynamicFrame export to S3 fails partway through with UnsupportedOperationException
I should preface this by saying I've been using AWS Glue Studio to learn how to use Glue with PySpark, and so far it's been going really well. That was until I encountered an error which I cannot understand (let alone solve). An example of the data…

Jamie
- 1,530
- 1
- 19
- 35
1
vote
1 answer
AWS Glue - DynamicFrame with varying schema in json files
Sample:
I have a partitioned table with DDL below in Glue catalog:
CREATE EXTERNAL TABLE `test`(
`id` int,
`data` struct)
PARTITIONED BY (
`partition_0` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'…

Leonid
- 11
- 2
1
vote
2 answers
AWS Glue - Replacing field names containing "." with "_"
I am trying to replace "." with "_" in every field name that contains a ".".
This is what I have:
def apply_renaming_mapping(df):
    """Given a dynamic data frame, if the field contains ., replace with _"""
    # construct renaming mapping…

molly_567
- 113
- 3
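The renaming mapping mentioned in the excerpt above could be built along these lines. This is a sketch in plain Python that works on a list of top-level field names only; the helper name `build_renaming_mapping` is hypothetical, and the asker's actual code is truncated, so this is not a reconstruction of it.

```python
def build_renaming_mapping(field_names):
    """Return (old_name, new_name) pairs for every field name containing a dot."""
    return [
        (name, name.replace(".", "_"))
        for name in field_names
        if "." in name  # leave names without dots out of the mapping
    ]


# Example: only the dotted names produce a rename pair.
pairs = build_renaming_mapping(["user.id", "created", "geo.lat.deg"])
```

Each pair could then be applied one at a time, for example with `rename_field` on a DynamicFrame or `withColumnRenamed` on a Spark DataFrame. Nested struct fields would need separate handling that is not covered here.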
1
vote
0 answers
Missing executor CloudWatch logs for AWS Glue version 2.0 ETL job
I am running a glueetl (Glue version 2.0) job using Python with the logging configuration below. I have continuous logging enabled.
I get INFO entries from the driver present in /aws-glue/jobs/output log group, however there are no INFO entries…

Krzysztof Słowiński
- 6,239
- 8
- 44
- 62
1
vote
2 answers
AWS Glue ETL Spark- string to timestamp
I am trying to convert my CSVs to Parquet via an AWS Glue ETL job. At the same time, I want to convert my datetime column (string) to a timestamp format that Athena can recognize. (Athena recognizes the format yyyy-MM-dd HH:mm:ss.)
I skimmed and applied…

Omur
- 136
- 1
- 7
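To illustrate the target format named in the question above (yyyy-MM-dd HH:mm:ss), here is a plain-Python sketch. The source format of the asker's column is not visible in the excerpt, so the default `source_format` below is purely an assumption, and `to_athena_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

# Python strftime equivalent of the yyyy-MM-dd HH:mm:ss pattern from the question.
ATHENA_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_athena_timestamp(raw: str, source_format: str = "%d/%m/%Y %H:%M") -> str:
    """Parse a datetime string and re-emit it in the Athena-friendly format.

    The default source_format is an assumed example, not the asker's real format.
    """
    parsed = datetime.strptime(raw.strip(), source_format)
    return parsed.strftime(ATHENA_FORMAT)
```

Inside a Glue job, the same idea would more likely be expressed with `pyspark.sql.functions.to_timestamp` and a matching pattern string, so that the Parquet column is written as a real timestamp rather than a reformatted string.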
1
vote
1 answer
add missing column to AWS Glue DataFrame
I am reading a DynamoDB table with Glue. Because the schema is dynamic, some columns may not exist.
Adding them works fine with the following code, but I am not sure how to make the function dynamic if I need to add multiple…

Tobias Bruckert
- 348
- 2
- 12
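One way to make the step above dynamic is to first compute which expected columns are absent, then loop over that list. The sketch below covers only the first half in plain Python; `missing_columns` is a hypothetical helper, and the asker's own code is not shown in the excerpt.

```python
def missing_columns(existing, expected):
    """Return the expected column names that are absent, in expected order."""
    present = set(existing)
    return [name for name in expected if name not in present]


# Example: compare a frame's current columns against the full expected schema.
to_add = missing_columns(["id", "name"], ["id", "email", "name", "phone"])
```

With a Spark DataFrame, `existing` would typically be `df.columns`, and each missing name could be added in a loop with `withColumn` and a null literal cast to the desired type.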
1
vote
0 answers
AWS Glue Connection To RDS Mysql v8
I am connecting AWS Glue to an RDS instance in an external account. This is my code. I have set up VPC peering, opened all TCP ports, and enabled public accessibility. (I have another RDS instance running MySQL v5 in the same VPC in the external account and the Glue…

Haoyu Quan
- 11
- 1
1
vote
0 answers
Running AWS Glue ETL Job (Spark) for large data
Currently, I have a Glue ETL script in Scala.
These are my Glue script settings:
Spark 2.4, Scala 2 (Glue Version 2.0)
Worker type: G.1X (recommended for memory-intensive jobs)
Number of workers: 10
I am reading 60 GB of data in the database…

2shar
- 101
- 1
- 11
1
vote
0 answers
What is the relation between hashpartitions and the number of workers when using from_options in AWS Glue?
I have created a Glue job to read data from Oracle using the code below.
WhereQuery="select * from test where dated>=CURRENT_DATE-4"
connection_oracle11_options = {
"url": URL,
"dbtable": tableName,
"user": USERNAME,
"password":…

Sai
- 1,075
- 5
- 31
- 58
1
vote
1 answer
Is there a more systematic way to resolve a slow AWS Glue + PySpark execution stage?
I have this code snippet that I ran locally in standalone mode using 100 records only:
from awsglue.context import GlueContext
glue_context = GlueContext(sc)
glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table)
df…

pyspark-developer
- 57
- 6
1
vote
1 answer
Is it possible to write each aws glue dynamicrecord to different s3 path
I am new to AWS Glue. I need to write each record in a dynamic frame to a custom folder path in S3.
For example
Following is the target s3 path:
/parentfolder/////.json
Here, 'year', 'month',…

Karthik
- 55
- 1
- 8
1
vote
0 answers
How to create a nested CASE or IF to change data values in a column
I need to update the values in a column in AWS Glue. I'm looking for something like a CASE statement or a nested IF ELSE.
Example:
CASE WHEN dc.activo = 0 OR dc."max" < 100 THEN 'Inactivo'
WHEN dc.estadoRegistro = 0 THEN 'Activo, registro…

Tavo Vega
- 13
- 2