Questions tagged [aws-glue-spark]

244 questions
1
vote
1 answer

How to convert GUID into integer in pyspark

Hi Stack Overflow fam: I am new to PySpark and trying to learn as much as I can. For now, I want to convert GUIDs into integers in PySpark. I can currently run the following statement in SQL to convert GUIDs into an…
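
The SQL statement in the excerpt above is cut off, so the exact integer mapping the asker wants is unknown. As a minimal sketch of one possible reading, a GUID string can be interpreted as a 128-bit integer with Python's standard `uuid` module. The function name `guid_to_int` is hypothetical, and the result does not fit a 64-bit Spark `LongType`, so applying it to a column would need a string or decimal return type.

```python
import uuid


def guid_to_int(guid: str) -> int:
    """Interpret a GUID string as its 128-bit integer value.

    Assumption: the GUID is in a textual form that uuid.UUID accepts,
    for example the hyphenated 8-4-4-4-12 hexadecimal layout.
    """
    return uuid.UUID(guid).int


if __name__ == "__main__":
    # 0x2a in the last group is 42 in decimal.
    print(guid_to_int("00000000-0000-0000-0000-00000000002a"))
```
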
1
vote
1 answer

NullPointerException in AWS Glue on dataframe_obj.count()

Good day. I am writing a Glue job on AWS to transform data. After doing a join on two sets of data (resulting in a DataFrame of around 100 MB), I get a NullPointerException when retrieving the count on the DataFrame. What makes this bug…
Jaco Van Niekerk
1
vote
0 answers

Glue ETL job: reading data from an on-premises database using a catalog connection

I have a Glue ETL job which writes data to an on-premises PostgreSQL database. I'm unable to find an effective option within the Glue methods to read data from the same database using the JDBC connection. Below is the existing approach: Reads data from…
1
vote
1 answer

AWS Glue with PySpark - DynamicFrame export to S3 fails partway through with UnsupportedOperationException

I should preface this by saying I've been using AWS Glue Studio to learn how to use Glue with PySpark, and so far it's been going really well. That was until I encountered an error which I cannot understand (let alone solve). An example of the data…
Jamie
1
vote
1 answer

AWS Glue - DynamicFrame with varying schema in JSON files

Sample: I have a partitioned table with DDL below in Glue catalog: CREATE EXTERNAL TABLE `test`( `id` int, `data` struct) PARTITIONED BY ( `partition_0` string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'…
Leonid
1
vote
2 answers

AWS Glue - Replacing field names containing "." with "_"

I am trying to replace all the fields which have "." within the field name to "_". This is what I have: def apply_renaming_mapping(df): """Given a dynamic data frame, if the field contains ., replace with _""" # construct renaming mapping…
molly_567
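
The renaming idea in the excerpt above (replace "." with "_" in field names) can be sketched without Glue as a pure mapping step. This is only an illustration under stated assumptions: `build_renaming_mapping` is a hypothetical helper, not the asker's truncated `apply_renaming_mapping`, and it takes a plain list of field names rather than a DynamicFrame.

```python
from typing import List, Tuple


def build_renaming_mapping(field_names: List[str]) -> List[Tuple[str, str]]:
    """Return (old_name, new_name) pairs for every field name containing a dot.

    Names without a dot are skipped, so the result lists only the
    renames that actually need to happen.
    """
    return [
        (name, name.replace(".", "_"))
        for name in field_names
        if "." in name
    ]


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    for old, new in build_renaming_mapping(["id", "user.name", "a.b.c"]):
        print(old, "->", new)
```

Each pair could then be passed to whatever rename call the job uses. When a name contains dots, it may need backtick quoting so that it is not read as a nested path.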
1
vote
0 answers

Missing executor CloudWatch logs for AWS Glue version 2.0 ETL job

I am running a glueetl (Glue version 2.0) job using Python with the logging configuration below. I have continuous logging enabled. I get INFO entries from the driver in the /aws-glue/jobs/output log group; however, there are no INFO entries…
1
vote
2 answers

AWS Glue ETL Spark: string to timestamp

I am trying to convert my CSVs to Parquet via an AWS Glue ETL job. At the same time, I want to convert my datetime column (string) to a timestamp format that Athena can recognize. (Athena recognizes yyyy-MM-dd HH:mm:ss.) I skimmed and applied…
Omur
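
The target layout named in the excerpt above, yyyy-MM-dd HH:mm:ss, corresponds to `%Y-%m-%d %H:%M:%S` in Python's `datetime` notation. The sketch below shows only the reformatting step on a single string. The source format of the asker's column is not given, so `source_format` is a parameter and the sample value is an assumption. In a Spark job the equivalent step would more likely use `pyspark.sql.functions.to_timestamp` with a Java-style pattern.

```python
from datetime import datetime

# Python strftime equivalent of the yyyy-MM-dd HH:mm:ss layout.
ATHENA_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_athena_timestamp(value: str, source_format: str) -> str:
    """Parse `value` using `source_format` and re-emit it in ATHENA_FORMAT.

    Raises ValueError if `value` does not match `source_format`.
    """
    return datetime.strptime(value, source_format).strftime(ATHENA_FORMAT)


if __name__ == "__main__":
    # Hypothetical input layout (day/month/year hour:minute), for illustration.
    print(to_athena_timestamp("31/12/2020 23:59", "%d/%m/%Y %H:%M"))
```
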
1
vote
1 answer

Add missing column to AWS Glue DataFrame

I am reading a DynamoDB table with Glue. Because of the dynamic schema, some columns may not exist. Adding them works fine with the following code, but I am not sure how to make the function dynamic if I need to add multiple…
Tobias Bruckert
1
vote
0 answers

AWS Glue connection to RDS MySQL v8

I am connecting AWS Glue to an RDS instance in an external account. This is my code, and I have set up VPC peering, opened all TCP ports, and enabled public accessibility. (I have another RDS instance running MySQL v5 in the same VPC in the external account, and the Glue…
1
vote
0 answers

Running AWS Glue ETL Job (Spark) for large data

Currently, I have a Glue ETL script in Scala. The following are my Glue script settings: Spark 2.4, Scala 2 (Glue version 2.0); worker type: G.1X (recommended for memory-intensive jobs); number of workers: 10. I am reading 60 GB of data in the database…
1
vote
0 answers

What is the relation between hashpartitions and the number of workers when using from_options in AWS Glue?

I have created a Glue job to read data from Oracle using the code below. WhereQuery="select * from test where dated>==CURRENT_DATE-4 connection_oracle11_options = { "url": URL, "dbtable": tableName, "user": USERNAME, "password":…
Sai
1
vote
1 answer

Is there a more systematic way to resolve a slow AWS Glue + PySpark execution stage?

I have this code snippet that I ran locally in standalone mode using 100 records only: from awsglue.context import GlueContext glue_context = GlueContext(sc) glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table) df…
1
vote
1 answer

Is it possible to write each AWS Glue DynamicRecord to a different S3 path?

I am new to AWS Glue. I need to write each record in a dynamic frame to a custom folder path in S3. For example, the following is the target S3 path: /parentfolder/////.json Here, 'year', 'month',…
Karthik
1
vote
0 answers

How to create a nested CASE or IF to change data values in a column

I need to update data values in a column in AWS Glue. I'm looking for something like a CASE statement or a nested IF ELSE. Example: CASE WHEN dc.activo = 0 OR dc."max" < 100 THEN 'Inactivo' WHEN dc.estadoRegistro = 0 THEN 'Activo, registro…