Highest Voted 'aws-glue-spark' Questions

1

vote

1 answer

How to convert a GUID into an integer in PySpark

Hi Stack Overflow fam: I am new to PySpark and trying to learn as much as I can. For now, I want to convert GUIDs into integers in PySpark. I can currently run the following statement in SQL to convert GUIDs into an…

asked May 14 '21 at 15:31

Sisay

29
5

1

vote

1 answer

NullPointerException in AWS Glue on dataframe_obj.count()

Good day. I am writing a Glue job on AWS to transform data. After joining two sets of data (resulting in a dataframe of around 100 MB in size), I get a NullPointerException when retrieving the count on the dataframe. What makes this bug…

amazon-web-services aws-glue-spark

asked May 07 '21 at 08:59

Jaco Van Niekerk

4,180
2
21
48

1

vote

0 answers

Glue ETL job: reading data from an on-premises database using a catalog connection

I have a Glue ETL job which writes data to an on-premises PostgreSQL database. I'm unable to find an effective option within the Glue methods to read data from the same database using the JDBC connection. Below is the existing approach: Reads data from…

python-3.x amazon-web-services pyspark aws-glue aws-glue-spark

asked Apr 23 '21 at 14:31

srikanth A

11
1

1

vote

1 answer

AWS Glue with PySpark - DynamicFrame export to S3 fails partway through with UnsupportedOperationException

I should preface this by saying I've been using AWS Glue Studio to learn how to use Glue with PySpark, and so far it's been going really well. That was until I encountered an error which I cannot understand (let alone solve). An example of the data…

amazon-web-services apache-spark pyspark aws-glue aws-glue-spark

asked Mar 30 '21 at 13:52

Jamie

1,530
1
19
35

1

vote

1 answer

AWS Glue - DynamicFrame with varying schema in json files

Sample: I have a partitioned table with DDL below in Glue catalog: CREATE EXTERNAL TABLE `test`( `id` int, `data` struct) PARTITIONED BY ( `partition_0` string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'…

apache-spark pyspark aws-glue aws-glue-spark

asked Mar 02 '21 at 21:45

Leonid

11
2

1

vote

2 answers

AWS Glue - Replacing field names containing "." with "_"

I am trying to replace "." with "_" in every field name that contains one. This is what I have: def apply_renaming_mapping(df): """Given a dynamic data frame, if the field contains ., replace with _""" # construct renaming mapping…

python aws-glue aws-glue-spark

asked Mar 01 '21 at 22:50

molly_567

113
3

1

vote

0 answers

Missing executor CloudWatch logs for AWS Glue version 2.0 ETL job

I am running a glueetl (Glue version 2.0) job using Python with the below configuration for logging. I have continuous logging enabled. I get INFO entries from the driver in the /aws-glue/jobs/output log group, however there are no INFO entries…

amazon-web-services amazon-cloudwatch aws-glue amazon-cloudwatchlogs aws-glue-spark

asked Feb 18 '21 at 16:54

Krzysztof Słowiński

6,239
8
44
62

1

vote

2 answers

AWS Glue ETL Spark: string to timestamp

I am trying to convert my CSVs to Parquet via an AWS Glue ETL job. At the same time, I want to convert my datetime column (string) to a timestamp format that Athena can recognize. (Athena recognizes yyyy-MM-dd HH:mm:ss.) I skimmed and applied…

parquet aws-glue string-to-datetime aws-glue-spark

asked Feb 12 '21 at 10:34

Omur

136
1
7

1

vote

1 answer

Add missing column to AWS Glue DataFrame

I am reading a DynamoDB table with Glue. Because of the dynamic schema, some columns may not exist. Adding them works fine with the following code, but I am not sure how to make the function dynamic if I need to add multiple…

aws-glue pyspark aws-glue-spark

asked Feb 08 '21 at 13:35

Tobias Bruckert

348
2
12

1

vote

0 answers

AWS Glue connection to RDS MySQL v8

I am connecting AWS Glue to an RDS instance in an external account. This is my code, and I have set up VPC peering, opened all TCP ports and enabled public accessibility. (I have another RDS instance running MySQL v5 in the same VPC in the external account and the Glue…

python mysql amazon-web-services aws-glue aws-glue-spark

asked Feb 04 '21 at 05:28

Haoyu Quan

11
1

1

vote

0 answers

Running AWS Glue ETL Job (Spark) for large data

Currently, I have a Glue ETL script in Scala. These are my Glue script settings: Spark 2.4, Scala 2 (Glue version 2.0); worker type: G.1X (recommended for memory-intensive jobs); number of workers: 10. I am reading 60 GB data in the database…

amazon-web-services scala apache-spark aws-glue aws-glue-spark

asked Feb 03 '21 at 23:41

2shar

101
1
11

1

vote

0 answers

What is the relation between hashpartitions and the number of workers when using from_options in AWS Glue?

I have created a Glue job to read data from Oracle using the code below. WhereQuery="select * from test where dated>==CURRENT_DATE-4 connection_oracle11_options = { "url": URL, "dbtable": tableName, "user": USERNAME, "password":…

oracle amazon-web-services amazon-s3 aws-glue aws-glue-spark

asked Feb 03 '21 at 15:24

Sai

1,075
5
31
58

1

vote

1 answer

Is there a more systematic way to resolve a slow AWS Glue + PySpark execution stage?

I have this code snippet that I ran locally in standalone mode using 100 records only: from awsglue.context import GlueContext glue_context = GlueContext(sc) glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table) df…

apache-spark pyspark aws-glue aws-glue-spark spark-ui

asked Jan 27 '21 at 02:22

pyspark-developer

57
6

1

vote

1 answer

Is it possible to write each AWS Glue DynamicRecord to a different S3 path?

I am new to AWS Glue. I need to write each record in a dynamic frame to a custom folder path in S3. For example, the following is the target S3 path: /parentfolder/////.json Here, 'year', 'month',…

pyspark aws-glue aws-glue-spark

asked Jan 21 '21 at 13:32

Karthik

55
1
8

1

vote

0 answers

How to create a nested CASE or IF to change data values in a column

I need to update data values in a column in AWS Glue. I'm looking for something like a CASE statement or nested IF ELSE. Example: CASE WHEN dc.activo = 0 OR dc."max" < 100 THEN 'Inactivo' WHEN dc.estadoRegistro = 0 THEN 'Activo, registro…

aws-glue aws-glue-data-catalog aws-glue-spark aws-glue-workflow

asked Dec 29 '20 at 00:40

Tavo Vega

13
2
