Highest Voted 'aws-glue-spark' Questions

0

votes

1 answer

AWS Glue E2E implementation and architecture

We are currently migrating from a legacy on-premises data warehousing solution (based on IBM DataStage) to a cloud-based solution. We have files with incremental information from heterogeneous sources which we need to load into our…

aws-glue aws-glue-spark

asked May 24 '21 at 10:10

dashsidd1

0

votes

1 answer

AWS Glue ETL to Redshift: DATE

amazon-web-services amazon-redshift aws-glue aws-glue-spark

asked May 14 '21 at 17:25

TechNewbie

0

votes

1 answer

NoClassDefFoundError: scala/Product$class when calling AWS Glue libraries from Scala code locally; the same jar works as a Glue job on AWS

I am using Spark with Scala, along with the AWS Glue libraries for a Glue script. With Scala version 2.12 I get this error. Error with version 2.12: import com.amazonaws.services.glue.{DataSource, DynamicFrame,…

scala apache-spark aws-glue aws-glue-spark

asked May 12 '21 at 10:33

Abhishek Kumar

0

votes

0 answers

AWS Glue - facing memory issue in join and storing it in S3

I have around 200 compressed JSON Lines files, each of 500MB, stored in S3. Each JSON line has empid, empname and an array column which holds raw resume text content. Each file has 2500 JSON lines. I'm creating a dynamic frame, f1 =…

pyspark aws-glue aws-glue-spark

asked May 04 '21 at 08:37

Sathishkumar Jayaraj

0

votes

1 answer

Using arguments with Glue pyspark

Intro: I have a Docker container configured with a Glue ETL PySpark environment, thanks to this AWS Glue tutorial. I used the "hellowrold.py": import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions from pyspark.context import…

python pyspark aws-glue spark-submit aws-glue-spark

asked Apr 29 '21 at 14:56

Jérémy

0

votes

1 answer

PySpark SQL - Nested array conditional select into a new column

python apache-spark pyspark apache-spark-sql aws-glue-spark

asked Apr 27 '21 at 19:19

akpatch

0

votes

1 answer

AWS Glue: drop partition using Spark SQL

Dropping a partition from Glue metadata using Spark SQL throws errors, while the same code works in the Hive shell. **Hive shell** hive> alter table prc_db.detl_stg drop IF EXISTS partition(prc_name="dq") ; OK Time taken: 1.013 seconds **spark…

amazon-emr aws-glue-data-catalog aws-glue-spark

asked Apr 26 '21 at 14:06

user3858193

0

votes

1 answer

How to subtract a week from Rundate in AWS Glue PySpark

I have a scenario where a rundate value is passed to an AWS Glue job in 'YYYY-MM-DD' format, let's say 2021-04-19. I am reading this rundate as 'datetime.strptime(rundate, "%y-%m-%d")'. Now I want to create 2 variables out of it…

pyspark apache-spark-sql aws-glue aws-glue-data-catalog aws-glue-spark

asked Apr 22 '21 at 20:11

Beginner
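
For the question above, a minimal sketch of subtracting one week from a 'YYYY-MM-DD' run date using only the Python standard library. The function name `week_before` is hypothetical, and since the excerpt is truncated, this covers only the "one week earlier" part named in the title, not whatever two variables the asker goes on to describe. It assumes the four-digit-year directive `%Y`, which matches a value such as 2021-04-19; `%y` expects a two-digit year.

```python
from datetime import datetime, timedelta


def week_before(rundate: str) -> str:
    """Return the date one week before `rundate`, both as 'YYYY-MM-DD' strings.

    `week_before` is a hypothetical helper name used for illustration only.
    """
    # %Y parses a four-digit year such as 2021.
    parsed = datetime.strptime(rundate, "%Y-%m-%d")
    # timedelta handles month and year boundaries automatically.
    return (parsed - timedelta(weeks=1)).strftime("%Y-%m-%d")


print(week_before("2021-04-19"))  # 2021-04-12
```

In a Glue job the `rundate` string would typically come from the job arguments; how it is passed in is not shown in the excerpt.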

0

votes

1 answer

Glue - Bookmark doesn't recognize files in newer partitions

I have a Glue job that reads from an S3 bucket, does transformations, and uploads the result to another S3 bucket. Here's what my aws glue get-job-bookmark --job-name xx returns JobBookmark":…

amazon-web-services amazon-s3 aws-glue aws-glue-data-catalog aws-glue-spark

asked Apr 20 '21 at 23:47

user3726933

0

votes

1 answer

Remove last delimiter from a .TXT file in PySpark

I have an S3 file generated by a different system, which looks like this: A1|~|B1|~|C1|~|D1|~| A2|~|B2|~|C2|~|D2|~| A3|~|B3|~|C3|~|D3|~| A4|~|B4|~|C4|~|D4|~| Now, while reading this file in an AWS Glue PySpark script, I want to remove the last…

amazon-web-services amazon-s3 pyspark aws-glue aws-glue-spark

asked Apr 17 '21 at 12:33

Beginner
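
For the question above, a minimal plain-Python sketch of stripping one trailing `|~|` delimiter from a record before splitting it into fields. The helper name `split_record` and the `DELIMITER` constant are hypothetical. The excerpt is truncated, so this assumes the goal is to avoid the empty trailing field that a naive split would produce.

```python
from typing import List

# Multi-character delimiter taken from the sample rows in the question.
DELIMITER = "|~|"


def split_record(line: str, delimiter: str = DELIMITER) -> List[str]:
    """Split one record into fields, ignoring a single trailing delimiter.

    `split_record` is a hypothetical helper name used for illustration only.
    """
    # Drop line terminators first so the trailing delimiter is really at the end.
    line = line.rstrip("\r\n")
    if line.endswith(delimiter):
        line = line[: -len(delimiter)]
    return line.split(delimiter)


print(split_record("A1|~|B1|~|C1|~|D1|~|"))  # ['A1', 'B1', 'C1', 'D1']
```

In a Glue PySpark script the same function could plausibly be applied to each line of the file, for example through an RDD `map`, though that wiring is an assumption and is not shown here.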

0

votes

2 answers

Matching up arrays in PySpark

I am trying to manipulate two dataframes using PySpark as part of an AWS Glue job. df1: item tag 1 AB 2 CD 3 EF 4 QQ df2: key1 key2 tags A1 B1 [AB] A1 B2 [AB, CD, EF] A2 B1 [CD, EF] A2…

apache-spark pyspark apache-spark-sql aws-glue aws-glue-spark

asked Mar 19 '21 at 07:56

Jaco Van Niekerk

0

votes

1 answer

Consolidating Many Data Files into One Using Glue - Job Succeeds But Without Output Files

TL;DR: I'm trying to consolidate many S3 data files into fewer files using a Glue [Studio] job. Input data is catalogued in Glue and queryable via Athena. The Glue job runs with "Succeeded" output status, but no output files are created. Details: Input…

amazon-web-services aws-glue aws-glue-data-catalog aws-glue-spark

asked Mar 13 '21 at 18:40

Matt

0

votes

1 answer

Is it possible to stream AWS CloudWatch logs

I know it is possible to stream CloudWatch Logs data to Amazon Elasticsearch Service; it is documented here. But is it possible to stream the logs data to a custom AWS Glue job, or to an EMR job?

amazon-web-services aws-glue amazon-cloudwatchlogs aws-glue-spark

asked Mar 10 '21 at 02:30

Jimson James

0

votes

1 answer

How do I save a machine learning model (KMeans) to S3 from a Glue ETL job written in PySpark?

I tried model.save(sc, path) and it gives me an error: TypeError: save() takes 2 positional arguments but 3 were given. Here sc is the SparkContext [sc = SparkContext()]. I tried without sc in the signature but got this error: An error occurred while…

amazon-web-services amazon-s3 etl aws-glue aws-glue-spark

asked Mar 08 '21 at 17:05

Manaal Soni

0

votes

1 answer

Run Glue job only when data is updated

I have a Glue job that transfers data from S3 to Redshift. I want to schedule it so that it runs every time the data in S3 is re-uploaded or updated. How can I do so? I tried the code solution here and made a Lambda function: How to Trigger Glue…

amazon-web-services amazon-s3 aws-lambda aws-glue aws-glue-spark

asked Mar 03 '21 at 17:30

user13067694
