Questions tagged [aws-glue-spark]

244 questions
0
votes
1 answer

AWS Glue E2E implementation and architecture

We are currently migrating from a legacy on-premises data warehousing solution [based on IBM DataStage] to a cloud-based solution. We have files with incremental information from heterogeneous sources which we need to load into our…
dashsidd1
0
votes
1 answer

AWS Glue ETL to Redshift: DATE

I am using AWS Glue to ETL data to Redshift. I have been encountering an issue where my date loads as null in Redshift. What I have set up: upload a CSV into S3, see sample data: item | color | price | date shirt| brown | 25.05 |…
0
votes
1 answer

NoClassDefFoundError: scala/Product$class while calling AWS Glue libraries in Scala code locally; the same jar works as a Glue job on AWS

I am using Spark with Scala, and I am also using the AWS Glue libraries for my Glue script. When I use Scala version 2.12 I get this error. Error with version 2.12: import com.amazonaws.services.glue.{DataSource, DynamicFrame,…
0
votes
0 answers

AWS Glue - facing memory issue when joining and storing the result in S3

I have around 200 compressed JSON Lines files, each 500MB, stored in S3. Each JSON line has empid, empname and an array column which holds raw resume text content. Each file has 2500 JSON lines. I'm creating a dynamic frame, f1 =…
0
votes
1 answer

Using arguments with Glue PySpark

Intro: I have Docker configured with a Glue ETL PySpark environment, thanks to this AWS Glue tutorial. I used the "hellowrold.py": import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions from pyspark.context import…
Jérémy
0
votes
1 answer

PySpark SQL - Nested array conditional select into a new column

I have the following schema: root |-- event_params: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- key: string (nullable = true) | | |-- value: struct (nullable = true) | | | |-- string_value: string (nullable =…
0
votes
1 answer

AWS Glue: drop partition using Spark SQL

Dropping a partition using Spark SQL from Glue metadata throws errors, while the same code works in the Hive shell. **Hive shell** hive> alter table prc_db.detl_stg drop IF EXISTS partition(prc_name="dq") ; OK Time taken: 1.013 seconds **spark…
user3858193
0
votes
1 answer

How to subtract a week from Rundate in AWS Glue PySpark

I have a scenario where a rundate value is passed to an AWS Glue job in 'YYYY-MM-DD' format, let's say 2021-04-19. I am reading this rundate as 'datetime.strptime(rundate, "%y-%m-%d")'. But now I want to create 2 variables out of it…
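A minimal sketch of the subtraction the title asks about, assuming the run date arrives as a `YYYY-MM-DD` string. The helper name `week_before` is hypothetical, and the two variables mentioned in the excerpt are elided, so only the one-week offset is shown. Note that a four-digit year needs `%Y`; `%y` matches a two-digit year and would raise `ValueError` on `2021-04-19`.

```python
from datetime import datetime, timedelta


def week_before(rundate: str) -> str:
    """Return the date one week before `rundate`, both as YYYY-MM-DD strings.

    `week_before` is an illustrative name, not part of any Glue API.
    """
    # %Y parses a four-digit year; %y would only accept two digits.
    parsed = datetime.strptime(rundate, "%Y-%m-%d")
    return (parsed - timedelta(weeks=1)).strftime("%Y-%m-%d")


if __name__ == "__main__":
    print(week_before("2021-04-19"))
```

In a Glue job the string would typically come from the job arguments; the function itself has no Glue or Spark dependency, so it can be checked locally first.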
0
votes
1 answer

Glue - Bookmark doesn't recognize files in newer partitions

I have a Glue job that reads from an S3 bucket, applies transformations, and uploads the result to another S3 bucket. Here's what my aws glue get-job-bookmark --job-name xx returns JobBookmark":…
0
votes
1 answer

Remove last delimiter from a .TXT file in PySpark

I have an S3 file generated by a different system, which looks like this: A1|~|B1|~|C1|~|D1|~| A2|~|B2|~|C2|~|D2|~| A3|~|B3|~|C3|~|D3|~| A4|~|B4|~|C4|~|D4|~| Now while reading this file in an AWS Glue PySpark script, I want to remove the last…
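One way to handle the trailing `|~|` is to strip it per line before splitting. The sketch below is plain Python under the assumption that each record sits on its own line, as the sample suggests; `strip_trailing_delimiter` and `split_fields` are hypothetical helper names, not Glue or PySpark functions.

```python
DELIM = "|~|"  # multi-character delimiter taken from the sample rows


def strip_trailing_delimiter(line: str, delim: str = DELIM) -> str:
    """Drop one trailing delimiter (and any line ending) if present."""
    line = line.rstrip("\r\n")
    if line.endswith(delim):
        return line[: -len(delim)]
    return line


def split_fields(line: str, delim: str = DELIM) -> list:
    """Split a record into fields after removing the trailing delimiter."""
    return strip_trailing_delimiter(line, delim).split(delim)


if __name__ == "__main__":
    for row in ["A1|~|B1|~|C1|~|D1|~|", "A2|~|B2|~|C2|~|D2|~|"]:
        print(split_fields(row))
```

In a Spark job, a function like `split_fields` could be applied to each line of a text RDD (for example via `map`), assuming the file is read as raw text rather than through a CSV reader.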
0
votes
2 answers

Matching up arrays in PySpark

I am trying to manipulate two dataframes using PySpark as part of an AWS Glue job. df1: item tag 1 AB 2 CD 3 EF 4 QQ df2: key1 key2 tags A1 B1 [AB] A1 B2 [AB, CD, EF] A2 B1 [CD, EF] A2…
0
votes
1 answer

Consolidating Many Data Files into One Using Glue - Job Succeeds But Without Output Files

TL;DR: I'm trying to consolidate many S3 data files into fewer files using a Glue [Studio] job. Input data is catalogued in Glue and queryable via Athena. The Glue job runs with "Succeeded" output status, but no output files are created. Details: Input…
0
votes
1 answer

Is it possible to stream AWS CloudWatch logs

I know it is possible to stream CloudWatch Logs data to Amazon Elasticsearch Service. It is documented here. But is it possible to stream the log data to a custom AWS Glue job, or to an EMR job?
0
votes
1 answer

How do I save a machine learning model (KMeans) in S3 from a Glue ETL job written in PySpark?

I tried model.save(sc, path) and it gives me this error: TypeError: save() takes 2 positional arguments but 3 were given. Here sc is the SparkContext [sc = SparkContext()]. I tried without sc in the signature but got this error: An error occurred while…
0
votes
1 answer

Run Glue job only when data is updated

I have a Glue job that transfers data from S3 to Redshift. I want to schedule it so that it runs every time the data in S3 is re-uploaded or updated. How can I do so? I tried the code solution here and made a Lambda function: How to Trigger Glue…
user13067694