Questions tagged [aws-glue-spark]

244 questions
0
votes
1 answer

SQL Error [XX000]: ERROR: Spectrum Scan Error: DeltaManifest

We have implemented Delta Lake but hit one issue: a table can be created and ingested, but after new data has been ingested, we get a Spectrum scan error: SQL Error [XX000]: ERROR: Spectrum Scan Error: DeltaManifest Detail: error: Spectrum…
Frank Tao
0
votes
0 answers

Unable to write csv file to S3

sc = SparkContext() glueContext = GlueContext(sc) spark = glueContext.spark_session df = spark.read.csv("s3://bucket1/file1.csv", header=True) df.show(5) df.write.mode("overwrite").csv("s3://bucket1/file2.csv", header=True) The write to S3 does…
sheetal_158
0
votes
0 answers

AWS Glue - glueContext.purge_table leads to "No such file or directory 's3://abc..."

I'm using AWS Glue, and I want to overwrite a Glue catalog with a Glue job. During my Glue job, I call glueContext.purge_table(glue_database, glue_table, options={"retentionPeriod": 0}) My next line is me trying to write out the current dataframe…
0
votes
1 answer

How can I optimize the read from S3?

dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options( connection_type="s3", format="csv", connection_options={ "paths": ["s3://somefile.csv/"], 'recurse':True, 'groupFiles': 'inPartition', 'groupSize':…
sheetal_158
0
votes
0 answers

Write AWS Glue DynamicFrame to redshift table

I have a dynamic frame with the following schema: root |-- source_id: long |-- scrape_timestamp_last: timestamp |-- scrap_timestamp_orig: timestamp |-- job_id_init: string |-- post_date: timestamp |-- date_posted: string |-- date_offset: int |--…
Gyan Joshi
0
votes
1 answer

How to execute REST API call for Glue Dynamic Frame

I need to build a Glue Spark application that transforms raw events and then calls a REST API to push the transformed data. I am using a Glue DynamicFrame to transform the raw events, but I am not able to execute the REST API call. Is there a way to execute REST API or…
PB22
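
One possible direction for this question, offered as a sketch rather than a confirmed answer: convert the DynamicFrame to a Spark DataFrame with `toDF()` and push rows from each partition in batches. The helper names `batched` and `push_partition`, the `send` callable, and the batch size are all hypothetical; `send` stands in for whatever HTTP client call the real endpoint needs.

```python
from typing import Callable, Iterable, Iterator, List


def batched(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` records from any iterable."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def push_partition(
    records: Iterable[dict],
    send: Callable[[List[dict]], None],
    batch_size: int = 100,
) -> int:
    """Call `send` once per batch and return the number of calls made.

    `send` is a hypothetical placeholder for the actual REST request.
    """
    calls = 0
    for batch in batched(records, batch_size):
        send(batch)
        calls += 1
    return calls
```

Inside a Glue job this might be wired up as `dyf.toDF().foreachPartition(lambda rows: push_partition((r.asDict() for r in rows), send))`, assuming `send` can be serialized to the executors and the executors have network access to the endpoint.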
0
votes
1 answer

AWS GLUE SQL join with single row from right table

I'm trying to join two datasets in AWS Glue. Table 1 (alias af): id data created 1 string 1 2020-02-10 2 string 2 2020-02-11 3 string 3 2020-02-12 Table 2 (alias mp): id data data2 created foreign_key 1 string 1 json…
0
votes
1 answer

Environment for print Capture on AWS GLUE

Where can I see, for example, the prints that are written in my AWS GLUE script? Like a terminal screen that shows me the messages that were stored in a print. I need to print the schema being generated for my data output and see if it matches what…
0
votes
1 answer

Show Method for Dynamic Frame in AWS glue returns empty field

When I call dyF.show(), it returns an empty field, even though I checked the schema and count() and I know the table is populated. After converting it to a Spark DataFrame, the show() method works fine. I know that this has happened to…
0
votes
2 answers

How to parse nested column for CSV data in Pyspark?

I am working on a database where the data is stored in csv format. The DB looks like the following: id containertype size 1 CASE {height=2.01, length=1.07, width=1.22} 2 PALLET {height=1.80, length=1.07, width=1.23} I want to parse the…
biswas N
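
As an illustration of the parsing this question asks about, here is a minimal plain-Python sketch. It assumes every `size` value is a brace-wrapped, comma-separated list of `key=number` pairs, as in the two sample rows; the function name `parse_size` is hypothetical.

```python
def parse_size(raw: str) -> dict:
    """Parse '{height=2.01, length=1.07, width=1.22}' into a dict of floats."""
    inner = raw.strip().strip("{}")  # drop surrounding whitespace and braces
    result = {}
    for pair in inner.split(","):
        if not pair.strip():
            continue  # tolerate an empty '{}' value
        key, _, value = pair.partition("=")
        result[key.strip()] = float(value.strip())
    return result


sample = "{height=2.01, length=1.07, width=1.22}"
parsed = parse_size(sample)
print(parsed["height"], parsed["length"], parsed["width"])
```

In PySpark the same logic could be wrapped in a UDF, although built-in column functions usually avoid the Python serialization overhead and may be a better fit for large tables.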
0
votes
0 answers

Convert Nested JSON Schema to PySpark Schema

I have a schema which has nested fields. When I try to convert it with: jtopy=json.dumps(schema_message['SchemaDefinition']) #json.dumps takes a dictionary as input and returns a string as output. print(jtopy) …
0
votes
0 answers

Read from Glue catalog

I'm trying to get the schema from the Glue catalog in AWS Glue Studio, but the job keeps running and never returns. This is the code: from pyspark.context import SparkContext from awsglue.context import GlueContext from awsglue.dynamicframe import…
Alan Mil
0
votes
1 answer

Spark: Best way to join normal size Dataframe with very large Dataframe

I have DF1 with ~50k records. DF2 has more than 5 billion records read from Parquet on S3. I need to do a left outer join on an md5 hash present in both DFs, but as expected it's slow and expensive. I tried a broadcast join, but DF1 is quite big as well. I was wondering what would be…
OneWorld
0
votes
1 answer

AWS Glue Exclude Patterns

I am working on a project which is using Glue 3.0 & PySpark to process large amounts of data between S3 buckets. This is being achieved using GlueContext.create_dynamic_frame_from_options to read the data from an S3 bucket to a DynamicFrame using…
0
votes
1 answer

Issue developing AWS Glue ETL jobs locally using a Docker container

I am using an Apple M1 Pro Mac and trying to use a Docker container to develop AWS Glue jobs locally instead of using the AWS Console. I have been working through this blog post by AWS and I have pulled amazon/aws-glue-libs:glue_libs_3.0.0_image_01 from…