Questions tagged [aws-glue-spark]
244 questions
0 votes · 1 answer
SQL Error [XX000]: ERROR: Spectrum Scan Error: DeltaManifest
We have implemented Delta Lake, but hit one issue: a table can be created and ingested initially, but after new data has been ingested we get a Spectrum scan error:
SQL Error [XX000]: ERROR: Spectrum Scan Error: DeltaManifest
Detail:
error: Spectrum…
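A frequent cause of this particular Spectrum error is a stale manifest: Redshift Spectrum reads the table through its symlink manifest files, and Delta Lake does not refresh those on new writes unless told to. A hedged sketch of the two usual fixes, with a hypothetical table path (each statement would be run via `spark.sql(...)` in the Glue job):

```python
# Hedged sketch (table path is hypothetical): refresh the manifest
# Spectrum reads once, and/or keep it auto-updated on every write.
regenerate = (
    "GENERATE symlink_format_manifest "
    "FOR TABLE delta.`s3://my-bucket/my-delta-table/`"
)
auto_update = (
    "ALTER TABLE delta.`s3://my-bucket/my-delta-table/` "
    "SET TBLPROPERTIES "
    "(delta.compatibility.symlinkFormatManifest.enabled = true)"
)
```

With the table property set, subsequent ingests keep the manifest in sync, so the external table should stop erroring after new data lands.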

Frank Tao · 1

0 votes · 0 answers
Unable to write CSV file to S3
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
df = spark.read.csv("s3://bucket1/file1.csv", header=True)
df.show(5)
df.write.mode("overwrite").csv("s3://bucket1/file2.csv", header=True)
The write to S3 does…
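One thing worth checking here: Spark's CSV writer produces a directory of part files, not a single object, so a path like `file2.csv` becomes a prefix. A hedged sketch of the usual adjustment (bucket name from the question, prefix and Spark calls are assumptions):

```python
# Hedged sketch: Spark writes CSV output as a directory of part-*.csv
# objects, so target a prefix rather than a literal file name.
output_prefix = "s3://bucket1/file2/"   # a directory, not one object

# In the job (needs a live SparkSession, not runnable here):
#   df.coalesce(1).write.mode("overwrite").csv(output_prefix, header=True)
# coalesce(1) yields a single part file inside the prefix; the job's
# IAM role also needs s3:PutObject on the bucket for the write itself.
```

If the write fails silently, the S3 permissions on the role are the other usual suspect, since the read-only call above it would still succeed.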

sheetal_158 · 7,391

0 votes · 0 answers
AWS Glue - glueContext.purge_table leads to "No such file or directory 's3://abc..."
I'm using AWS Glue, and I want to overwrite a Glue catalog table from a Glue job. During the job, I call
glueContext.purge_table(glue_database, glue_table, options={"retentionPeriod": 0})
My next line tries to write out the current DataFrame…
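An alternative worth trying (hedged; bucket and prefix below are hypothetical) is to purge the table's S3 location directly with `purge_s3_path` and then write, rather than purging through the catalog; note that `retentionPeriod` is a number of hours, with 0 making every object eligible for deletion:

```python
# Hedged sketch (path is hypothetical): purge the data location
# itself, then rewrite it in the same job.
table_path = "s3://my-bucket/my-table/"
purge_options = {"retentionPeriod": 0}  # hours; 0 = delete everything

# In the job (requires a GlueContext):
#   glueContext.purge_s3_path(table_path, options=purge_options)
#   glueContext.write_dynamic_frame.from_options(
#       frame=dyf, connection_type="s3",
#       connection_options={"path": table_path}, format="parquet")
```

Writing back to the same prefix the purge just emptied avoids the writer looking for catalog-tracked files that no longer exist.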

Black Dynamite · 4,067

0 votes · 1 answer
How can I optimize the read from S3?
dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": …
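When the prefix holds many small files, Glue's file grouping is the main lever: it collapses small objects into larger read tasks. A hedged sketch of a complete options dict (path and size are assumptions; note `groupSize` is a byte count passed as a string):

```python
# Hedged sketch: group many small CSVs into ~128 MB read tasks
# (prefix and size are hypothetical tuning values).
connection_options = {
    "paths": ["s3://my-bucket/csv-data/"],
    "recurse": True,
    "groupFiles": "inPartition",
    "groupSize": str(128 * 1024 * 1024),  # bytes, as a string
}
```

This dict would be passed as `connection_options` to `glueContext.create_dynamic_frame.from_options(connection_type="s3", format="csv", ...)`; larger groups mean fewer, better-utilized tasks when reading from S3.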

sheetal_158 · 7,391

0 votes · 0 answers
Write AWS Glue DynamicFrame to redshift table
I have a dynamic frame with the following schema
root
|-- source_id: long
|-- scrape_timestamp_last: timestamp
|-- scrap_timestamp_orig: timestamp
|-- job_id_init: string
|-- post_date: timestamp
|-- date_posted: string
|-- date_offset: int
|--…
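A DynamicFrame with this shape can be written to Redshift via a JDBC write through a Glue catalog connection; Redshift loads stage through S3, so a temp dir is required. A hedged sketch of the option values (connection, database, table, and bucket names are all assumptions):

```python
# Hedged sketch: options for write_dynamic_frame.from_jdbc_conf
# (every name here is hypothetical).
redshift_options = {
    "dbtable": "public.scraped_posts",
    "database": "dev",
}
redshift_tmp_dir = "s3://my-temp-bucket/redshift-staging/"

# In the job (requires a GlueContext and a Glue connection "my-redshift"):
#   glueContext.write_dynamic_frame.from_jdbc_conf(
#       frame=dyf, catalog_connection="my-redshift",
#       connection_options=redshift_options,
#       redshift_tmp_dir=redshift_tmp_dir)
```

The timestamp and long columns in the schema above map to Redshift TIMESTAMP and BIGINT automatically during the COPY that Glue issues behind the scenes.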

Gyan Joshi · 11

0 votes · 1 answer
How to execute REST API call for Glue Dynamic Frame
I need to build a Glue Spark application to transform raw events and then execute a REST API call to push the transformed data. I am using a Glue DynamicFrame to transform the raw events but am not able to execute the REST API call. Is there a way to execute a REST API or…
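There is no built-in REST sink for a DynamicFrame, but the converted DataFrame can post from each partition via `foreachPartition`. A hedged sketch (the endpoint URL and batch size are assumptions; the HTTP sender is injectable so the batching logic can be exercised without a network):

```python
import json
import urllib.request

API_URL = "https://example.com/events"  # hypothetical endpoint

def send_batch(batch, post=None):
    """POST one JSON batch; `post` is injectable for offline testing."""
    body = json.dumps(batch).encode("utf-8")
    if post is None:
        req = urllib.request.Request(
            API_URL, data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status
    return post(body)

def post_partition(rows, batch_size=100, post=None):
    """Runs once per Spark partition via foreachPartition; batches
    rows so each partition issues few HTTP requests."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            send_batch(batch, post=post)
            batch = []
    if batch:
        send_batch(batch, post=post)
```

In the job this would be wired up roughly as `transformed_dyf.toDF().foreachPartition(lambda rows: post_partition([r.asDict() for r in rows]))`, keeping the HTTP work on the executors rather than the driver.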

PB22 · 31

0 votes · 1 answer
AWS Glue SQL join with a single row from the right table
I'm trying to join two datasets in AWS Glue.
Table 1 (alias af):

| id | data     | created    |
|----|----------|------------|
| 1  | string 1 | 2020-02-10 |
| 2  | string 2 | 2020-02-11 |
| 3  | string 3 | 2020-02-12 |

Table 2 (alias mp):

| id | data     | data2 | created | foreign_key |
|----|----------|-------|---------|-------------|
| 1  | string 1 | json… |         |             |
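One common way to get a single mp row per af row (a hedged sketch; it assumes "single" means the newest by `created`, using the question's aliases) is a window filter, shown here as Spark SQL alongside a pure-Python model of what the filter keeps:

```python
# Hedged sketch: keep only the newest mp row per foreign_key before
# the join (assumes "single row" means latest by created).
dedup_sql = """
SELECT af.*, mp.*
FROM af
LEFT JOIN (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY foreign_key ORDER BY created DESC
    ) AS rn
    FROM mp
) mp ON mp.foreign_key = af.id AND mp.rn = 1
"""

def newest_per_key(rows):
    """Pure-Python model of what rn = 1 selects: the most recent
    row for each foreign_key."""
    best = {}
    for row in rows:
        key = row["foreign_key"]
        if key not in best or row["created"] > best[key]["created"]:
            best[key] = row
    return best
```

The SQL would be run with `spark.sql(dedup_sql)` after registering both frames as temp views; putting `mp.rn = 1` in the ON clause keeps af rows that have no match.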

darkCoffy · 103

0 votes · 1 answer
Environment for print capture on AWS Glue
Where can I see, for example, the output of the print statements in my AWS Glue script? Something like a terminal screen that shows the messages passed to each print. I need to print the schema being generated for my data output and see if it matches what…
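Anything a Glue script writes to stdout lands in CloudWatch Logs under the job run's output stream (the `/aws-glue/jobs/output` log group on older Glue versions; continuous logging uses a different group). A hedged sketch using a stdlib logger, which reaches the same place with levels and timestamps:

```python
# Hedged sketch: stdout from a Glue script is captured in CloudWatch
# Logs, so a stdlib logger is enough for schema debugging.
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("glue-job")
log.info("about to inspect output schema")
# In the job, dyf.printSchema() / df.printSchema() output appears in
# the same CloudWatch stream as these messages.
```

In Glue Studio, the same stream is reachable from the job run's "Output logs" link, so the printed schema can be checked against the expected one there.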

Rafael Souza · 15

0 votes · 1 answer
show() method for DynamicFrame in AWS Glue returns an empty field
When I try dyF.show() it returns an empty field, even though I checked the schema and count() and I know the table is populated. When I transform it into a Spark DataFrame, the show() method works fine.
I know that this has happened to…
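Since the question already confirms the DataFrame path works, a small helper makes that workaround routine. A hedged sketch (the helper name is hypothetical):

```python
def show_dyf(dyf, n=20):
    """Print rows via the Spark DataFrame path; DynamicFrame.show()
    can silently print nothing on some Glue versions, while
    toDF().show() is dependable."""
    dyf.toDF().show(n)
```

Calling `show_dyf(my_dynamic_frame, 10)` inside the job prints the first ten rows through the DataFrame API regardless of the DynamicFrame quirk.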

Stef Kostov · 11

0 votes · 2 answers
How to parse a nested column in CSV data in PySpark?
I am working on a database where the data is stored in CSV format. The DB looks like the following:
| id | containertype | size                                   |
|----|---------------|----------------------------------------|
| 1  | CASE          | {height=2.01, length=1.07, width=1.22} |
| 2  | PALLET        | {height=1.80, length=1.07, width=1.23} |

I want to parse the…
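The `size` column is not valid JSON (`=` instead of `:`, unquoted keys), so `from_json` will not parse it directly. A hedged sketch of the string logic, which in PySpark would run inside a UDF returning a map:

```python
# Hedged sketch: parse the brace-delimited "key=value" cell into a
# dict of floats (column format taken from the question's table).
def parse_dimensions(cell):
    """'{height=2.01, length=1.07, width=1.22}'
    -> {'height': 2.01, 'length': 1.07, 'width': 1.22}"""
    out = {}
    for pair in cell.strip().strip("{}").split(","):
        key, _, value = pair.strip().partition("=")
        out[key] = float(value)
    return out
```

Wrapped as, say, `F.udf(parse_dimensions, MapType(StringType(), DoubleType()))`, the resulting map column can then be split into `height`/`length`/`width` columns with `getItem`.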

biswas N · 381

0 votes · 0 answers
Convert nested JSON schema to PySpark schema
I have a schema which has nested fields. When I try to convert it with:
jtopy = json.dumps(schema_message['SchemaDefinition'])  # json.dumps takes a dictionary as input and returns a string as output
print(jtopy)
…
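If `SchemaDefinition` is in Spark's JSON schema layout, the usual route is `StructType.fromJson`, which takes a parsed dict rather than a JSON string (note `json.dumps` goes the wrong direction here). A hedged sketch with a made-up minimal definition:

```python
import json

# Hypothetical minimal definition in Spark's JSON schema layout;
# nested structs appear as {"type": {"type": "struct", ...}} fields.
schema_definition = """
{"type": "struct", "fields": [
  {"name": "id", "type": "long", "nullable": true, "metadata": {}}
]}
"""
schema_dict = json.loads(schema_definition)

# In the job (requires pyspark):
#   from pyspark.sql.types import StructType
#   spark_schema = StructType.fromJson(schema_dict)
```

`fromJson` recurses into nested struct fields, so the whole nested schema comes back as one `StructType` ready to pass to a reader.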

user3082928 · 71

0 votes · 0 answers
Read from Glue catalog
I'm trying to get the schema from the Glue catalog in AWS Glue Studio, but the job keeps running and never returns. This is the code:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import…

Alan Mil · 189

0 votes · 1 answer
Spark: best way to join a normal-size DataFrame with a very large DataFrame
I have DF1 with ~50k records. DF2 has >5 billion records from S3 Parquet. I need to do a left outer join on an md5 hash in both DFs, but as expected it's slow and expensive.
I tried a broadcast join, but DF1 is quite big as well.
I was wondering what would be…
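One detail that often blocks this: in a left outer join Spark can only broadcast the right side, so the small DF1 has to sit on the right, flipping the join to a right outer. A hedged sketch (the threshold and column name are assumptions, not recommendations):

```python
# Hedged sketch: raise the threshold so a ~50k-row DF1 can broadcast;
# AQE may also convert the join to broadcast at runtime.
spark_conf = {
    "spark.sql.autoBroadcastJoinThreshold": str(256 * 1024 * 1024),
    "spark.sql.adaptive.enabled": "true",
}

# In the job (requires pyspark; "md5_hash" is a hypothetical column):
#   from pyspark.sql.functions import broadcast
#   joined = df2.join(broadcast(df1), "md5_hash", "right_outer")
#   # equivalent to df1.join(df2, "md5_hash", "left"), but now the
#   # small side is on the right and eligible for broadcast.
```

Broadcasting the 50k-row side avoids shuffling the 5B-row side at all, which is usually where the time and cost go in this shape of join.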

OneWorld · 952

0 votes · 1 answer
AWS Glue Exclude Patterns
I am working on a project which is using Glue 3.0 & PySpark to process large amounts of data between S3 buckets. This is being achieved using GlueContext.create_dynamic_frame_from_options to read the data from an S3 bucket to a DynamicFrame using…
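Exclude patterns for this read path go in `connection_options` under `"exclusions"`, which takes a JSON-encoded list of glob patterns (a string, not a Python list) — a common tripping point. A hedged sketch (bucket and patterns are hypothetical):

```python
import json

# Hedged sketch: "exclusions" must be a JSON string containing the
# glob patterns to skip (names here are hypothetical).
connection_options = {
    "paths": ["s3://source-bucket/data/"],
    "recurse": True,
    "exclusions": json.dumps(["**/_SUCCESS", "**.tmp"]),
}
```

This dict would be passed to `GlueContext.create_dynamic_frame_from_options(connection_type="s3", connection_options=connection_options, format=...)`; objects matching any pattern are skipped during the S3 listing.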

samuel · 23

0 votes · 1 answer
Issue developing AWS Glue ETL jobs locally using a Docker container
I am using an Apple M1 Pro Mac and trying to use a Docker container to develop AWS Glue jobs locally rather than in the AWS Console. I have been working through this blog post by AWS and have pulled amazon/aws-glue-libs:glue_libs_3.0.0_image_01 from…

samuel · 23