Highest Voted 'aws-glue-spark' Questions

0

votes

1 answer

SQL Error [XX000]: ERROR: Spectrum Scan Error: DeltaManifest

We have implemented Delta Lake but have run into one issue: a table can be created and ingested, but after new data has been ingested, we get a Spectrum scan error: SQL Error [XX000]: ERROR: Spectrum Scan Error: DeltaManifest Detail: error: Spectrum…

python hadoop-partitioning aws-glue-spark

asked May 12 '22 at 01:57

Frank Tao

1
1

0

votes

0 answers

Unable to write csv file to S3

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
df = spark.read.csv("s3://bucket1/file1.csv", header=True)
df.show(5)
df.write.mode("overwrite").csv("s3://bucket1/file2.csv", header=True)
The write to S3 does…

amazon-web-services amazon-s3 pyspark aws-glue-spark

asked May 03 '22 at 10:47

sheetal_158

7,391
6
27
44

0

votes

0 answers

AWS Glue - glueContext.purge_table leads to "No such file or directory 's3://abc..."

I'm using AWS Glue, and I want to overwrite a Glue catalog with a Glue job. During my Glue job, I call glueContext.purge_table(glue_database, glue_table, options={"retentionPeriod": 0}). On the next line I try to write out the current dataframe…

amazon-web-services aws-glue aws-glue-data-catalog aws-glue-spark

asked Apr 28 '22 at 19:55

Black Dynamite

4,067
5
40
75

0

votes

1 answer

How can I optimize the read from S3?

dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        'recurse':True,
        'groupFiles': 'inPartition',
        'groupSize':…

amazon-s3 aws-glue aws-glue-spark aws-glue3.0

asked Apr 28 '22 at 04:41

sheetal_158

7,391
6
27
44

0

votes

0 answers

Write AWS Glue DynamicFrame to redshift table

amazon-redshift aws-glue-spark

asked Apr 23 '22 at 06:59

Gyan Joshi

11
4

0

votes

1 answer

How to execute REST API call for Glue Dynamic Frame

I need to build a Glue Spark application that transforms raw events and then calls a REST API to push the transformed data. I am using a Glue DynamicFrame to transform the raw events, but I am not able to execute the REST API call. Is there a way to execute REST API or…

boto3 aws-glue aws-glue-spark

asked Apr 14 '22 at 23:06

PB22

31
4

0

votes

1 answer

AWS GLUE SQL join with single row from right table

I'm trying to join two datasets in AWS Glue.
Table 1 (alias af):
id  data      created
1   string 1  2020-02-10
2   string 2  2020-02-11
3   string 3  2020-02-12
Table 2 (alias mp):
id  data      data2  created  foreign_key
1   string 1  json…

mysql apache-spark apache-spark-sql aws-glue aws-glue-spark

asked Apr 12 '22 at 17:28

darkCoffy

103
9

0

votes

1 answer

Environment for print Capture on AWS GLUE

Where can I see, for example, the prints that are written in my AWS GLUE script? Like a terminal screen that shows me the messages that were stored in a print. I need to print the schema being generated for my data output and see if it matches what…

amazon-web-services aws-glue jobs aws-glue-data-catalog aws-glue-spark

asked Apr 07 '22 at 15:02

Rafael Souza

15
3

0

votes

1 answer

Show Method for Dynamic Frame in AWS glue returns empty field

When I call dyF.show(), it returns an empty field, even though I checked the schema and count() and I know the table is populated. When I convert it to a Spark DataFrame, the show() method works fine. I know that this has happened to…

amazon-web-services dataframe apache-spark pyspark aws-glue-spark

asked Apr 06 '22 at 14:04

Stef Kostov

11
1

0

votes

2 answers

How to parse nested column for CSV data in Pyspark?

I am working on a database where the data is stored in CSV format. The DB looks like the following:
id  containertype  size
1   CASE           {height=2.01, length=1.07, width=1.22}
2   PALLET         {height=1.80, length=1.07, width=1.23}
I want to parse the…
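The excerpt above is cut off, so the exact target format is unknown. As a rough sketch only, assuming the goal is to split the size column into numeric height, length and width values, the {key=value, ...} strings could be parsed in plain Python like this. The helper name parse_size and the sample rows list are hypothetical and only mirror the two rows shown in the question.

```python
import re

# Hypothetical helper (not from the question): matches key=number pairs
# inside a string such as "{height=2.01, length=1.07, width=1.22}".
_PAIR = re.compile(r"(\w+)\s*=\s*([-+]?\d+(?:\.\d+)?)")


def parse_size(raw):
    """Parse a '{key=value, ...}' string into a dict of floats."""
    return {key: float(value) for key, value in _PAIR.findall(raw)}


# Sample rows copied from the question excerpt: (id, containertype, size).
rows = [
    (1, "CASE", "{height=2.01, length=1.07, width=1.22}"),
    (2, "PALLET", "{height=1.80, length=1.07, width=1.23}"),
]

# Replace the raw size string with the parsed dict for each row.
parsed = [(row_id, kind, parse_size(size)) for row_id, kind, size in rows]

for row in parsed:
    print(row)
```

In a PySpark job the same function could probably be wrapped in a UDF with a map return type, but that part is not shown here and would depend on the schema the asker actually wants.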

pyspark apache-spark-sql aws-glue-spark

asked Mar 15 '22 at 12:50

biswas N

381
1
16

0

votes

0 answers

Convert Nested JSON Schema to PySpark Schema

I have a schema which has nested fields. When I try to convert it with:
jtopy = json.dumps(schema_message['SchemaDefinition'])  # json.dumps takes a dictionary as input and returns a string as output.
print(jtopy) …

json python-3.x apache-spark pyspark aws-glue-spark

asked Mar 10 '22 at 20:49

user3082928

71
7

0

votes

0 answers

Read from Glue catalog

I'm trying to get the schema from the Glue catalog in AWS Glue Studio, but the job keeps running and never returns. This is the code:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import…

aws-glue aws-glue-data-catalog aws-glue-spark

asked Mar 03 '22 at 08:08

Alan Mil

189
1
2
9

0

votes

1 answer

Spark: Best way to join normal size Dataframe with very large Dataframe

I have DF1 with ~50k records. DF2 has more than 5 billion records read from Parquet files on S3. I need to do a left outer join on an md5 hash present in both DFs, but as expected it's slow and expensive. I tried a broadcast join, but DF1 is quite big as well. I was wondering what would be…

apache-spark pyspark etl aws-glue aws-glue-spark

asked Mar 02 '22 at 09:11

OneWorld

952
2
8
21

0

votes

1 answer

AWS Glue Exclude Patterns

I am working on a project which is using Glue 3.0 & PySpark to process large amounts of data between S3 buckets. This is being achieved using GlueContext.create_dynamic_frame_from_options to read the data from an S3 bucket to a DynamicFrame using…

amazon-web-services amazon-s3 aws-glue aws-glue-spark

asked Feb 24 '22 at 15:57

samuel

23
5

0

votes

1 answer

Issue developing AWS Glue ETL jobs locally using a Docker container

I am using an Apple M1 Pro Mac and trying to use a Docker container to develop AWS Glue jobs locally instead of using the AWS Console. I have been working through this blog post by AWS and I have pulled amazon/aws-glue-libs:glue_libs_3.0.0_image_01 from…

amazon-web-services docker aws-glue aws-glue-spark aws-glue-connection

asked Feb 18 '22 at 09:56

samuel

23
5
