Questions tagged [aws-glue-spark]
244 questions
1
vote
1 answer
AWS Glue Resolve Column Choice as Array or Struct
I've run out of ideas on how to solve the following issue. A table in the Glue Data Catalog has this schema:
root
|-- _id: string
|-- _field: struct
| |-- ref: choice
| | |-- array
| | | |-- element: struct
| | | | |--…

CPak
1
vote
0 answers
How to write data from AWS Glue to DocumentDB
I am working on a personal project that entails creating an AWS Glue job that will do some basic transformations on data and move it to a DocumentDB database.
The main problem I am having right now is that I am unable to move the data to the DocumentDB…

KoalaKey
1
vote
0 answers
How to process many tables using AWS Glue
As part of data validation, I have a use case that involves processing many tables, almost 2,000 in total. Because of a tight SLA, many tables now need to be processed concurrently.
Due to the Glue concurrency limit of 50 (which I got increased to 100…

Ankur Shrivastava
1
vote
1 answer
Insert into SQL Server table selected columns from spark dataframe
I have a SQL Server table that has a different schema than my dataframe. I would like to select some columns from my dataframe and "insert into" the table the values I selected.
Basically, something similar to the code below, but in PySpark:
INSERT…

AJR
1
vote
2 answers
Use of ResolveChoice in Glue
I was able to create a small Glue job to ingest data from one S3 bucket into another, but I am not clear about the last few lines in the code (below).
applymapping1 = ApplyMapping.apply(frame = datasource_lk, mappings = [("row_id", "bigint", "row_id",…

NikRED
1
vote
1 answer
AWS Glue - Flatten deeply nested JSON
I would like to know if there is a way to flatten deeply nested JSON using a Glue ETL job. The JSON has nested arrays in it. I tried to run a Glue crawler on the JSON, which returned a catalog with just one field, PerPlayer, with a struct data type.
In the…

srmk
1
vote
1 answer
AWS Glue Python shell Configuration DPU
Does the 1 DPU setting change when I use Glue Python shell instead of Glue Spark?
I recently saw the post "Maximum number of concurrent tasks in 1 DPU in AWS Glue", but it discusses Glue Spark, not the AWS Glue Python shell, that's why…

masterdevsshm83_
1
vote
0 answers
Issue running aws glue job locally
I'm trying to run a Glue job locally, but I'm facing a problem: when I run my script, an exception is raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o47.getDynamicFrame.
: java.lang.IllegalAccessError: tried to access method…
1
vote
2 answers
Spark Performance issue - Writing partitions to S3 as individual files
I'm running a Spark job that scans a large file and splits it into smaller files. The file is in JSON Lines format, and I'm trying to partition it by a certain column (id) and save each partition as a separate file to S3. The file size is…

lalatnayak
1
vote
1 answer
Error in creating table with column name containing dot (.) in Amazon Athena even after escaping the dot with backticks(`)
As per https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html,
Special characters
Special characters other than underscore (_) are not supported. For
more information, see the Apache Hive LanguageManual…

Narendra Damodardas Modi
0
votes
1 answer
AWS Glue: How to filter out data from DynamicFrame when date format is wrong or bad data
In AWS Glue, after extracting data into a DynamicFrame, I convert the date-time values to UTC. But if a date format is wrong (e.g. an invalid value for a date), it breaks the entire Glue flow.
So I want to filter out this bad data from the DynamicFrame…

Yadav
0
votes
0 answers
How to set AWS Glue proxy settings
I'm trying to set a proxy inside a Glue script in order to connect to an external source, Snowflake.
But none of the approaches below worked.
Approach 1: Added the proxy to the environment variables
os.environ['USE_PROXY'] = 'true'
os.environ['http_proxy'] =…

Raju
0
votes
0 answers
AWS Glue job outputs many small files
I have an AWS Glue job that I created using the Glue job visualizer.
The job reads data from S3 using the Glue catalog and Spark, aggregates the data, and stores it in new S3 objects partitioned by day. The output data will be queried later.
I see that the…

guylot
0
votes
2 answers
Convert pyspark script to awsglue script
I have a bunch of existing PySpark scripts that I want to execute using AWS Glue. The scripts use APIs like SparkSession.read and various transformations on PySpark DataFrames.
I wasn't able to find docs outlining how to convert such a script. Do…

jusatdeloitte
0
votes
0 answers
How to convert a Spark DataFrame to a pandas DataFrame in AWS Glue
I read data from Snowflake into AWS Glue using Spark, which results in a Spark DataFrame called df. After that, I added the following to convert it to a pandas DataFrame:
df2 = df.toPandas()
However, this is causing an error in AWS Glue.

gblm