Questions tagged [aws-glue-spark]

244 questions
0
votes
0 answers

Error reading a CSV file from an AWS Glue catalog table when the data also contains commas

I have an S3 file in CSV format, which is read by an AWS Glue job using the AWS Glue catalog. There are 3 fields in the S3 file, as follows:
ID,NAME,COMMENT
1,"XYZ","COMMENT1,COMMENT2"
2,"abc","COMMENT3"
3,"mno","COMMENT4"
The issue is while reading…
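A minimal sketch of why the sample rows above need a quote-aware parser, using Python's standard `csv` module rather than Glue itself. The helper name `parse_rows` is hypothetical. In a Glue job the rough equivalent would be a quote-aware CSV SerDe on the catalog table or a quote character option on the reader, but that is an assumption about the setup, not something the excerpt confirms.

```python
import csv
import io

# Sample rows copied from the question excerpt.
SAMPLE = '''ID,NAME,COMMENT
1,"XYZ","COMMENT1,COMMENT2"
2,"abc","COMMENT3"
3,"mno","COMMENT4"
'''


def parse_rows(text):
    """Parse CSV text with a quote-aware reader and return the rows as lists."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    return list(reader)


rows = parse_rows(SAMPLE)

# A naive split on commas breaks the quoted COMMENT field into two pieces.
naive = SAMPLE.splitlines()[1].split(",")
```

With the quote-aware reader, the first data row keeps `COMMENT1,COMMENT2` as a single field, while the naive split yields four pieces instead of three.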
0
votes
1 answer

How to drop a duplicate column in a Glue job when Glue creates duplicate columns

I have created a Glue job, and it creates a duplicate column once I run the crawler on the transformed file. How can I drop the duplicate column? I know there is a DropNullFields function, but it drops null fields, not duplicate columns. What…
Parag Shahade
  • 57
  • 3
  • 8
0
votes
1 answer

How to debug an AWS Glue PySpark job

I have an AWS Glue PySpark job which runs for a long time after a certain command. Nothing is written to the log after that command, not even a simple "print hello" statement. How can I debug an AWS Glue PySpark job which is long running and not even…
0
votes
0 answers

Strange behavior when editing AWS Glue driver script in AWS console

I have a couple of different Glue jobs (all created from Terraform, but with different workspaces for testing purposes). The Glue driver scripts differ slightly, and they are stored in an S3 bucket, then pointed to the targeted…
wawawa
  • 2,835
  • 6
  • 44
  • 105
0
votes
1 answer

Is there a way to define an AWS Glue input path with a wildcard?

I have a Glue job that looks at the files for the current date (each date has a folder in S3) and processes the data in that folder (e.g. "s3://bucket_name/year/month/day"). Now I want to find a way to define the input S3 path which tells Glue to look…
wawawa
  • 2,835
  • 6
  • 44
  • 105
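For the wildcard-path question above, one option that does not depend on glob support is to build the list of daily prefixes explicitly and hand that list to whatever reader accepts multiple input paths. The sketch below is plain Python; the function name `daily_prefixes` and the zero-padded `YYYY/MM/DD` layout are assumptions, since the excerpt only shows `year/month/day`.

```python
from datetime import date, timedelta


def daily_prefixes(bucket, start, end):
    """Return one s3://bucket/YYYY/MM/DD/ prefix per day, start to end inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    prefixes = []
    current = start
    while current <= end:
        # Assumed layout: zero-padded year/month/day folders.
        prefixes.append(f"s3://{bucket}/{current:%Y/%m/%d}/")
        current += timedelta(days=1)
    return prefixes


paths = daily_prefixes("bucket_name", date(2021, 1, 30), date(2021, 2, 1))
```

The resulting list crosses month boundaries correctly, which a hand-written pattern per month would have to handle separately.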
0
votes
1 answer

An error occurred while calling o79.getDynamicFrame. [Amazon](500310) Invalid operation: syntax error at or near "s_next_of_kin"

I have a table in Redshift with a column named agent's_next_of_kin. Note that the name contains an apostrophe followed by an s. When I read it into my DynamicFrame with Glue, it gives me the above error reporting a syntax issue. How can I make…
0
votes
1 answer

AWS Glue Studio inner join gives an error when one of the Data Catalog tables has no records

I am new to AWS Glue Studio. I have created two tables in the AWS Glue database, partitioned by the current date. I am doing an inner join and a left anti join to process the job. If there is no match, my Glue job fails with the error AnalysisException:…
0
votes
2 answers

Create a glue job that splits an array into rows?

I currently have data arriving from Firehose into an Athena table. When I view the data, it is an array of JSON. Is it possible to use a Glue job to split the arrays into separate rows so that each row is its own JSON log? For example: Data…
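The reshaping asked about above can be sketched in plain Python to show the intended input and output. This is a stand-in for the Glue job, not a Glue implementation; in Spark the same idea is usually expressed with the `explode` function on an array column. The function name `explode_json_arrays` and the sample fields `id` and `msg` are hypothetical, since the excerpt does not show the real log structure.

```python
import json


def explode_json_arrays(lines):
    """Yield one compact JSON string per element of each JSON array line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parsed = json.loads(line)
        # Treat a bare object as a one-element array so mixed input still works.
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            yield json.dumps(item, separators=(",", ":"))


# Hypothetical input: the real log fields are not shown in the excerpt.
sample = [
    '[{"id": 1, "msg": "a"}, {"id": 2, "msg": "b"}]',
    '[{"id": 3, "msg": "c"}]',
]
rows = list(explode_json_arrays(sample))
```

Each array element becomes its own row, so two input lines holding three objects produce three output rows.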
0
votes
0 answers

How to run a Python shell Glue job using the Glue resources?

Python shell jobs run on AWS Glue, so they use the DPUs assigned to Glue. I was going through some tutorials that ran SQL queries which triggered Redshift. My concern was that the computation is happening on Redshift which…
0
votes
1 answer

resolveChoice for Glue data frame not working

I have a Glue data frame with the following structure. Due to some historical data, we have differences in the structure. When I try to change the structure, resolveChoice is not working.
|-- logs: array
|    |-- element: struct
|    |    |--…
Tobias Bruckert
  • 348
  • 2
  • 12
0
votes
1 answer

Executing Spark SQL in AWS Glue returns the column names in the query rather than values

Running Spark SQL in AWS Glue returns the column names in the query. Data:
product,price,quantityinKG
mango,100,1
apple,200,3
peach,200,2
mango,200,2
My test query, e.g.: select product,sum(price) from myDataSource …
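As a reference point for the values such a query should return, here is a small sketch that loads the sample data above into an in-memory SQLite table and aggregates it. SQLite stands in for Spark SQL here, and because the query in the excerpt is cut off, the `GROUP BY product` and `ORDER BY product` clauses are assumptions. The function name `total_price_by_product` is hypothetical.

```python
import sqlite3

# Sample rows copied from the question excerpt.
ROWS = [
    ("mango", 100, 1),
    ("apple", 200, 3),
    ("peach", 200, 2),
    ("mango", 200, 2),
]


def total_price_by_product(rows):
    """Sum price per product with an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE myDataSource "
            "(product TEXT, price INTEGER, quantityinKG INTEGER)"
        )
        conn.executemany("INSERT INTO myDataSource VALUES (?, ?, ?)", rows)
        # GROUP BY and ORDER BY are assumed; the original query is truncated.
        cursor = conn.execute(
            "SELECT product, SUM(price) FROM myDataSource "
            "GROUP BY product ORDER BY product"
        )
        return cursor.fetchall()
    finally:
        conn.close()


totals = total_price_by_product(ROWS)
```

If the Glue job returns the literal strings `product` and `price` instead of values like these, comparing against a local result can help confirm that the problem lies in how the data is read rather than in the query itself.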
0
votes
1 answer

Glue: map/process a source table's column data and write it to columns in a pre-existing Redshift table

I am very new to Glue and came across a scenario where we have a source table in the Glue catalog and need to write its data to specific columns in a pre-existing table in Redshift, e.g. source_table_name[source_table_column_name]. …
newbieitTech
  • 57
  • 1
  • 7
0
votes
1 answer

AWS Glue: Unable to process data from multiple sources (S3 bucket and PostgreSQL DB) using Scala-Spark

For my requirement, I need to join data in a PostgreSQL DB (hosted in RDS) with a file in an S3 bucket. I have created a Glue job (spark-scala) which should connect to both PostgreSQL and the S3 bucket and complete the processing. But the Glue job encounters…
Swapnil
  • 11
  • 1
  • 2
0
votes
1 answer

Unable to convert from a Spark DataFrame to an AWS Glue DynamicFrame

I have a Spark DataFrame named cost_matrix. I am trying to convert it to an AWS Glue DynamicFrame using the following line of code: glue_cost_matrix = DynamicFrame.fromDF(cost_matrix, glueContext, 'glue_cost_matrix') However, I'm…
brenda
  • 656
  • 8
  • 24
0
votes
0 answers

Loading data from AWS EMR to Redshift using Glue is very slow

I am trying to load data from AWS EMR (data stored in S3, with the Glue catalog as the metastore) to Redshift.
import sys
import boto3
from datetime import datetime, date
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from…