Questions tagged [aws-glue-spark]

244 questions
1
vote
0 answers

How to work with schema returned by 'get_catalog_schema_as_spark_schema'?

Example: schema = glueContext.get_catalog_schema_as_spark_schema(database=args['Database'], table_name=args['Table']). If I simply print the returned schema, I can see the StructType/StructField structure, something similar to: StructType( …
GSazheniuk
1
vote
2 answers

DataFrame remove rows existing in another DataFrame

I have two data frames:

df1:
+----------+-------------+-------------+--------------+---------------+
|customerId|     fullName|   telephone1|    telephone2|          email|
+----------+-------------+-------------+--------------+---------------+
| …
TurboAza
1
vote
1 answer

AWS glue pyspark: java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

I'm trying to read a CSV file from S3 in my AWS Glue PySpark script. The following is a snippet of the code:

import sys
import os
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import…
1
vote
2 answers

AWS Glue PySpark: remove struct in an array but keep the data and save into DynamoDB

A DynamoDB table is exported to S3, and an AWS Glue crawler crawls the S3 data. AWS Glue jobs take the source from the crawled data, and here's the schema that was transformed by MergeLineItems: def MergeLineItems(rec): rec["lineItems1"] = {} a =…
Minah
1
vote
1 answer

AWS Glue Bad value for type BigDecimal : NaN

I'm trying to export a table I crawled from a Postgres (RDS) database into Glue. There's one field with a decimal(10, 2) type. Now I have several problems. Exporting the table from Glue (using Spark 2.4, 3.1 Python 3) into S3 with the following…
1
vote
1 answer

AWS Glue assigned all tasks to the same worker

I have an AWS Glue job whose work is very simple: break large gzipped CSV files into 1GB ones. In my test, I uploaded 4 files into the bucket, each around 5GB. Yet the job always assigns all files to a single worker instead of distributing across…
1
vote
0 answers

Convert Glue column datatype to Spark metadata

I have a Glue column whose datatype in Glue is struct. However, when Spark infers this schema, it converts this Glue type to Spark metadata and saves it to the Glue table properties as follows: "name": "columnName", "type":…
1
vote
0 answers

Error running Spark Glue jobs after registering a DF as a tempView

Explanation: when I create a DF from a dynamic frame, it works fine, and I am able to write the DataFrame back to a dynamic frame. But when I register the DataFrame as a temp view with createOrReplaceTempView, it throws this error. The number of…
1
vote
1 answer

Glue PySpark Job: An error occurred while calling o100.pyWriteDynamicFrame

I am building a data pipeline for migrating data from an S3 bucket to Snowflake via AWS Glue by creating a custom connector in AWS Glue. I am getting the error below when running the Glue job: An error occurred while calling o100.pyWriteDynamicFrame. Glue ETL…
1
vote
1 answer

How do you specify Project ID in the AWS Glue to BigQuery connector?

I'm trying to use the AWS Glue connector to BigQuery following the tutorial in https://aws.amazon.com/blogs/big-data/migrating-data-from-google-bigquery-to-amazon-s3-using-aws-glue-custom-connectors/ but after following all steps I get a: :…
1
vote
1 answer

Creating dynamic frame issue without the pushdown predicate

New to AWS Glue, so pardon my question: why do I get an error when I don't include a pushdown predicate when creating the dynamic frame? I am trying to create it without the predicate because I will be using bookmarks, so only new files will be processed regardless…
1
vote
0 answers

AWS Glue not able to access database in VPC

I have an AWS Glue job that uses Spark and Scala, with JDBC connections specified in the script, for custom ETL and data decryption. When running the job in an environment where the databases are not publicly available, the jobs fail with…
1
vote
1 answer

How would changing the read in AWS Glue change a column's data type?

I have an AWS Glue job that was slightly modified; only the read was changed. The job runs fine, but the datatypes of my columns have changed. Where I previously had BigInt, I now just have Ints. This is causing an EMR job dependent on these files…
sgallagher
1
vote
0 answers

Why does Spark SQL add double quotes to some string concat() but not to others? I do not want quotes around numeric fields

Please note that I do not want double quotes around all fields, just strings. Working in AWS Glue Studio, if I have select concat(ref_alpha, '!', ref_beta) and send it to a CSV file I get "AB12!RT45", but if I have concat(ref_alpha, 'T', ref_beta) I…
1
vote
1 answer

Unable to access csv file generated by a jar file in AWS Glue

This is my first question here! We're working on some MDM-related stuff wherein we need to run a jar file provided by our MDM partner to merge the records. We are able to call the subprocess() method in our AWS Glue script to run the jar file.…