Questions tagged [aws-glue-spark]

244 questions
0 votes
0 answers

SQL Server bcp tool in an AWS Glue job

Has anyone tried the SQL Server bcp utility in an AWS Glue job (Python shell or Spark type)? The bcp tool needs to be installed using sudo yum commands, but those are not supported on Glue. sudo yum install mssql-tools unixODBC-devel reference link from…
0 votes
0 answers

How to trigger a Glue job from another Glue job

Is it possible to trigger a Glue job (PySpark) from another Glue job (PySpark) using boto3? Everything seems to work fine (no syntax or code errors) except the boto3 method glue_client.start_job_run(). I tested similar code in Lambda and it's…
0 votes
1 answer

Reading Spark Dataframe from Partitioned Parquet data

I have Parquet data stored on S3 and an Athena table partitioned by id and date. The Parquet files are stored in s3://bucket_name/table_name/id=x/date=y/ and contain the partition columns (id, date), because of which I am not…
AswinRajaram
0 votes
0 answers

How to catch an exception thrown from an imported module in PySpark

I want to catch an exception thrown from an imported module and re-raise it so that the job fails with the same exception. For example, ------a.py---------- def check(a, b): try: # Check something except Exception as e: raise…
Tushar Patil
0 votes
1 answer

AWS Glue issue causing a PicklingError

I'm running into an issue with AWS Glue: when I run a Map.apply function on a DataFrame to decrypt a given column value, it throws an error. The error I'm getting is PicklingError: Could not serialize object: TypeError: can't pickle…
0 votes
1 answer

Glue/Spark: Filter a large dynamic frame with thousands of conditions

I am trying to filter a time-series Glue dynamic frame with millions of rows of data like this:

id  val  ts
a   1.3  2022-05-03T14:18:00.000Z
a   9.2  2022-05-03T12:18:00.000Z
c   8.2  2022-05-03T13:48:00.000Z

I have another pandas DataFrame with thousands…
0 votes
2 answers

Is it possible to use Spark 3.3.0 in AWS Glue 3.0?

I would like to use Spark 3.3.0 features such as Trigger.availableNow in AWS Glue 3.0 with Scala, but AWS Glue 3.0 uses Apache Spark 3.1.1. Is there any way to use Apache Spark 3.3.0 in AWS Glue 3.0 with Scala?
krishna Prasad
0 votes
1 answer

How to calculate the number of G.1X workers in AWS Glue for processing 1 TB of data?

I have 1 TB of Parquet data in S3 to be loaded in an AWS Glue Spark job. I am trying to figure out the number of workers needed for this workload. As I understand it, these are the details of the G.1X configuration: 1 DPU added for MasterNode …
0 votes
0 answers

File conversion XML to JSON in S3 through AWS Glue

My bucket structure is shown below, and XML files land in this S3 bucket folder (S3:/Fin-app-ops/data-ops/raw-d). I need to convert those XML files to JSON files and put them back in S3 in the same bucket but a different…
Sarath
0 votes
3 answers

AWS Glue job (PySpark) to AWS Glue Data Catalog

The usual procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write to an S3 bucket (e.g. as CSV), then use a crawler and schedule it. Is there any other way of writing to the AWS Glue Data Catalog? I am looking for a direct way…
0 votes
0 answers

PySpark: input_file_name() returns empty string when reading json.gz file

I am trying to get file names (file format: json.gz) using the input_file_name() function in PySpark. Below is the code: df.withColumn("source_file", sql_f.element_at(sql_f.split(sql_f.input_file_name(), "/"), -1)) It returns an empty string. Below is the…
0 votes
1 answer

Writing each row in a spark dataframe to a separate json

I have a fairly large DataFrame (million rows), and the requirement is to store each row in a separate JSON file. For this DataFrame:

root
 |-- uniqueID: string
 |-- moreData: array

The output should be stored like below for all the…
Thal
0 votes
1 answer

AWS Glue - IllegalArgumentException: Duplicate value for path

I have a messy data source where some field values can come in under two different names but should map to one conformed field name in the output. For example, the data source contains update_date or modified_date, and both should map to timestamp. Both field…
Alex R
0 votes
0 answers

Glue dynamic frame: parse text file with ¶ delimiter

I have a text file which looks like this:

HDR¶20200101
BDY¶1¶Jimmy
BDY¶1¶Something
TRL¶123

I would like to parse it into a Glue dynamic frame, filtering out the header and trailer, and assign the column names ID, Name. I tried the below code and it…
need_the_buzz
0 votes
1 answer

AWS Glue image certificate-related issue

I am new to Docker. Please help me resolve this issue. I have created the Docker Compose file below:

version: "2"
services:
  spark:
    image: glue/spark:latest
    container_name: spark
    build: ./spark
    hostname: spark
    ports:
      -…
pbh