Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework that extends PySpark with functionality for increased schema flexibility
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment.

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
4003 questions
1
vote
1 answer

AWS Glue: Get list of objects read by create_dynamic_frame.from_options

I'm using create_dynamic_frame.from_options to read CSV files into a Glue DynamicFrame. My Glue job uses bookmarks, and from_options has both a transformation_ctx configured and recursive search enabled. dyf =…
Mat
  • 1,345
  • 9
  • 15
1
vote
1 answer

How to retrieve original S3 data from AWS Athena

I have an S3 bucket archiving JSON objects via Kinesis Firehose. Each bucket object can contain multiple JSON objects that can vary in the schema. Bucket structure bucket └── archive └── 2021 └── 04 ├── 11 | ├──…
Aki K
  • 1,222
  • 1
  • 27
  • 49
1
vote
2 answers

Unable to run spark.sql on AWS Glue Catalog in EMR when using Hudi

Our setup is a default data lake on AWS using S3 as storage and the Glue Catalog as our metastore. We are starting to use Apache Hudi, and we got it working by following the AWS documentation. The issue is that, when using the…
gabra
  • 9,484
  • 4
  • 29
  • 45
1
vote
0 answers

Crawler not creating table in data lake from Postgres partitioned table

My table is partitioned in Postgres. I have created a Glue crawler to create the table. I selected the option "Update all new and existing partitions with metadata from the table" in Configure the crawler's output. Since it's partitioned, the table is…
1
vote
1 answer

How to add partitions not in dynamic frame while writing data to S3 in AWS Glue script

While writing data to S3 using a dynamic frame, I want to use partitioning columns that are not in the dynamic frame. For example: def write_date(outpath,year): glue_context.write_dynamic_frame.from_options( frame = projectedEvents, …
Beginner
  • 71
  • 1
  • 3
  • 10
1
vote
2 answers

How to pass AWS Glue external Spark packages?

I'd like to read, for example, GCP BigQuery tables in AWS Glue. I know that in Spark it is possible to declare dependencies for connecting to specific data sources. How can I do that within the AWS Glue environment and pass such dependencies?
Vzzarr
  • 4,600
  • 2
  • 43
  • 80
1
vote
1 answer

Getting null when trying to read a column with value '-' from AWS Glue catalog table

I am reading an Athena table with a column named br_book_gl1 whose values are '-' and '+'. I get the '+' values when reading it as a Glue catalog table, but for '-' values I get null. The datatype is String in…
1
vote
1 answer

Aggregate multiple S3 files into one file

I enabled a Firehose stream to write data to S3. Firehose writes data to an S3 file at a maximum interval of 900s. This means around 100 files will be created within one day, which is a burden for users to download manually. Is there a solution to…
1
vote
1 answer

AWS Glue with PySpark - DynamicFrame export to S3 fails partway through with UnsupportedOperationException

I should preface this by saying I've been using AWS Glue Studio to learn how to use Glue with PySpark, and so far it's been going really well. That was until I encountered an error which I cannot understand (let alone solve). An example of the data…
Jamie
  • 1,530
  • 1
  • 19
  • 35
1
vote
0 answers

AWS Glue Relationalize array in json data to new table in postgres DB

According to the documentation, Glue is able to convert a semi-structured schema to a relational schema. Currently I'm able to create the schema using a crawler, and to store my data from S3 to Postgres using the job script generated by AWS Glue…
jiale ko
  • 139
  • 1
  • 13
1
vote
1 answer

AWS Athena - merge small parquet files or leave them?

I have a lot of small parquet files that are read via AWS Glue into Athena. I know that small parquet files (35k or so each due to the way the log outputs them) are not ideal but once they are read into the data catalog, does it matter anymore? In…
Rob M
  • 55
  • 1
  • 5
1
vote
1 answer

What are the differences between AWS sagemaker and sagemaker_pyspark?

I'm currently running a quick Machine Learning proof of concept on AWS with SageMaker, and I've come across two libraries: sagemaker and sagemaker_pyspark. I would like to work with distributed data. My questions are: Is using sagemaker the…
1
vote
1 answer

Is there a function similar to awsglue's getResolvedOptions that will work in an Azure Databricks notebook using Python?

I'm moving code into Azure and am wondering what function would be used to accomplish the same thing in Python, but without awsglue: import sys from awsglue.utils import getResolvedOptions args = getResolvedOptions(sys.argv, ['JOB_NAME',…
NLG123
  • 31
  • 3
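
For readers with the same question, here is a minimal sketch of a standard-library stand-in that resolves `--NAME value` pairs into a dictionary, which is roughly how the excerpt above uses getResolvedOptions. The function name `resolve_options` and the sample arguments are hypothetical; this is not part of awsglue or Databricks, and it does not try to reproduce everything getResolvedOptions does. Databricks notebooks usually receive parameters through widgets instead, so this applies mainly where the code really is launched with command-line arguments.

```python
import argparse


def resolve_options(argv, option_names):
    """Return {name: value} for each required --NAME VALUE pair found in argv.

    Hypothetical helper for illustration only; not an awsglue or Databricks API.
    """
    # allow_abbrev=False so that e.g. --JOB is not silently treated as --JOB_NAME.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    for name in option_names:
        parser.add_argument(f"--{name}", required=True)
    # argv[0] is the script path; unrecognised arguments are ignored, not rejected.
    known, _unknown = parser.parse_known_args(argv[1:])
    return vars(known)


# Sample argument list standing in for sys.argv (values are made up).
args = resolve_options(
    ["job.py", "--JOB_NAME", "demo", "--extra", "ignored"],
    ["JOB_NAME"],
)
print(args["JOB_NAME"])
```

In a real script the first argument would be `sys.argv`; a missing required option makes argparse exit with a usage message.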
1
vote
0 answers

Facing issue while updating job in AWS Glue Job

I am new to AWS Glue Studio. I am trying to create a job involving multiple joins and custom code, reading data from the Glue catalog and writing it to an S3 bucket. It was working fine until recently. I only increased more number of…
1
vote
1 answer

AWS Glue Oracle R12 Connection Successful but then timeout

I have a connection from AWS Glue to Oracle R12 and it seems to work fine when I test it in the "connections" section of AWS Glue: p-*-oracleconnection connected successfully to your instance. I can crawl all the tables etc. and get the whole…
ck3mp
  • 391
  • 5
  • 18