Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework which extends PySpark with functionality for increased schema flexibility.
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment.

Data registered in the AWS Glue Data Catalog is available to many AWS Services, including

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
4003 questions
21
votes
4 answers

Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0

When switching from Glue 2.0 to 3.0, which means also switching from Spark 2.4 to 3.1.1, my jobs start to fail when processing timestamps prior to 1900 with this error: An error occurred while calling…
Robert Kossendey
  • 6,733
  • 2
  • 12
  • 42
21
votes
6 answers

AWS Glue executor memory limit

I found that AWS Glue sets up the executor instance with a memory limit of 5 GB (--conf spark.executor.memory=5g), and sometimes, on big datasets, it fails with java.lang.OutOfMemoryError. The same goes for the driver instance (--spark.driver.memory=5g). Is there…
Alexey Bakulin
  • 1,229
  • 2
  • 13
  • 15
20
votes
1 answer

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my script submits 16 CTAS queries each of which takes…
Ilya Kisil
  • 2,490
  • 2
  • 17
  • 31
20
votes
4 answers

Add a partition on glue table via API on AWS?

I have an S3 bucket which is constantly being filled with new data. I am using Athena and Glue to query that data. The thing is, if Glue doesn't know that a new partition has been created, it doesn't know that it needs to search there. If I make an API…
Gudzo
  • 639
  • 2
  • 8
  • 21
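For the partition question above, one possible approach is to register each new partition through the Glue API directly. The sketch below is a hedged illustration, not an accepted answer from the thread: it assumes a Hive-style `key=value` folder layout under the table location, and the helper names `build_partition_input` and `add_partition` are hypothetical. The `get_table` and `create_partition` calls are the boto3 Glue client operations; the client itself is passed in by the caller.

```python
import copy


def build_partition_input(table, values):
    """Build a PartitionInput dict for a new partition of `table`.

    `table` is the "Table" dict as returned by Glue's GetTable API;
    `values` are the partition values, in the same order as the
    table's PartitionKeys.
    """
    keys = [k["Name"] for k in table["PartitionKeys"]]
    if len(keys) != len(values):
        raise ValueError(f"expected {len(keys)} partition values, got {len(values)}")

    # Reuse the table's storage settings, changing only the location.
    # Assumption: partitions live in Hive-style key=value subfolders.
    descriptor = copy.deepcopy(table["StorageDescriptor"])
    suffix = "/".join(f"{k}={v}" for k, v in zip(keys, values))
    descriptor["Location"] = descriptor["Location"].rstrip("/") + "/" + suffix + "/"

    return {"Values": [str(v) for v in values], "StorageDescriptor": descriptor}


def add_partition(glue_client, database, table_name, values):
    """Register one partition using a boto3 Glue client supplied by the caller."""
    table = glue_client.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue_client.create_partition(
        DatabaseName=database,
        TableName=table_name,
        PartitionInput=build_partition_input(table, values),
    )
```

With a real client this might be called as `add_partition(boto3.client("glue"), "mydb", "mytable", ["2017", "11"])`, where the database and table names are placeholders.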
20
votes
2 answers

AWS Glue issue with double quote and commas

I have this CSV file: reference,address V7T452F4H9,"12410 W 62TH ST, AA D" The following options are being used in the table definition ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'quoteChar'='\"', …
ln9187
  • 730
  • 1
  • 7
  • 23
20
votes
4 answers

AWS Glue pricing against AWS EMR

I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between EMR & Glue. I have considered 6 DPUs (4 vCPUs + 16 GB memory) with an ETL job running for 10 minutes for 30 days. Expected crawler requests is assumed to be 1…
Yuva
  • 2,831
  • 7
  • 36
  • 60
18
votes
8 answers

Optional job parameter in AWS Glue?

How can I implement an optional parameter for an AWS Glue job? I have created a job that currently has a string parameter (an ISO 8601 date string) as an input that is used in the ETL job. I would like to make this parameter optional, so that the…
matsev
  • 32,104
  • 16
  • 121
  • 156
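For the optional-parameter question above, a common workaround is to inspect `sys.argv` before resolving arguments, since Glue passes job parameters as `--name value` pairs. The following is a minimal sketch under that assumption; `get_optional_arg` is a hypothetical helper and `start_date` is an illustrative parameter name, neither taken from the question.

```python
import sys


def get_optional_arg(argv, name, default=None):
    """Return the value given for --name in argv, or default if it is absent.

    Handles both the "--name value" and "--name=value" forms.
    """
    flag = f"--{name}"
    for i, token in enumerate(argv):
        if token == flag and i + 1 < len(argv):
            return argv[i + 1]
        if token.startswith(flag + "="):
            return token[len(flag) + 1:]
    return default


if __name__ == "__main__":
    # "start_date" is a hypothetical parameter name used for illustration.
    start_date = get_optional_arg(sys.argv, "start_date", default=None)
    print(start_date)
```

Inside a real job, the same presence check could instead guard a call to `getResolvedOptions` from `awsglue.utils`, so that the parameter is only requested when it was actually supplied.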
17
votes
1 answer

Spark dynamic frame show method yields nothing

So I am using AWS Glue auto-generated code to read a CSV file from S3 and write it to a table over a JDBC connection. Seems simple: the job runs successfully with no errors, but it writes nothing. When I checked the Glue Spark Dynamic Frame it does contents…
PyRaider
  • 607
  • 4
  • 11
  • 21
17
votes
4 answers

How to list all databases and tables in AWS Glue Catalog?

I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console. How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does…
Jiří Mauritz
  • 421
  • 1
  • 4
  • 11
16
votes
5 answers

How to Convert Many CSV files to Parquet using AWS Glue

I'm using AWS S3, Glue, and Athena with the following setup: S3 --> Glue --> Athena My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data. Since I'm using Athena, I'd like to convert the CSV files…
mark s.
  • 656
  • 2
  • 7
  • 14
16
votes
13 answers

Use AWS Glue Python with NumPy and Pandas Python Packages

What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a completed script within Python I would like to run in AWS Glue that utilizes NumPy and Pandas.
jumpman23
  • 385
  • 2
  • 5
  • 13
15
votes
2 answers

convert spark dataframe to aws glue dynamic frame

I tried converting my Spark dataframes to dynamic frames to output as glueparquet files, but I'm getting the error 'DataFrame' object has no attribute 'fromDF'. My code makes heavy use of Spark dataframes. Is there a way to convert from spark dataframe to…
user3476463
  • 3,967
  • 22
  • 57
  • 117
15
votes
4 answers

AWS Glue cannot create database from crawler: permission denied

I am trying to use an AWS Glue crawler on an S3 bucket to populate a Glue database. I run the Create Crawler wizard, select my datasource (the S3 bucket with the avro files), have it create the IAM role, and run it, and I get the following…
mhamrah
  • 9,038
  • 4
  • 24
  • 22
15
votes
6 answers

AWS Glue Crawler adding tables for every partition?

I have several thousand files in an S3 bucket in this form: ├── bucket │ ├── somedata │ │   ├── year=2016 │ │   ├── year=2017 │ │   │   ├── month=11 │ │   | │   ├── sometype-2017-11-01.parquet │ | | | ├──…
chazzmoney
  • 221
  • 2
  • 9
15
votes
9 answers

AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3

Part One: I ran a Glue crawler on a dummy CSV loaded in S3. It created a table, but when I try to view the table in Athena and query it, it shows zero records returned. The demo ELB data in Athena works fine, though. Part Two (Scenario): Suppose I have a…
Kush Vyas
  • 5,813
  • 2
  • 26
  • 36