Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
A distributed data processing framework that extends PySpark with functionality for increased schema flexibility
Code generation tools to template and bootstrap data processing scripts
Scheduling for crawlers and data processing scripts
Serverless development and execution of scripts in an Apache Spark (2.x) environment

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

Amazon Redshift Spectrum
EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
Amazon Athena
AWS Glue scripts
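
As a rough illustration of how the Data Catalog described above can be read programmatically, the sketch below walks every database and its tables using boto3's Glue paginators. The helper name list_catalog is hypothetical, and the sketch assumes a boto3 Glue client is passed in with credentials and a region already configured; it is not the only way to query the catalog.

```python
from typing import Any, Dict, List


def list_catalog(glue_client: Any) -> Dict[str, List[str]]:
    """Return {database_name: [table_name, ...]} for a Glue Data Catalog.

    `glue_client` is assumed to be a boto3 Glue client (hypothetical helper,
    not part of AWS Glue itself).
    """
    catalog: Dict[str, List[str]] = {}
    # Paginate over databases, since a catalog may span several pages.
    database_pages = glue_client.get_paginator("get_databases").paginate()
    for database_page in database_pages:
        for database in database_page["DatabaseList"]:
            name = database["Name"]
            # Paginate over the tables registered in this database.
            table_pages = glue_client.get_paginator("get_tables").paginate(
                DatabaseName=name
            )
            catalog[name] = [
                table["Name"]
                for table_page in table_pages
                for table in table_page["TableList"]
            ]
    return catalog


# Usage against a real account (assumes boto3 is installed and configured):
#   import boto3
#   print(list_catalog(boto3.client("glue")))
```

Passing the client in as an argument keeps the helper independent of how credentials are set up and makes it easy to exercise with a stub.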

4003 questions

4 answers

Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0

When switching from Glue 2.0 to 3.0, which means also switching from Spark 2.4 to 3.1.1, my jobs start to fail when processing timestamps prior to 1900 with this error: An error occurred while calling…

asked Aug 23 '21 at 10:51

Robert Kossendey

6 answers

AWS Glue executor memory limit

I found that AWS Glue sets up the executor instance with a 5 GB memory limit (--conf spark.executor.memory=5g), and sometimes, on big datasets, it fails with java.lang.OutOfMemoryError. The same goes for the driver instance (--spark.driver.memory=5g). Is there…

amazon-web-services apache-spark aws-glue

asked Feb 28 '18 at 16:21

Alexey Bakulin

1 answer

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my script submits 16 CTAS queries each of which takes…

concurrency limit amazon-emr amazon-athena aws-glue

asked Jul 22 '19 at 12:22

Ilya Kisil

4 answers

Add a partition on glue table via API on AWS?

I have an S3 bucket that is constantly being filled with new data, and I am using Athena and Glue to query it. The problem is that if Glue doesn't know a new partition has been created, it doesn't know it needs to search there. If I make an API…

amazon-web-services amazon-s3 amazon-athena aws-glue

asked Jun 01 '18 at 08:08

Gudzo

2 answers

AWS Glue issue with double quote and commas

I have this CSV file: reference,address V7T452F4H9,"12410 W 62TH ST, AA D" The following options are being used in the table definition ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'quoteChar'='\"', …

hadoop hive presto amazon-athena aws-glue

asked May 15 '18 at 15:35

ln9187

4 answers

AWS Glue pricing against AWS EMR

I am comparing the pricing of AWS Glue and AWS EMR in order to choose between them. I have considered 6 DPUs (4 vCPUs + 16 GB memory) with an ETL job running for 10 minutes for 30 days. The expected number of crawler requests is assumed to be 1…

amazon-web-services amazon-emr aws-glue cost-management

asked Feb 07 '18 at 11:32

Yuva

8 answers

Optional job parameter in AWS Glue?

How can I implement an optional parameter for an AWS Glue job? I have created a job that currently has a string parameter (an ISO 8601 date string) as an input that is used in the ETL job. I would like to make this parameter optional, so that the…

python amazon-web-services aws-glue

asked Sep 04 '18 at 08:27

matsev

1 answer

Spark dynamic frame show method yields nothing

I am using AWS Glue auto-generated code to read a CSV file from S3 and write it to a table over a JDBC connection. It seems simple: the job runs successfully with no errors, but it writes nothing. When I checked the Glue Spark Dynamic Frame it does contents…

python pyspark apache-spark-sql aws-glue

asked May 06 '19 at 22:51

PyRaider

4 answers

How to list all databases and tables in AWS Glue Catalog?

I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console. How can I access the catalog and list all databases and tables? The usual sqlContext.sql("show tables").show() does…

apache-spark-sql aws-glue

asked Sep 06 '17 at 16:45

Jiří Mauritz

5 answers

How to Convert Many CSV files to Parquet using AWS Glue

I'm using AWS S3, Glue, and Athena with the following setup: S3 --> Glue --> Athena My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data. Since I'm using Athena, I'd like to convert the CSV files…

amazon-s3 parquet amazon-athena aws-glue

asked Apr 23 '18 at 16:54

mark s.

13 answers

Use AWS Glue Python with NumPy and Pandas Python Packages

What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a completed script within Python I would like to run in AWS Glue that utilizes NumPy and Pandas.

python pandas amazon-web-services aws-lambda aws-glue

asked Sep 20 '17 at 18:42

jumpman23

2 answers

convert spark dataframe to aws glue dynamic frame

I tried converting my Spark DataFrames to dynamic frames to output as glueparquet files, but I'm getting the error "'DataFrame' object has no attribute 'fromDF'". My code relies heavily on Spark DataFrames. Is there a way to convert from spark dataframe to…

apache-spark pyspark aws-glue

asked Nov 24 '19 at 04:25

user3476463

4 answers

AWS Glue cannot create database from crawler: permission denied

I am trying to use an AWS Glue crawler on an S3 bucket to populate a Glue database. I run the Create Crawler wizard, select my datasource (the S3 bucket with the avro files), have it create the IAM role, and run it, and I get the following…

amazon-web-services amazon-athena aws-glue

asked Aug 20 '19 at 20:54

mhamrah

6 answers

AWS Glue Crawler adding tables for every partition?

I have several thousand files in an S3 bucket in this form:
├── bucket
│   ├── somedata
│   │   ├── year=2016
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── sometype-2017-11-01.parquet
│   │   │   │   ├──…

amazon-web-services parquet aws-glue

asked Jan 22 '18 at 00:10

chazzmoney

9 answers

AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3

Part One: I ran a Glue crawler on a dummy CSV loaded in S3. It created a table, but when I try to view the table in Athena and query it, it shows zero records returned. The ELB demo data in Athena works fine, however. Part Two (Scenario): Suppose I have a…

amazon-web-services csv amazon-redshift amazon-athena aws-glue

asked Nov 13 '17 at 14:41

Kush Vyas
