Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework that extends PySpark with functionality for increased schema flexibility (a sketch of such a script follows this list)
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment
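
Taken together, a generated job script wires these pieces into a few lines of PySpark. Below is a minimal sketch of the shape such a script takes; the database, table, and bucket names are placeholders, not real resources:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

# Job bootstrap: Glue passes --JOB_NAME (among others) on the command line
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read a table that a crawler registered in the Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database='my_database', table_name='my_table')

# DynamicFrames tolerate messy schemas; ApplyMapping renames and casts fields
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[('id', 'string', 'id', 'long'),
              ('created', 'string', 'created', 'timestamp')])

# Write the result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped, connection_type='s3',
    connection_options={'path': 's3://my-bucket/output/'},
    format='parquet')

job.commit()
```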

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
4003 questions

12 votes · 4 answers

How to overcome Spark "No Space left on the device" error in AWS Glue Job

I used an AWS Glue job with PySpark to read more than 10 TB of data from Parquet files on S3, but the job kept failing during execution of a Spark SQL query with the error java.io.IOException: No space left on device. On…
Vigneshwaran · 782 · 2 · 7 · 22
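
The usual mitigations are to spread the shuffle over more (or larger) workers and to raise the shuffle partition count so each task spills less to local disk. A sketch using boto3; the job name is a placeholder, and passing Spark settings via --conf is widely used but not officially supported, so verify it on your Glue version:

```python
import boto3

glue = boto3.client('glue')
glue.start_job_run(
    JobName='parquet-10tb-job',   # placeholder
    WorkerType='G.2X',            # more memory and local disk per worker
    NumberOfWorkers=40,           # spread shuffle spill across more disks
    Arguments={'--conf': 'spark.sql.shuffle.partitions=2000'},
)
```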

12 votes · 4 answers

AWS Glue job consuming data from external REST API

I'm trying to create a workflow where an AWS Glue ETL job will pull JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help!
deorst · 169 · 1 · 1 · 9
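
It is possible, because a Glue job is ultimately just Python: nothing stops the driver from calling the API directly and handing the records to Spark. A minimal sketch; the URL is a placeholder, and the requests module either ships with your Glue runtime or can be added via --additional-python-modules:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

resp = requests.get('https://api.example.com/events', timeout=30)
resp.raise_for_status()
records = resp.json()   # assumes the endpoint returns a JSON array of objects

# Let Spark infer a schema from the fetched records, then persist to S3
df = spark.createDataFrame(records)
df.write.mode('append').parquet('s3://my-bucket/events/')
```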

12 votes · 3 answers

How can I use an external python library in AWS Glue?

First Stack Overflow question here, hope I do this correctly: I need to use an external Python library, openpyxl, in AWS Glue. I follow these directions:…
Marlon Holland · 131 · 1 · 1 · 4
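
On Glue 2.0 and later, pure-Python packages such as openpyxl can be pip-installed at job start through the --additional-python-modules default argument; on older versions, the documented route is a wheel or egg on S3 passed via --extra-py-files. A sketch with a placeholder job name:

```python
import boto3

glue = boto3.client('glue')
glue.start_job_run(
    JobName='excel-ingest-job',   # placeholder
    Arguments={'--additional-python-modules': 'openpyxl==3.1.2'},
)
```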

12 votes · 2 answers

HIVE_PARTITION_SCHEMA_MISMATCH

I'm getting this error from AWS Athena: HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'id' in table 'db.app_events' is declared as type…
Burak · 5,706 · 20 · 70 · 110
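
A common cause is a crawler that stored per-partition schemas which drifted from the table schema. One fix, sketched below with a placeholder crawler name, is to configure the crawler so partitions inherit the table-level schema:

```python
import json

import boto3

glue = boto3.client('glue')
glue.update_crawler(
    Name='app-events-crawler',   # placeholder
    Configuration=json.dumps({
        'Version': 1.0,
        'CrawlerOutput': {
            'Partitions': {'AddOrUpdateBehavior': 'InheritFromTable'},
        },
    }),
)
```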

12 votes · 2 answers

How to solve this HIVE_PARTITION_SCHEMA_MISMATCH?

I have partitioned data in CSV files on S3: s3://bucket/dataset/p=1/*.csv (partition #1) ... s3://bucket/dataset/p=100/*.csv (partition #100) I run a classifier over s3://bucket/dataset/ and the result looks very promising, as it detects 150…
Raffael · 19,547 · 15 · 82 · 160
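
If stale per-partition schemas are already in the catalog, re-crawling with InheritFromTable (previous sketch) fixes future runs but not the existing metadata. One option, with placeholder names, is to delete the offending partition entries and re-crawl:

```python
import boto3

glue = boto3.client('glue')
glue.batch_delete_partition(
    DatabaseName='mydb',     # placeholder
    TableName='dataset',     # placeholder
    PartitionsToDelete=[{'Values': ['1']}, {'Values': ['2']}],  # p=1, p=2, ...
)
```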

12 votes · 9 answers

AWS Glue Crawler Cannot Extract CSV Headers

At my wit's end here... I have 15 CSV files that I am generating from a beeline query like: beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv I chose dsv because some string fields include commas and they are not quoted, which…
Mac · 1,143 · 6 · 21 · 45
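
The built-in CSV classifier only infers a header when the first row looks sufficiently different from the data rows, and unquoted string columns often defeat it. A custom classifier can state the delimiter and header outright; a sketch where the name and columns are placeholders, after which the classifier is attached to the crawler via its Classifiers list:

```python
import boto3

glue = boto3.client('glue')
glue.create_classifier(
    CsvClassifier={
        'Name': 'beeline-dsv',
        'Delimiter': '|',                         # beeline's dsv output
        'ContainsHeader': 'PRESENT',
        'Header': ['user_id', 'event', 'ts'],     # optional: fix the column names
    }
)
```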

12 votes · 2 answers

AWS Glue: crawler misinterprets timestamps as strings. GLUE ETL meant to convert strings to timestamps makes them NULL

I have been playing around with AWS Glue for some quick analytics by following the tutorial here. While I have been able to successfully create crawlers and discover data in Athena, I've had issues with the data types created by the crawler. The…
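
The usual culprit is that a bare cast to timestamp returns NULL whenever the string does not match Spark's default format; passing the actual pattern to to_timestamp fixes it. A self-contained sketch (the column name and format are examples):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('03/27/2019 10:15:00',)], ['event_time'])

# Spell out the source format instead of relying on the default parser
df = df.withColumn('event_time_ts',
                   F.to_timestamp('event_time', 'MM/dd/yyyy HH:mm:ss'))
df.show(truncate=False)
```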

12 votes · 2 answers

AWS Glue: get job_id from within the script using pyspark

I am trying to access the AWS ETL Glue job id from the script of that job. This is the RunID that you can see in the first column in the AWS Glue Console, something like jr_5fc6d4ecf0248150067f2. How do I get it programmatically with pyspark?
Zeitgeist · 1,382 · 2 · 16 · 26
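
Two approaches are commonly suggested, neither prominently documented, so verify them on your Glue version: recent runtimes pass --JOB_RUN_ID in sys.argv, and otherwise the newest run for the job can be fetched from the Glue API:

```python
import sys

import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

if '--JOB_RUN_ID' in sys.argv:
    # Reported to be present on recent Glue runtimes
    job_run_id = sys.argv[sys.argv.index('--JOB_RUN_ID') + 1]
else:
    # Fallback: ask the API for this job's most recent run
    glue = boto3.client('glue')
    runs = glue.get_job_runs(JobName=args['JOB_NAME'], MaxResults=1)
    job_run_id = runs['JobRuns'][0]['Id']

print('running as', job_run_id)
```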

12 votes · 2 answers

AWS Glue write parquet with partitions

I am able to write in Parquet format, partitioned by a column, like so: jobname = args['JOB_NAME'] # header is a Spark DataFrame header.repartition(1).write.parquet('s3://bucket/aws-glue/{}/header/'.format(jobname), 'append',…
stewart99 · 14,024 · 7 · 27 · 42
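
With Glue's native writer, the same effect comes from the partitionKeys connection option, which lays out Hive-style key=value folders under the target path. A sketch with placeholder data and paths:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

header_df = spark.createDataFrame([(1, 'csv'), (2, 'json')], ['id', 'file_type'])
dyf = DynamicFrame.fromDF(header_df, glueContext, 'header')

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type='s3',
    connection_options={'path': 's3://bucket/aws-glue/header/',
                        'partitionKeys': ['file_type']},  # -> .../file_type=csv/
    format='parquet')
```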

12 votes · 4 answers

How to move data from Glue to DynamoDB

We are designing a big data solution for one of our dashboard applications and are seriously considering Glue for our initial ETL. Currently Glue supports JDBC and S3 as targets, but our downstream services and components will work better with…
Robby · 371 · 2 · 3 · 15
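
Newer Glue versions can write to DynamoDB natively through the dynamodb connection type; on versions without it, a boto3 batch_writer inside foreachPartition does the same job. A sketch, with the table name a placeholder:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

dyf = DynamicFrame.fromDF(
    spark.createDataFrame([(1, 'a')], ['pk', 'value']), glueContext, 'out')

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type='dynamodb',
    connection_options={
        'dynamodb.output.tableName': 'dashboard-table',   # placeholder
        'dynamodb.throughput.write.percent': '0.5',       # cap consumed capacity
    })
```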

12 votes · 4 answers

Event-based trigger of AWS Glue Crawler after a file is uploaded into an S3 bucket?

Is it possible to trigger an AWS Glue crawler on new files that get uploaded into an S3 bucket, given that the crawler is "pointed" at that bucket? In other words: a file upload generates an event that causes the AWS Glue crawler to analyse it. I know…
BoIde · 306 · 1 · 3 · 16
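
Glue triggers cannot listen to S3 events directly, so the standard pattern is an s3:ObjectCreated:* notification that invokes a small Lambda, which starts the crawler. A sketch of the handler; the crawler name is a placeholder:

```python
import boto3

glue = boto3.client('glue')

def handler(event, context):
    try:
        glue.start_crawler(Name='dataset-crawler')   # placeholder
    except glue.exceptions.CrawlerRunningException:
        # A crawl is already in flight; it will pick up the new object
        pass
```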

12 votes · 2 answers

CloudFormation: a way to define an ACTIVATED scheduled Glue job trigger

I'm using CloudFormation to define a SCHEDULED Glue job trigger according to the official documentation: ParquetJobTrigger: Type: 'AWS::Glue::Trigger' Properties: Name: !Sub "${Prefix}_csv_to_parquet_job_trigger_${StageName}" Type:…
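
The underlying Glue API exposes a StartOnCreation flag for exactly this, and CloudFormation's AWS::Glue::Trigger resource surfaces the same property, so the trigger can be created already ACTIVATED. For reference, the boto3 equivalent, with placeholder names:

```python
import boto3

glue = boto3.client('glue')
glue.create_trigger(
    Name='csv_to_parquet_job_trigger',        # placeholder
    Type='SCHEDULED',
    Schedule='cron(0 12 * * ? *)',            # every day at 12:00 UTC
    Actions=[{'JobName': 'csv_to_parquet'}],  # placeholder job
    StartOnCreation=True,                     # no separate activation step
)
```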

11 votes · 2 answers

AWS Glue vs EMR Serverless

Recently, AWS announced Amazon EMR Serverless (Preview) https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/ - a new, very promising service. From my understanding, AWS…
alexanoid · 24,051 · 54 · 210 · 410

11 votes · 2 answers

How to connect AWS Glue to a VPC, and access private resources?

I am trying to connect to services and databases running inside a VPC (private subnets) from an AWS Glue job. The private resources should not be exposed publicly (e.g., moving to a public subnet or setting up public load balancers). Unfortunately,…
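
The supported route is a Glue connection of type NETWORK attached to the job: Glue then launches the job's network interfaces inside the chosen subnet, so workers can reach private resources without exposing anything publicly. A sketch; every identifier below is a placeholder:

```python
import boto3

glue = boto3.client('glue')
glue.create_connection(
    ConnectionInput={
        'Name': 'vpc-access',
        'ConnectionType': 'NETWORK',
        'ConnectionProperties': {},
        'PhysicalConnectionRequirements': {
            'SubnetId': 'subnet-0123456789abcdef0',
            'SecurityGroupIdList': ['sg-0123456789abcdef0'],
            'AvailabilityZone': 'us-east-1a',
        },
    })
# Then attach it to the job, e.g. Connections={'Connections': ['vpc-access']}
# in create_job / update_job.
```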

11 votes · 2 answers

Parquet column cannot be converted in file, Expected: bigint, Found: INT32

I have a Glue table with a column tlc whose datatype is bigint. I am trying to do the following using PySpark: read the Glue table into a DataFrame, join with another table, and write the resulting DataFrame to an S3 path. My code looks…
Gunjan Khandelwal · 179 · 1 · 2 · 13
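
The error means the files' physical Parquet type (INT32) disagrees with the catalog's bigint (64-bit). One fix is to cast the column explicitly before writing, so new files carry INT64 and match the table schema. A self-contained sketch using the column from the question; the data and output path are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], 'tlc: int')   # stand-in for the Glue table

# Cast to long (Spark's 64-bit int) so Parquet writes INT64, matching bigint
df = df.withColumn('tlc', F.col('tlc').cast('long'))
df.write.mode('overwrite').parquet('s3://my-bucket/output/')
```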