Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework that extends PySpark with functionality for increased schema flexibility (a sketch of such a script follows this list)
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment
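
Taken together, a generated job script wires these pieces into a few lines of PySpark. Below is a minimal sketch of the shape such a script takes; the database, table, and bucket names are placeholders, not real resources:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

# Job bootstrap: Glue passes --JOB_NAME (among others) on the command line
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read a table that a crawler registered in the Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database='my_database', table_name='my_table')

# DynamicFrames tolerate messy schemas; ApplyMapping renames and casts fields
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[('id', 'string', 'id', 'long'),
              ('created', 'string', 'created', 'timestamp')])

# Write the result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped, connection_type='s3',
    connection_options={'path': 's3://my-bucket/output/'},
    format='parquet')

job.commit()
```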

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
4003 questions

12 votes · 4 answers

How to overcome Spark "No Space left on the device" error in AWS Glue Job

I used an AWS Glue job with PySpark to read more than 10 TB of data from Parquet files on S3, but the job kept failing during execution of a Spark SQL query with the error java.io.IOException: No space left on device. On…
Vigneshwaran · 782 · 2 · 7 · 22
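
The usual mitigations are to spread the shuffle over more (or larger) workers and to raise the shuffle partition count so each task spills less to local disk. A sketch using boto3; the job name is a placeholder, and passing Spark settings via --conf is widely used but not officially supported, so verify it on your Glue version:

```python
import boto3

glue = boto3.client('glue')
glue.start_job_run(
    JobName='parquet-10tb-job',   # placeholder
    WorkerType='G.2X',            # more memory and local disk per worker
    NumberOfWorkers=40,           # spread shuffle spill across more disks
    Arguments={'--conf': 'spark.sql.shuffle.partitions=2000'},
)
```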

12 votes · 4 answers

AWS Glue job consuming data from external REST API

I'm trying to create a workflow where an AWS Glue ETL job will pull JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help!
deorst · 169 · 1 · 1 · 9
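
It is possible, because a Glue job is ultimately just Python: nothing stops the driver from calling the API directly and handing the records to Spark. A minimal sketch; the URL is a placeholder, and the requests module either ships with your Glue runtime or can be added via --additional-python-modules:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

resp = requests.get('https://api.example.com/events', timeout=30)
resp.raise_for_status()
records = resp.json()   # assumes the endpoint returns a JSON array of objects

# Let Spark infer a schema from the fetched records, then persist to S3
df = spark.createDataFrame(records)
df.write.mode('append').parquet('s3://my-bucket/events/')
```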

12 votes · 3 answers

How can I use an external python library in AWS Glue?

First Stack Overflow question here, hope I do this correctly: I need to use an external Python library, openpyxl, in AWS Glue. I follow these directions:…
Marlon Holland · 131 · 1 · 1 · 4
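
On Glue 2.0 and later, pure-Python packages such as openpyxl can be pip-installed at job start through the --additional-python-modules default argument; on older versions, the documented route is a wheel or egg on S3 passed via --extra-py-files. A sketch with a placeholder job name:

```python
import boto3

glue = boto3.client('glue')
glue.start_job_run(
    JobName='excel-ingest-job',   # placeholder
    Arguments={'--additional-python-modules': 'openpyxl==3.1.2'},
)
```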

12 votes · 2 answers

HIVE_PARTITION_SCHEMA_MISMATCH

I'm getting this error from AWS Athena: HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'id' in table 'db.app_events' is declared as type…
Burak · 5,706 · 20 · 70 · 110
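
A common cause is a crawler that stored per-partition schemas which drifted from the table schema. One fix, sketched below with a placeholder crawler name, is to configure the crawler so partitions inherit the table-level schema:

```python
import json

import boto3

glue = boto3.client('glue')
glue.update_crawler(
    Name='app-events-crawler',   # placeholder
    Configuration=json.dumps({
        'Version': 1.0,
        'CrawlerOutput': {
            'Partitions': {'AddOrUpdateBehavior': 'InheritFromTable'},
        },
    }),
)
```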

12 votes · 2 answers

How to solve this HIVE_PARTITION_SCHEMA_MISMATCH?

I have partitioned data in CSV files on S3: s3://bucket/dataset/p=1/*.csv (partition #1) ... s3://bucket/dataset/p=100/*.csv (partition #100) I run a classifier over s3://bucket/dataset/ and the result looks very promising, as it detects 150…
Raffael · 19,547 · 15 · 82 · 160
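
If stale per-partition schemas are already in the catalog, re-crawling with InheritFromTable (previous sketch) fixes future runs but not the existing metadata. One option, with placeholder names, is to delete the offending partition entries and re-crawl:

```python
import boto3

glue = boto3.client('glue')
glue.batch_delete_partition(
    DatabaseName='mydb',     # placeholder
    TableName='dataset',     # placeholder
    PartitionsToDelete=[{'Values': ['1']}, {'Values': ['2']}],  # p=1, p=2, ...
)
```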

12 votes · 9 answers

AWS Glue Crawler Cannot Extract CSV Headers

At my wit's end here... I have 15 CSV files that I am generating from a beeline query like: beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv I chose dsv because some string fields include commas and they are not quoted, which…
Mac · 1,143 · 6 · 21 · 45
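
The built-in CSV classifier only infers a header when the first row looks sufficiently different from the data rows, and unquoted string columns often defeat it. A custom classifier can state the delimiter and header outright; a sketch where the name and columns are placeholders, after which the classifier is attached to the crawler via its Classifiers list:

```python
import boto3

glue = boto3.client('glue')
glue.create_classifier(
    CsvClassifier={
        'Name': 'beeline-dsv',
        'Delimiter': '|',                         # beeline's dsv output
        'ContainsHeader': 'PRESENT',
        'Header': ['user_id', 'event', 'ts'],     # optional: fix the column names
    }
)
```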

12 votes · 2 answers

AWS Glue: crawler misinterprets timestamps as strings. GLUE ETL meant to convert strings to timestamps makes them NULL

I have been playing around with AWS Glue for some quick analytics by following the tutorial here. While I have been able to successfully create crawlers and discover data in Athena, I've had issues with the data types created by the crawler. The…
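
The usual culprit is that a bare cast to timestamp returns NULL whenever the string does not match Spark's default format; passing the actual pattern to to_timestamp fixes it. A self-contained sketch (the column name and format are examples):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('03/27/2019 10:15:00',)], ['event_time'])

# Spell out the source format instead of relying on the default parser
df = df.withColumn('event_time_ts',
                   F.to_timestamp('event_time', 'MM/dd/yyyy HH:mm:ss'))
df.show(truncate=False)
```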

12 votes · 2 answers

AWS Glue: get job_id from within the script using pyspark

I am trying to access the AWS ETL Glue job id from the script of that job. This is the RunID that you can see in the first column in the AWS Glue Console, something like jr_5fc6d4ecf0248150067f2. How do I get it programmatically with pyspark?
Zeitgeist · 1,382 · 2 · 16 · 26
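
Two approaches are commonly suggested, neither prominently documented, so verify them on your Glue version: recent runtimes pass --JOB_RUN_ID in sys.argv, and otherwise the newest run for the job can be fetched from the Glue API:

```python
import sys

import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

if '--JOB_RUN_ID' in sys.argv:
    # Reported to be present on recent Glue runtimes
    job_run_id = sys.argv[sys.argv.index('--JOB_RUN_ID') + 1]
else:
    # Fallback: ask the API for this job's most recent run
    glue = boto3.client('glue')
    runs = glue.get_job_runs(JobName=args['JOB_NAME'], MaxResults=1)
    job_run_id = runs['JobRuns'][0]['Id']

print('running as', job_run_id)
```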

12 votes · 2 answers

AWS Glue write parquet with partitions

I am able to write in Parquet format, partitioned by a column, like so: jobname = args['JOB_NAME'] # header is a Spark DataFrame header.repartition(1).write.parquet('s3://bucket/aws-glue/{}/header/'.format(jobname), 'append',…
stewart99 · 14,024 · 7 · 27 · 42
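
With Glue's native writer, the same effect comes from the partitionKeys connection option, which lays out Hive-style key=value folders under the target path. A sketch with placeholder data and paths:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

header_df = spark.createDataFrame([(1, 'csv'), (2, 'json')], ['id', 'file_type'])
dyf = DynamicFrame.fromDF(header_df, glueContext, 'header')

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type='s3',
    connection_options={'path': 's3://bucket/aws-glue/header/',
                        'partitionKeys': ['file_type']},  # -> .../file_type=csv/
    format='parquet')
```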

12 votes · 4 answers

How to move data from Glue to DynamoDB

We are designing a big data solution for one of our dashboard applications and are seriously considering Glue for our initial ETL. Currently Glue supports JDBC and S3 as targets, but our downstream services and components will work better with…
Robby · 371 · 2 · 3 · 15
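
Newer Glue versions can write to DynamoDB natively through the dynamodb connection type; on versions without it, a boto3 batch_writer inside foreachPartition does the same job. A sketch, with the table name a placeholder:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

dyf = DynamicFrame.fromDF(
    spark.createDataFrame([(1, 'a')], ['pk', 'value']), glueContext, 'out')

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type='dynamodb',
    connection_options={
        'dynamodb.output.tableName': 'dashboard-table',   # placeholder
        'dynamodb.throughput.write.percent': '0.5',       # cap consumed capacity
    })
```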

12 votes · 4 answers

Event-based trigger of AWS Glue Crawler after a file is uploaded into an S3 bucket?

Is it possible to trigger an AWS Glue crawler on new files that get uploaded into an S3 bucket, given that the crawler is "pointed" at that bucket? In other words: a file upload generates an event that causes the AWS Glue crawler to analyse it. I know…
BoIde · 306 · 1 · 3 · 16
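
Glue triggers cannot listen to S3 events directly, so the standard pattern is an s3:ObjectCreated:* notification that invokes a small Lambda, which starts the crawler. A sketch of the handler; the crawler name is a placeholder:

```python
import boto3

glue = boto3.client('glue')

def handler(event, context):
    try:
        glue.start_crawler(Name='dataset-crawler')   # placeholder
    except glue.exceptions.CrawlerRunningException:
        # A crawl is already in flight; it will pick up the new object
        pass
```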

12 votes · 2 answers

CloudFormation: a way to define an ACTIVATED scheduled Glue job trigger

I'm using CloudFormation to define a SCHEDULED Glue job trigger according to the official documentation: ParquetJobTrigger: Type: 'AWS::Glue::Trigger' Properties: Name: !Sub "${Prefix}_csv_to_parquet_job_trigger_${StageName}" Type:…
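
The underlying Glue API exposes a StartOnCreation flag for exactly this, and CloudFormation's AWS::Glue::Trigger resource surfaces the same property, so the trigger can be created already ACTIVATED. For reference, the boto3 equivalent, with placeholder names:

```python
import boto3

glue = boto3.client('glue')
glue.create_trigger(
    Name='csv_to_parquet_job_trigger',        # placeholder
    Type='SCHEDULED',
    Schedule='cron(0 12 * * ? *)',            # every day at 12:00 UTC
    Actions=[{'JobName': 'csv_to_parquet'}],  # placeholder job
    StartOnCreation=True,                     # no separate activation step
)
```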

11 votes · 2 answers

AWS Glue vs EMR Serverless

Recently, AWS announced Amazon EMR Serverless (Preview) https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/ - a new, very promising service. From my understanding, AWS…
alexanoid · 24,051 · 54 · 210 · 410

11 votes · 2 answers

How to connect AWS Glue to a VPC, and access private resources?

I am trying to connect to services and databases running inside a VPC (private subnets) from an AWS Glue job. The private resources should not be exposed publicly (e.g., moving to a public subnet or setting up public load balancers). Unfortunately,…
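
The supported route is a Glue connection of type NETWORK attached to the job: Glue then launches the job's network interfaces inside the chosen subnet, so workers can reach private resources without exposing anything publicly. A sketch; every identifier below is a placeholder:

```python
import boto3

glue = boto3.client('glue')
glue.create_connection(
    ConnectionInput={
        'Name': 'vpc-access',
        'ConnectionType': 'NETWORK',
        'ConnectionProperties': {},
        'PhysicalConnectionRequirements': {
            'SubnetId': 'subnet-0123456789abcdef0',
            'SecurityGroupIdList': ['sg-0123456789abcdef0'],
            'AvailabilityZone': 'us-east-1a',
        },
    })
# Then attach it to the job, e.g. Connections={'Connections': ['vpc-access']}
# in create_job / update_job.
```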

11 votes · 2 answers

Parquet column cannot be converted in file, Expected: bigint, Found: INT32

I have a Glue table with a column tlc whose datatype is bigint. I am trying to do the following using PySpark: read the Glue table into a DataFrame, join with another table, and write the resulting DataFrame to an S3 path. My code looks…
Gunjan Khandelwal · 179 · 1 · 2 · 13
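
The error means the files' physical Parquet type (INT32) disagrees with the catalog's bigint (64-bit). One fix is to cast the column explicitly before writing, so new files carry INT64 and match the table schema. A self-contained sketch using the column from the question; the data and output path are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], 'tlc: int')   # stand-in for the Glue table

# Cast to long (Spark's 64-bit int) so Parquet writes INT64, matching bigint
df = df.withColumn('tlc', F.col('tlc').cast('long'))
df.write.mode('overwrite').parquet('s3://my-bucket/output/')
```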