Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework that extends PySpark with functionality for increased schema flexibility (a minimal job script illustrating this is sketched after this list)
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment
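
A minimal sketch of what such a job script looks like, using the standard awsglue libraries; the database, table, and bucket names are placeholders of my own, not anything from the questions below.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table that a crawler registered in the Data Catalog
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="example_table")

    # DynamicFrames tolerate schema drift; apply_mapping pins the columns we need
    mapped = dyf.apply_mapping([("id", "string", "id", "string"),
                                ("amount", "string", "amount", "double")])

    # Write the result back to S3 as Parquet
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/output/"},
        format="parquet")
    job.commit()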

Data registered in the AWS Glue Data Catalog is available to many AWS services (an Athena example is sketched after this list), including:

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
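
For instance, a table registered in the Data Catalog can be queried from Athena without copying the data anywhere else. A short boto3 sketch, with hypothetical database, table, and bucket names:

    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM example_table",
        QueryExecutionContext={"Database": "example_db"},  # Glue Data Catalog database
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    print(response["QueryExecutionId"])
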
4003 questions
1 vote, 1 answer

Delete AWS Glue Crawler After Completion

I have a use case in which I need to create an AWS Glue Crawler to crawl some data stored in S3, start the crawler, then delete the crawler after it has finished crawling the data. The dilemma I've run into is that the crawler can take a significant…
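
One way to sketch that create/start/wait/delete pattern with boto3; the crawler name, role ARN, database, and S3 path are hypothetical.

    import time
    import boto3

    glue = boto3.client("glue")
    crawler_name = "temp-crawler"

    glue.create_crawler(
        Name=crawler_name,
        Role="arn:aws:iam::123456789012:role/ExampleGlueRole",
        DatabaseName="example_db",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/"}]},
    )
    glue.start_crawler(Name=crawler_name)

    # Crawls can run for a long time, so poll until the crawler is idle again
    while True:
        time.sleep(30)
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # back to READY means the run has finished
            break

    glue.delete_crawler(Name=crawler_name)
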
1 vote, 1 answer

how to create a glue search script in python

So I have been asked to write a Python script that pulls out all the Glue databases in our AWS account and then lists all the tables and partitions in each database to a CSV file. It's acceptable for it to just run on a desktop for now, would really…
FreddieBL • 17 • 1 • 6
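
A minimal sketch of such a script with boto3 paginators and the standard csv module; the output file name and column layout are my own assumptions.

    import csv
    import boto3

    glue = boto3.client("glue")

    with open("glue_catalog.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["database", "table", "partition_values"])
        for db_page in glue.get_paginator("get_databases").paginate():
            for db in db_page["DatabaseList"]:
                for tbl_page in glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"]):
                    for tbl in tbl_page["TableList"]:
                        pages = glue.get_paginator("get_partitions").paginate(
                            DatabaseName=db["Name"], TableName=tbl["Name"])
                        values_list = [p["Values"] for page in pages for p in page["Partitions"]]
                        if not values_list:
                            writer.writerow([db["Name"], tbl["Name"], ""])
                        for values in values_list:
                            writer.writerow([db["Name"], tbl["Name"], "/".join(values)])
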
1 vote, 1 answer

Not able to join and query two tables in an AWS Glue job script

So, I have created an AWS Glue job script in which I have added two data sources and am converting them from DynamicFrames to DataFrames. My aim is to query the two tables using an inner join, but I am unable to do that. The job is failing at the…
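
For reference, the general shape of that approach; database, table, and key names below are hypothetical.

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    orders = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="orders").toDF()
    customers = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="customers").toDF()

    # Plain Spark inner join on the shared key column
    joined = orders.join(customers, on="customer_id", how="inner")
    joined.show(10)
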
1 vote, 1 answer

Unable to access csv file generated by a jar file in AWS Glue

This is my first question here! So we're working on some MDM-related stuff wherein we need to run a jar file provided by our MDM partner to merge the records. We are able to use the subprocess module in our AWS Glue script to run the jar file.…
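
A rough sketch of that pattern, with hypothetical jar, file, and bucket names: the jar can only write to local scratch space on the Glue worker (such as /tmp), so anything it produces has to be copied somewhere durable before the job ends.

    import subprocess
    import boto3

    # Run the vendor jar and let it write its CSV output to local disk
    subprocess.run(
        ["java", "-jar", "/tmp/mdm-merge.jar", "--output", "/tmp/merged.csv"],
        check=True)

    # Copy the local output to S3 so it survives the job run
    boto3.client("s3").upload_file("/tmp/merged.csv", "example-bucket", "mdm/merged.csv")
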
1 vote, 1 answer

HadoopDataSource: Skipping Partition {} as no new files detected @ s3:

So, I have an S3 folder with several subfolders acting as partitions (based on the date of creation). I have a Glue Table for those partitions and can see the data using Athena. Running a Glue job and trying to access the Catalog, I get the following…
smjm • 11 • 3
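
One common cause of that message is job bookmarks: when bookmarks are enabled and the read carries a transformation_ctx, partitions with no new files since the last run are skipped. A hedged sketch (names hypothetical) contrasting the two read styles:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Bookmarked read: previously seen files are skipped on later runs
    incremental = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="events", transformation_ctx="events_src")

    # Full read: omit transformation_ctx so the job bookmark does not apply
    full = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="events")
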
1 vote, 0 answers

AWS Glue: Can I add another column with matching percentage?

I am new to AWS and Python. I have the below AWS Python Spark code to perform fuzzy matching. All I need is an additional column with the matching percentage. import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions from…
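
A sketch of one way to add such a column, using difflib from the standard library rather than whatever matcher the original code uses; column names are hypothetical.

    from difflib import SequenceMatcher

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()

    @F.udf(DoubleType())
    def match_pct(a, b):
        # Similarity ratio as a percentage; nulls score zero
        if a is None or b is None:
            return 0.0
        return round(SequenceMatcher(None, a, b).ratio() * 100, 2)

    df = spark.createDataFrame(
        [("Acme Corp", "ACME Corporation"), ("Globex", "Globex LLC")],
        ["name_a", "name_b"])
    df.withColumn("match_percentage", match_pct("name_a", "name_b")).show()
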
1 vote, 1 answer

How to import a Glue job from one module to another module in Terraform?

I have a Glue job resource defined in module A; now I want to import it and use the job name in module B. How can I achieve this? I tried something like this in module B: variable "example_glue_name" { type = string } data "aws_glue_job"…
wawawa • 2,835 • 6 • 44 • 105
1 vote, 1 answer

Connect to AWS Redshift using awswrangler

import awswrangler as wr con = wr.redshift.connect("MY_GLUE_CONNECTION") What would be the value of "MY_GLUE_CONNECTION"?
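
The string passed to wr.redshift.connect is the name of a Glue connection (Data Catalog > Connections) that stores the Redshift endpoint and credentials. A short sketch with a hypothetical connection name:

    import awswrangler as wr

    con = wr.redshift.connect("my-redshift-glue-connection")  # Glue connection name
    df = wr.redshift.read_sql_query("SELECT 1 AS ok", con=con)
    con.close()
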
1 vote, 1 answer

How to convert GUID into integer in pyspark

Hi Stack Overflow fam: I am new to PySpark and trying to learn as much as I can. But for now, I want to convert GUIDs into integers in PySpark. I can currently run the following statement in SQL to convert GUIDs into an…
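
A hedged sketch of one way to do this in PySpark (not necessarily equivalent to the original SQL): strip the dashes, parse the hex digits, and fold the 128-bit value into a signed 64-bit range so it fits Spark's LongType.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()

    @F.udf(LongType())
    def guid_to_int(guid):
        if guid is None:
            return None
        # Truncate the 128-bit value so it fits a 64-bit LongType column
        return int(guid.replace("-", ""), 16) % (2 ** 63)

    df = spark.createDataFrame([("3f2504e0-4f89-11d3-9a0c-0305e82c3301",)], ["guid"])
    df.withColumn("guid_int", guid_to_int("guid")).show(truncate=False)
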
1 vote, 0 answers

springml spark salesforce - Records not found for this query

I use the spark-salesforce library com.springml:spark-salesforce_2.11:1.1.3 and it was working, but I started to get errors. It looks like Salesforce returns a dataframe with a column "Records not found for this query" and nulls. Do you know how to solve…
Mariusz K • 66 • 4
1 vote, 1 answer

AWS Glue pipeline with Terraform

We are working with AWS Glue as a pipeline tool for ETL at my company. So far, the pipelines were created manually via the console and I am now moving to Terraform for future pipelines as I believe IaC is the way to go. I have been trying to work on…
LazyEval • 769 • 1 • 8 • 22
1 vote, 3 answers

Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata integrated with the AWS Glue catalog so that I can read and write to them through…
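
For context, the setup being described looks roughly like this: on an EMR cluster configured to use the Glue Data Catalog as its Hive metastore, Glue tables can be queried directly with Spark SQL, including partition filters. Database, table, and partition names below are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("SHOW PARTITIONS example_db.events").show(truncate=False)
    spark.sql("""
        SELECT *
        FROM example_db.events
        WHERE year = '2021' AND month = '05'   -- partition columns from the catalog
    """).show(10)
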
1 vote, 1 answer

AWS Glue - getSink() is throwing "No such file or directory" right after glue_context.purge_s3_path

I am trying to purge a partition of a Glue catalog table and then recreate the partition using the getSink option (similar to truncating and reloading a partition in a database). For purging the partition, I am using the glueContext.purge_s3_path option with retention…
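
A hedged sketch of that purge-then-rewrite pattern; the paths, names, and partition keys are hypothetical, and retentionPeriod=0 means "purge everything regardless of age".

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # 1. Purge the S3 prefix behind the partition
    glue_context.purge_s3_path(
        "s3://example-bucket/events/year=2021/month=05/",
        options={"retentionPeriod": 0})

    # 2. Recreate it through a catalog-aware sink
    sink = glue_context.getSink(
        connection_type="s3",
        path="s3://example-bucket/events/",
        enableUpdateCatalog=True,
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=["year", "month"],
        transformation_ctx="events_sink")
    sink.setFormat("glueparquet")
    sink.setCatalogInfo(catalogDatabase="example_db", catalogTableName="events")
    # sink.writeFrame(dynamic_frame)  # dynamic_frame would hold the rebuilt partition
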
1 vote, 1 answer

Can we remove column names from the S3 partition path and set the path to the values only?

I am just curious: for Spark using the Glue sinkFormat, is it possible to save the file as "2021/05/05/filename.parquet" and not as "year=2021/month=05/day=05/filename.parquet"? I tried to play with 'writepath' but it works at the record level and I believe…
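
One workaround sketch (not a built-in Glue sink option as far as I know): build the date-style prefix yourself and write each slice of the DataFrame to it with the plain Spark writer, instead of relying on partitionBy's key=value layout. Bucket and columns are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2021", "05", "05", 1), ("2021", "05", "06", 2)],
        ["year", "month", "day", "value"])

    # Write each date slice to a hand-built prefix like 2021/05/05/
    for row in df.select("year", "month", "day").distinct().collect():
        y, m, d = row["year"], row["month"], row["day"]
        (df.filter((F.col("year") == y) & (F.col("month") == m) & (F.col("day") == d))
           .drop("year", "month", "day")
           .write.mode("overwrite")
           .parquet(f"s3://example-bucket/output/{y}/{m}/{d}/"))
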
1 vote, 1 answer

Col names not detected - AnalysisException: Cannot resolve 'Name' given input columns 'col10'

I'm trying to run a transformation function in a pyspark script: datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "test_csv", transformation_ctx = "datasource0") ... dataframe = datasource0.toDF() ... def…
x89 • 2,798 • 5 • 46 • 110
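
When "Cannot resolve 'Name' given input columns" appears with generic col0/col1 column names, one way around it is to read the CSV directly with from_options and tell Glue the first row is a header, instead of relying on the catalog table's inferred schema. A hedged sketch with a hypothetical bucket and path:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-bucket/test_csv/"]},
        format="csv",
        format_options={"withHeader": True, "separator": ","})

    dyf.toDF().printSchema()  # column names now come from the header row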