Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework that extends PySpark with functionality for increased schema flexibility (a minimal job script illustrating this is sketched after this list)
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment
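
A minimal sketch of what such a job script looks like, using the standard awsglue libraries; the database, table, and bucket names are placeholders of my own, not anything from the questions below.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table that a crawler registered in the Data Catalog
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="example_table")

    # DynamicFrames tolerate schema drift; apply_mapping pins the columns we need
    mapped = dyf.apply_mapping([("id", "string", "id", "string"),
                                ("amount", "string", "amount", "double")])

    # Write the result back to S3 as Parquet
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/output/"},
        format="parquet")
    job.commit()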

Data registered in the AWS Glue Data Catalog is available to many AWS services (an Athena example is sketched after this list), including:

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
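
For instance, a table registered in the Data Catalog can be queried from Athena without copying the data anywhere else. A short boto3 sketch, with hypothetical database, table, and bucket names:

    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM example_table",
        QueryExecutionContext={"Database": "example_db"},  # Glue Data Catalog database
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    print(response["QueryExecutionId"])
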
4003 questions
1 vote, 1 answer

Delete AWS Glue Crawler After Completion

I have a use case in which I need to create an AWS Glue Crawler to crawl some data stored in S3, start the crawler, then delete the crawler after it has finished crawling the data. The dilemma I've run into is that the crawler can take a significant…
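
One way to sketch that create/start/wait/delete pattern with boto3; the crawler name, role ARN, database, and S3 path are hypothetical.

    import time
    import boto3

    glue = boto3.client("glue")
    crawler_name = "temp-crawler"

    glue.create_crawler(
        Name=crawler_name,
        Role="arn:aws:iam::123456789012:role/ExampleGlueRole",
        DatabaseName="example_db",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/"}]},
    )
    glue.start_crawler(Name=crawler_name)

    # Crawls can run for a long time, so poll until the crawler is idle again
    while True:
        time.sleep(30)
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # back to READY means the run has finished
            break

    glue.delete_crawler(Name=crawler_name)
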
1 vote, 1 answer

how to create a glue search script in python

So I have been asked to write a Python script that pulls out all the Glue databases in our AWS account and then lists all the tables and partitions in each database to a CSV file. It's acceptable for it to just run on a desktop for now, would really…
FreddieBL • 17 • 1 • 6
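
A minimal sketch of such a script with boto3 paginators and the standard csv module; the output file name and column layout are my own assumptions.

    import csv
    import boto3

    glue = boto3.client("glue")

    with open("glue_catalog.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["database", "table", "partition_values"])
        for db_page in glue.get_paginator("get_databases").paginate():
            for db in db_page["DatabaseList"]:
                for tbl_page in glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"]):
                    for tbl in tbl_page["TableList"]:
                        pages = glue.get_paginator("get_partitions").paginate(
                            DatabaseName=db["Name"], TableName=tbl["Name"])
                        values_list = [p["Values"] for page in pages for p in page["Partitions"]]
                        if not values_list:
                            writer.writerow([db["Name"], tbl["Name"], ""])
                        for values in values_list:
                            writer.writerow([db["Name"], tbl["Name"], "/".join(values)])
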
1 vote, 1 answer

Not able to join and query two tables in an AWS Glue job script

So, I have created an AWS Glue job script in which I have added two data sources and am converting them from DynamicFrames to DataFrames. My aim is to query the two tables using an inner join, but I am unable to do that. The job is failing at the…
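
For reference, the general shape of that approach; database, table, and key names below are hypothetical.

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    orders = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="orders").toDF()
    customers = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="customers").toDF()

    # Plain Spark inner join on the shared key column
    joined = orders.join(customers, on="customer_id", how="inner")
    joined.show(10)
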
1 vote, 1 answer

Unable to access csv file generated by a jar file in AWS Glue

This is my first question here! So we're working on some MDM-related stuff wherein we need to run a jar file provided by our MDM partner to merge the records. We are able to use the subprocess module in our AWS Glue script to run the jar file.…
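
A rough sketch of that pattern, with hypothetical jar, file, and bucket names: the jar can only write to local scratch space on the Glue worker (such as /tmp), so anything it produces has to be copied somewhere durable before the job ends.

    import subprocess
    import boto3

    # Run the vendor jar and let it write its CSV output to local disk
    subprocess.run(
        ["java", "-jar", "/tmp/mdm-merge.jar", "--output", "/tmp/merged.csv"],
        check=True)

    # Copy the local output to S3 so it survives the job run
    boto3.client("s3").upload_file("/tmp/merged.csv", "example-bucket", "mdm/merged.csv")
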
1 vote, 1 answer

HadoopDataSource: Skipping Partition {} as no new files detected @ s3:

So, I have an S3 folder with several subfolders acting as partitions (based on the date of creation). I have a Glue Table for those partitions and can see the data using Athena. Running a Glue job and trying to access the Catalog, I get the following…
smjm • 11 • 3
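
One common cause of that message is job bookmarks: when bookmarks are enabled and the read carries a transformation_ctx, partitions with no new files since the last run are skipped. A hedged sketch (names hypothetical) contrasting the two read styles:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Bookmarked read: previously seen files are skipped on later runs
    incremental = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="events", transformation_ctx="events_src")

    # Full read: omit transformation_ctx so the job bookmark does not apply
    full = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="events")
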
1 vote, 0 answers

AWS Glue: Can I add another column with matching percentage?

I am new to AWS and Python. I have the below AWS Python Spark code to perform fuzzy matching. All I need is an additional column with the matching percentage. import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions from…
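
A sketch of one way to add such a column, using difflib from the standard library rather than whatever matcher the original code uses; column names are hypothetical.

    from difflib import SequenceMatcher

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()

    @F.udf(DoubleType())
    def match_pct(a, b):
        # Similarity ratio as a percentage; nulls score zero
        if a is None or b is None:
            return 0.0
        return round(SequenceMatcher(None, a, b).ratio() * 100, 2)

    df = spark.createDataFrame(
        [("Acme Corp", "ACME Corporation"), ("Globex", "Globex LLC")],
        ["name_a", "name_b"])
    df.withColumn("match_percentage", match_pct("name_a", "name_b")).show()
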
1 vote, 1 answer

How to import a Glue job from one module to another module in Terraform?

I have a Glue job resource defined in module A; now I want to import it and use the job name in module B. How can I achieve this? I tried something like this in module B: variable "example_glue_name" { type = string } data "aws_glue_job"…
wawawa • 2,835 • 6 • 44 • 105
1 vote, 1 answer

Connect to AWS Redshift using awswrangler

import awswrangler as wr con = wr.redshift.connect("MY_GLUE_CONNECTION") What would be the value of "MY_GLUE_CONNECTION"?
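
The string passed to wr.redshift.connect is the name of a Glue connection (Data Catalog > Connections) that stores the Redshift endpoint and credentials. A short sketch with a hypothetical connection name:

    import awswrangler as wr

    con = wr.redshift.connect("my-redshift-glue-connection")  # Glue connection name
    df = wr.redshift.read_sql_query("SELECT 1 AS ok", con=con)
    con.close()
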
1 vote, 1 answer

How to convert GUID into integer in pyspark

Hi Stack Overflow fam: I am new to PySpark and trying to learn as much as I can. But for now, I want to convert GUIDs into integers in PySpark. I can currently run the following statement in SQL to convert GUIDs into an…
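
A hedged sketch of one way to do this in PySpark (not necessarily equivalent to the original SQL): strip the dashes, parse the hex digits, and fold the 128-bit value into a signed 64-bit range so it fits Spark's LongType.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()

    @F.udf(LongType())
    def guid_to_int(guid):
        if guid is None:
            return None
        # Truncate the 128-bit value so it fits a 64-bit LongType column
        return int(guid.replace("-", ""), 16) % (2 ** 63)

    df = spark.createDataFrame([("3f2504e0-4f89-11d3-9a0c-0305e82c3301",)], ["guid"])
    df.withColumn("guid_int", guid_to_int("guid")).show(truncate=False)
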
1 vote, 0 answers

springml spark salesforce - Records not found for this query

I use the spark-salesforce library com.springml:spark-salesforce_2.11:1.1.3 and it was working, but I started to get errors. It looks like Salesforce returns a dataframe with a column "Records not found for this query" and nulls. Do you know how to solve…
Mariusz K • 66 • 4
1 vote, 1 answer

AWS Glue pipeline with Terraform

We are working with AWS Glue as a pipeline tool for ETL at my company. So far, the pipelines were created manually via the console and I am now moving to Terraform for future pipelines as I believe IaC is the way to go. I have been trying to work on…
LazyEval • 769 • 1 • 8 • 22
1 vote, 3 answers

Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata integrated with the AWS Glue catalog so that I can read and write to them through…
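
For context, the setup being described looks roughly like this: on an EMR cluster configured to use the Glue Data Catalog as its Hive metastore, Glue tables can be queried directly with Spark SQL, including partition filters. Database, table, and partition names below are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("SHOW PARTITIONS example_db.events").show(truncate=False)
    spark.sql("""
        SELECT *
        FROM example_db.events
        WHERE year = '2021' AND month = '05'   -- partition columns from the catalog
    """).show(10)
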
1 vote, 1 answer

AWS Glue - getSink() is throwing "No such file or directory" right after glue_context.purge_s3_path

I am trying to purge a partition of a Glue catalog table and then recreate the partition using the getSink option (similar to truncating and reloading a partition in a database). For purging the partition, I am using the glueContext.purge_s3_path option with retention…
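
A hedged sketch of that purge-then-rewrite pattern; the paths, names, and partition keys are hypothetical, and retentionPeriod=0 means "purge everything regardless of age".

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # 1. Purge the S3 prefix behind the partition
    glue_context.purge_s3_path(
        "s3://example-bucket/events/year=2021/month=05/",
        options={"retentionPeriod": 0})

    # 2. Recreate it through a catalog-aware sink
    sink = glue_context.getSink(
        connection_type="s3",
        path="s3://example-bucket/events/",
        enableUpdateCatalog=True,
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=["year", "month"],
        transformation_ctx="events_sink")
    sink.setFormat("glueparquet")
    sink.setCatalogInfo(catalogDatabase="example_db", catalogTableName="events")
    # sink.writeFrame(dynamic_frame)  # dynamic_frame would hold the rebuilt partition
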
1 vote, 1 answer

Can we remove column names from the S3 partition path and set the path to the values only?

I am just curious: for Spark using the Glue sinkFormat, is it possible to save the file as "2021/05/05/filename.parquet" and not as "year=2021/month=05/day=05/filename.parquet"? I tried to play with 'writepath' but it works at the record level and I believe…
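
One workaround sketch (not a built-in Glue sink option as far as I know): build the date-style prefix yourself and write each slice of the DataFrame to it with the plain Spark writer, instead of relying on partitionBy's key=value layout. Bucket and columns are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2021", "05", "05", 1), ("2021", "05", "06", 2)],
        ["year", "month", "day", "value"])

    # Write each date slice to a hand-built prefix like 2021/05/05/
    for row in df.select("year", "month", "day").distinct().collect():
        y, m, d = row["year"], row["month"], row["day"]
        (df.filter((F.col("year") == y) & (F.col("month") == m) & (F.col("day") == d))
           .drop("year", "month", "day")
           .write.mode("overwrite")
           .parquet(f"s3://example-bucket/output/{y}/{m}/{d}/"))
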
1 vote, 1 answer

Col names not detected - AnalysisException: Cannot resolve 'Name' given input columns 'col10'

I'm trying to run a transformation function in a pyspark script: datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "test_csv", transformation_ctx = "datasource0") ... dataframe = datasource0.toDF() ... def…
x89 • 2,798 • 5 • 46 • 110
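
When "Cannot resolve 'Name' given input columns" appears with generic col0/col1 column names, one way around it is to read the CSV directly with from_options and tell Glue the first row is a header, instead of relying on the catalog table's inferred schema. A hedged sketch with a hypothetical bucket and path:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-bucket/test_csv/"]},
        format="csv",
        format_options={"withHeader": True, "separator": ","})

    dyf.toDF().printSchema()  # column names now come from the header row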