Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

AWS Glue consists of a number of components:

  1. A data catalog (implementing the functionality of a Hive Metastore) across AWS data sources, primarily S3, but also any JDBC data source on AWS, including Amazon RDS and Amazon Redshift
  2. Crawlers, which perform data classification and schema discovery across S3 data and register data with the data catalog
  3. A distributed data processing framework which extends PySpark with functionality for increased schema flexibility.
  4. Code generation tools to template and bootstrap data processing scripts
  5. Scheduling for crawlers and data processing scripts
  6. Serverless development and execution of scripts in an Apache Spark (2.x) environment.

Data registered in the AWS Glue Data Catalog is available to many AWS services, including:

  • Amazon Redshift Spectrum
  • EMR (Hadoop, Hive, HBase, Presto, Spark, Impala, etc.)
  • Amazon Athena
  • AWS Glue scripts
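
To make "schema discovery" in the crawler description above more concrete, here is a toy sketch in plain Python. It is not Glue's classifier and does not call any AWS API; the function names `infer_type` and `infer_schema`, the sample data, and the three type names used are assumptions chosen only for illustration. It tries the narrowest type first and falls back to string when a column contains any value that does not parse.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of 'bigint', 'double', 'string' fitting all non-empty values.

    Hypothetical helper for illustration only; not how Glue classifies data.
    """
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        # Nothing to learn from an empty column, so stay with the safest type.
        return "string"
    for type_name, caster in (("bigint", int), ("double", float)):
        try:
            for v in non_empty:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Map each column in a CSV document (with a header row) to an inferred type."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {name: infer_type([row[name] for row in rows]) for name in columns}


# Made-up sample data: one integer column, one mixed numeric column, one text column.
sample = "store_id,amount,region\n1,9.50,north\n2,12,south\n"
schema = infer_schema(sample)
print(schema)
```

In this sketch a single unparseable value is enough to make a whole column fall back to string, which is one reason a sampled file with stray text in a numeric column can end up typed as strings.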
4003 questions
1
vote
1 answer

format AWS glue spark dataframe output

I am trying to print my DataFrame on the log: datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev", table_name = "sellout_data_cw01_15_csv", transformation_ctx = "datasource0") .... dataframe =…
x89
  • 2,798
  • 5
  • 46
  • 110
1
vote
1 answer

Py4JJavaError: An error occurred while calling o67.getDynamicFrame. java.lang.reflect.InvocationTargetException

While working on a nested JSON file using DynamicFrame for struct-type data, when I run the job I get this error: Py4JJavaError: An error occurred while calling o67.getDynamicFrame. java.lang.reflect.InvocationTargetException. Let me know where…
Parag Shahade
  • 57
  • 3
  • 8
1
vote
1 answer

AWS Glue equivalent of EMRFS role mappings

In EMR, you can specify that operations that read/write S3 will assume a different IAM role depending on the S3 URL through EMRFS role mappings. This is especially useful for a use case of reading from a bucket in the same account, and writing to a…
wrschneider
  • 17,913
  • 16
  • 96
  • 176
1
vote
0 answers

Script location tab is empty in Glue Job

I'm trying to run a CloudFormation template that creates a Glue job resource. The Glue job picks up the Python script from an S3 bucket in the Ohio region using the scriptlocation command. The template and the Python scripts work perfectly fine in Ohio…
Abhishek Hc
  • 93
  • 2
  • 9
1
vote
1 answer

When i try to run Glue job using python to Relationalize array and struct data

When I try to run a Glue job using Python to Relationalize array and struct data, I get the error below: INFO Log4j appears to be running in a Servlet environment, but there's no log4j-web module available. If you want better web container support,…
Parag Shahade
  • 57
  • 3
  • 8
1
vote
1 answer

Is there a way to use Apache Hudi on AWS glue?

Trying to explore Apache Hudi for doing incremental loads using S3 as a source and then saving the output to a different location in S3 through an AWS Glue job. Are there any blogs/articles which can help here as a starting point?
shikeb
  • 11
  • 1
  • 3
1
vote
0 answers

FileAlreadyExistsException: File already exists:s3

I have an AWS Glue job (PySpark) that needs to load data from a centralized data lake of size 350GB+, prepare it, and load it into an S3 bucket partitioned by two columns (date and geohash). Mind you that this is PROD data and the environment is PROD.…
1
vote
1 answer

How can I correct AWS Glue Crawler/Data Catalog inferring all fields in CSV as strings when they're clearly not?

I have a big CSV text file uploaded weekly to an S3 path partitioned by upload date (maybe not important). The schema of these files is the same, the formatting is the same, and the naming conventions are the same. Each file contains ~100…
1
vote
0 answers

Glue Crawler creating a separate table for each csv file

I am copying CSV files from a source S3 bucket to a destination S3 bucket. Once the files are in the destination bucket, we have a Glue crawler that populates the data in Redshift. For some reason the crawler is creating a table for each CSV file. I…
anaz8
  • 105
  • 2
  • 15
1
vote
0 answers

Glue ETL job- Reading data from onpremise database- using catalog connection

I have a Glue ETL job which writes data to an on-premises PostgreSQL database. I'm unable to find an effective option within the Glue methods to read data from the same database using the JDBC connection. Below is the existing approach: Reads data from…
1
vote
0 answers

How AWS Glue job paramenter 'MaxConcurrentRuns' relate to the concurrent executions of the StepFunction

I have a couple of Step Functions state machines that I want to run concurrently; each of them links to a Glue job. Currently the MaxConcurrentRuns parameter for the Glue job is set to the default value 1. I understand this means we can only have…
wawawa
  • 2,835
  • 6
  • 44
  • 105
1
vote
0 answers

Using AWS glue schema registry with custom SerDe clients

To support a schema registry on my MSK topic, I found two options: AWS Glue Schema Registry and Confluent Schema Registry. Since Glue SR is fully managed by AWS, I would prefer to use that. However, my producer and consumer clients are written…
1
vote
0 answers

How to stop AWS glue from creating 1 individual table instead of multiple tables

I have the following folder structure in S3: Data table1/output/table1.csv table2/output/table2.csv table3/output/table3.csv. My ideal goal is to have a Glue Crawler create 3 respective tables. Instead, what is created is 1 table called…
1
vote
2 answers

How to load files to s3 from code commit using Cloudformation

I have a Glue job which takes its Glue script from an S3 bucket. The Glue script will be pushed to CodeCommit, and from CodeCommit it will be automatically copied to S3. My question is: How to load files to S3 from CodeCommit using…
1
vote
2 answers

Spark wrongly casting integers as `struct`

In a spark job, I am using .withColumn("year", year(to_timestamp(lit(col("timestamp"))))) This code used to work. But now I get the error : "cannot resolve 'CAST(`timestamp` AS TIMESTAMP)' due to data type mismatch: cannot cast…
Hugo
  • 1,195
  • 2
  • 12
  • 36