
As far as we know, the usual procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write to an S3 bucket (e.g. as CSV), then use a crawler and schedule it.

Is there any other way of writing to the AWS Glue Data Catalog? I am looking for a direct way to do this, e.g. writing an S3 file and syncing it to the Data Catalog.
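
For reference, the current flow looks roughly like the sketch below (bucket paths are placeholders, not the real ones):

```python
# Minimal sketch of the flow described above: a Glue (PySpark) job writes CSV
# files to S3, and a scheduled crawler later registers/updates the catalog table.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read whatever the job processes (hypothetical source path).
df = spark.read.json("s3://source-bucket/raw/")

# Write CSV files to S3; the crawler picks them up on its next scheduled run.
df.write.mode("append").csv("s3://target-bucket/curated/", header=True)
```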

Mehedee Hassan
  • you can do this by following this blog https://aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector/ – Prabhakar Reddy Jul 08 '22 at 04:49

3 Answers


You can specify the table manually. The crawler only discovers the schema; if you define the schema yourself, you should be able to read your data when you run the AWS Glue job.
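
For example, the table definition can be registered once with the Glue API (boto3) instead of running a crawler, and the Glue job then reads through the catalog. A rough sketch, where the database, table, columns and S3 location are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Register a CSV table in the Data Catalog manually (no crawler involved).
glue.create_table(
    DatabaseName="my_database",                      # assumed to exist already
    TableInput={
        "Name": "my_csv_table",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "bigint"},
                {"Name": "event_name", "Type": "string"},
            ],
            "Location": "s3://target-bucket/curated/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)

# Inside the Glue job the table is then readable without any crawler run:
# dyf = glue_context.create_dynamic_frame.from_catalog(
#     database="my_database", table_name="my_csv_table")
```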

Steve Scott

We had this same problem for one of our customers, who had millions of small files in Amazon S3. The crawler would practically stall and run indefinitely without finishing. We came up with the following alternative approach (a rough sketch of the job follows the list):

  1. A custom Glue Python Shell job was written which leveraged AWS Wrangler to fire queries at Amazon Athena.
  2. The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>.
  3. The queries fired were of the form:

     alter table add partition (event_date='<event_date from above>', eventname='List derived from above S3 List output')

  4. This was triggered to run after the main ingestion job via Glue Workflows.
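
A minimal sketch of such a Python Shell job using the awswrangler package (AWS Wrangler). The bucket, database and table names are placeholders; the original bucket name is intentionally left out above:

```python
import awswrangler as wr

DATABASE = "my_database"     # placeholder
TABLE = "my_events_table"    # placeholder
EVENT_DATE = "2022-07-08"    # the date passed in from the workflow

# List the eventname=... sub-folders under the day's prefix.
prefixes = wr.s3.list_directories(f"s3://my-bucket/event_date={EVENT_DATE}/")

# Fire one ALTER TABLE ... ADD PARTITION per discovered eventname
# (IF NOT EXISTS added here so re-runs don't fail).
for prefix in prefixes:
    event_name = prefix.rstrip("/").split("eventname=")[-1]
    wr.athena.start_query_execution(
        sql=(
            f"ALTER TABLE {TABLE} ADD IF NOT EXISTS PARTITION "
            f"(event_date='{EVENT_DATE}', eventname='{event_name}')"
        ),
        database=DATABASE,
        wait=True,
    )
```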

Satheesh V

If you are not expecting the schema to change, create the Glue database and table manually and then use the Glue job directly.
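
For example, assuming the database and table were already created manually (as in the other answers), the Glue job can write straight through the catalog; the names below are placeholders:

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical source data read by the job.
df = spark.read.json("s3://source-bucket/raw/")
dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# Write through the catalog: location and format come from the
# manually created table definition, so no crawler is needed.
glue_context.write_dynamic_frame.from_catalog(
    frame=dyf,
    database="my_database",
    table_name="my_csv_table",
)
```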

Rahul