
I'm relatively new to data lakes and I'm doing some research for a project on AWS.

I have created a data lake with tables generated by Glue Crawlers; I can see the data in S3 and query it using Athena. So far so good.

There is a requirement to transform parts of the data stored in the data lake and load it into RDS so that applications can read it. What is the best solution for ETL from the S3 data lake to RDS?

Most posts I've come across talk about ETL from RDS to S3 and not the other way around.

Garreth
  • How about having an S3 event to a Lambda function? – Bala Nov 08 '19 at 13:26
  • Have you done a cost comparison on different ways to use ETL functionality in AWS? I mean you can use lambda function, which is probably the cheapest option. How do other options compare? – Abhishek Upadhyaya Jan 13 '20 at 19:56
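
A bare-bones sketch of the S3-event-to-Lambda approach from the first comment, assuming the function is subscribed to object-created events on the data lake bucket and ships with pymysql to write into RDS; every endpoint, credential and table/column name below is a placeholder:

import csv
import io

import boto3
import pymysql  # must be packaged with the Lambda deployment

s3 = boto3.client("s3")

def handler(event, context):
    # Each record describes one object that landed in the data lake bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.reader(io.StringIO(body)))[1:]  # skip the header row

        # Write the rows into the RDS (MySQL-compatible) database
        conn = pymysql.connect(host="my-rds-endpoint", user="usr",
                               password="pwd", database="appdb")
        try:
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO target_table (col1, col2) VALUES (%s, %s)",
                    [(r[0], r[1]) for r in rows])
            conn.commit()
        finally:
            conn.close()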

3 Answers


By creating a Glue job with the Spark job type, I was able to use my S3 table as the data source and an Aurora/MariaDB database as the destination.

Trying the same with the Python shell job type didn't let me select any S3 tables in the Glue job wizard screens.
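
A minimal PySpark sketch of that setup, assuming a Glue Catalog database datalake_db, a crawled table my_table, and a Glue JDBC connection aurora-connection pointing at the Aurora/MariaDB instance (all of these names, plus the target database and table, are placeholders):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled S3 table through the Glue Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database="datalake_db",
    table_name="my_table")

# Write it to Aurora/MariaDB through the Glue JDBC connection
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="aurora-connection",
    connection_options={"dbtable": "target_table", "database": "appdb"})

job.commit()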

Garreth

Once the data is in a Glue DynamicFrame or a Spark DataFrame, writing it out is pretty straightforward: use the RDBMS as the data sink.

For example, to write to a Redshift database:

// Write data to a staging table in Redshift.
// `staging`, `tempDir` and `datasetDf` (the DynamicFrame to write)
// are assumed to be defined earlier in the job.
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "preactions" -> "<SQL to run before the write>",
    "postactions" -> "<SQL to run after the write>"
  )),
  redshiftTmpDir = tempDir,
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)

As shown above, use the JDBC connection you've created in Glue to write the data out.
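
Since the question targets RDS rather than Redshift, here is a rough PySpark sketch of the same pattern against a MySQL-compatible RDS/Aurora database; the JDBC URL, credentials and table name are placeholders, and dyf stands for whatever DynamicFrame holds the data:

# Convert the DynamicFrame to a Spark DataFrame and use the plain JDBC writer.
# All connection details below are placeholders.
df = dyf.toDF()

(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://my-rds-endpoint:3306/appdb")
   .option("dbtable", "target_table")
   .option("user", "usr")
   .option("password", "pwd")
   .mode("append")
   .save())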

Jimson James

You can accomplish that with a Glue Job. Sample code:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# S3 path(s) of the source files ('path' is a placeholder)
file_paths = ['path']

# Read the CSV files from S3 into a Glue DynamicFrame
df = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': file_paths},
    format="csv",
    format_options={"separator": ",", "quoteChar": '"', "withHeader": True})

df.printSchema()
df.show(10)

# JDBC connection details for the target RDS database (all placeholders);
# 'url' should be a JDBC URL such as jdbc:mysql://<host>:3306/<database>
options = {
    'user': 'usr',
    'password': 'pwd',
    'url': 'url',
    'dbtable': 'tabl'}

# Write the DynamicFrame to the MySQL-compatible RDS instance
glueContext.write_from_options(frame_or_dfc=df, connection_type="mysql", connection_options=options)
lusddd