
Aim: Build a small ETL framework that takes a huge CSV and dumps it into an RDB (say MySQL).

The current approach we are thinking about is to load the CSV into a DataFrame using Spark, persist it, and later use a framework like Apache Sqoop to load it into MySQL.

Need recommendations on which format to persist to, and on the approach itself.

Edit: The CSV will have around 50 million rows with 50-100 columns. Since our task involves lots of transformations before dumping into the RDB, we thought using Spark was a good idea.
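
For reference, a rough sketch of the load-and-persist half of that pipeline, assuming the spark-csv package and a Parquet staging directory (file paths, options, and an existing SparkContext sc are placeholder choices, not a final design):

// Rough sketch only: paths and options are placeholders, sc is an existing SparkContext.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// Load the huge CSV into a DataFrame via the spark-csv package.
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/input/huge_file.csv")
// ... transformations go here ...
// Persist the intermediate result as Parquet before the final load into MySQL.
raw.write.parquet("/data/staging/huge_file_parquet")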

Abhishek
  • Some more information would be helpful: the size of the file, and why do you want to use Spark? Is there any processing involved? If it is a CSV, why not use the database's direct loaders to import the data? – Sumit Jan 18 '16 at 07:44
  • I'm doing the same with a CSV file that is multiple terabytes in size, and the data in the CSV needs to be parsed/transformed. Does using Spark make sense, or would another tool be better suited? – L4zl0w Jan 29 '16 at 22:59

1 Answer


Spark SQL supports writing to an RDB directly. You can load your huge CSV as a DataFrame, transform it, and call the API below to save it to the database:

org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
def saveTable(df: DataFrame,
              url: String,
              table: String,
              properties: Properties): Unit
Saves the RDD to the database in a single transaction.

Example Code:

// JDBC connection URL and driver class (Oracle in this example)
val url: String = "jdbc:oracle:thin:@your_domain:1521/dbname"
val driver: String = "oracle.jdbc.OracleDriver"
val props = new java.util.Properties()
props.setProperty("user", "username")
props.setProperty("password", "userpassword")
// Pass the driver class so Spark can register it before opening connections
props.setProperty("driver", driver)
// Write the transformed DataFrame to the target table
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(dataFrame, url, "table_name", props)
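
Since the question targets MySQL, the same write can also be expressed through the public DataFrameWriter API; a minimal sketch, assuming placeholder MySQL URL, credentials, and table name:

// Equivalent write through the public DataFrameWriter API, targeting MySQL
// (URL, table name, and credentials are placeholders).
val mysqlUrl = "jdbc:mysql://your_host:3306/dbname"
val mysqlProps = new java.util.Properties()
mysqlProps.setProperty("user", "username")
mysqlProps.setProperty("password", "userpassword")
mysqlProps.setProperty("driver", "com.mysql.jdbc.Driver")
// Appends the DataFrame rows to the target table over JDBC.
dataFrame.write.mode("append").jdbc(mysqlUrl, "table_name", mysqlProps)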
Shawn Guo