
How do you handle data ingestion in Spark when data is received from multiple source systems, such as an RDBMS, CSV files, other file formats, or upstream systems?

If the file format is known, it can be specified while reading, e.g. spark.read.csv or spark.read.jdbc. But if the source is dynamic, how should data ingestion be handled?

OneCricketeer

1 Answer

Everything can be passed as an option, as in the following code snippets:

CSV

Dataset<Row> df = spark.read().format("csv")
    .option("header", "true")
    .load("data/books.csv");

Source: https://github.com/jgperrin/net.jgp.books.spark.ch01/blob/master/src/main/java/net/jgp/books/spark/ch01/lab100_csv_to_dataframe/CsvToDataframeApp.java

JDBC

Dataset<Row> df = spark.read()
    .option("url", "jdbc:mysql://localhost:3306/sakila")
    .option("dbtable", "actor")
    .option("user", "root")
    .option("password", "Spark<3Java")
    .option("useSSL", "false")
    .option("serverTimezone", "EST")
    .format("jdbc")
    .load();

Source: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab100_mysql_ingestion/MySQLToDatasetWithOptionsApp.java

You could read all of that information from a configuration file. Each read also results in a distinct dataframe, so you could keep an array, a map, or a list of all those dataframes once they are ingested.
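For the dynamic case, here is a minimal sketch of a configuration-driven reader. It assumes a plain java.util.Properties file; the file name ingestion.properties, the source.* and option.* property keys, and the class name are all illustrative, not part of any library:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConfigDrivenIngestionApp {

  public static void main(String[] args) throws IOException {
    SparkSession spark = SparkSession.builder()
        .appName("Config-driven ingestion")
        .master("local[*]")
        .getOrCreate();

    // Example ingestion.properties for the CSV case (illustrative):
    //   source.format=csv
    //   source.path=data/books.csv
    //   option.header=true
    // For the JDBC case you would set source.format=jdbc and put url,
    // dbtable, user, password, etc. under option.*, with no source.path.
    Properties config = new Properties();
    try (FileInputStream in = new FileInputStream("ingestion.properties")) {
      config.load(in);
    }

    Dataset<Row> df = read(spark, config);
    df.show(5);
    spark.stop();
  }

  static Dataset<Row> read(SparkSession spark, Properties config) {
    // The format ("csv", "jdbc", "parquet", ...) is just another value in
    // the configuration, so the same code path handles every source.
    DataFrameReader reader = spark.read()
        .format(config.getProperty("source.format"));

    // Forward every option.* key as a reader option (header, sep, url,
    // dbtable, user, password, ...).
    for (String key : config.stringPropertyNames()) {
      if (key.startsWith("option.")) {
        reader = reader.option(key.substring("option.".length()),
            config.getProperty(key));
      }
    }

    // File-based sources take a path; JDBC-like sources do not.
    String path = config.getProperty("source.path");
    return path == null ? reader.load() : reader.load(path);
  }
}

Each configuration would then produce its own Dataset<Row>, which you could store in, for example, a Map<String, Dataset<Row>> keyed by source name.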

jgp
  • This doesn't answer the part of the question about dynamic inputs – OneCricketeer Dec 13 '21 at 22:59
  • It does in the way I understood the question… but you also asked for clarification, so… – jgp Dec 14 '21 at 00:00
  • Dynamic means that today the data comes from one source system like an RDBMS, and tomorrow the same data comes from files or some other source with a different delimiter. – SparkScala1 Dec 14 '21 at 18:11
  • Yeah, so my solution should work… you need the configuration somewhere, so read it from there and dynamically create your “reader” as described above. – jgp Dec 15 '21 at 00:23