
How do you handle data ingestion in Spark when data is received from multiple source systems, such as an RDBMS, CSV files, other file formats, or upstream systems?

If the file format is known, it can be specified while reading, e.g. spark.read.csv or spark.read.jdbc. But if the source is dynamic, how should data ingestion be handled?

OneCricketeer

1 Answer

Everything can be passed as an option, as in the following code snippets:

CSV

Dataset<Row> df = spark.read().format("csv")
    .option("header", "true")
    .load("data/books.csv");

Source: https://github.com/jgperrin/net.jgp.books.spark.ch01/blob/master/src/main/java/net/jgp/books/spark/ch01/lab100_csv_to_dataframe/CsvToDataframeApp.java

JDBC

Dataset<Row> df = spark.read()
    .option("url", "jdbc:mysql://localhost:3306/sakila")
    .option("dbtable", "actor")
    .option("user", "root")
    .option("password", "Spark<3Java")
    .option("useSSL", "false")
    .option("serverTimezone", "EST")
    .format("jdbc")
    .load();

Source: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab100_mysql_ingestion/MySQLToDatasetWithOptionsApp.java

You could read all of that information from a configuration file. Each read also results in a distinct dataframe, so you could keep an array, a map, or a list of all those dataframes once they are ingested.
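For the dynamic case, here is a minimal sketch of a configuration-driven reader. It assumes a plain java.util.Properties file; the file name ingestion.properties, the source.* and option.* property keys, and the class name are all illustrative, not part of any library:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConfigDrivenIngestionApp {

  public static void main(String[] args) throws IOException {
    SparkSession spark = SparkSession.builder()
        .appName("Config-driven ingestion")
        .master("local[*]")
        .getOrCreate();

    // Example ingestion.properties for the CSV case (illustrative):
    //   source.format=csv
    //   source.path=data/books.csv
    //   option.header=true
    // For the JDBC case you would set source.format=jdbc and put url,
    // dbtable, user, password, etc. under option.*, with no source.path.
    Properties config = new Properties();
    try (FileInputStream in = new FileInputStream("ingestion.properties")) {
      config.load(in);
    }

    Dataset<Row> df = read(spark, config);
    df.show(5);
    spark.stop();
  }

  static Dataset<Row> read(SparkSession spark, Properties config) {
    // The format ("csv", "jdbc", "parquet", ...) is just another value in
    // the configuration, so the same code path handles every source.
    DataFrameReader reader = spark.read()
        .format(config.getProperty("source.format"));

    // Forward every option.* key as a reader option (header, sep, url,
    // dbtable, user, password, ...).
    for (String key : config.stringPropertyNames()) {
      if (key.startsWith("option.")) {
        reader = reader.option(key.substring("option.".length()),
            config.getProperty(key));
      }
    }

    // File-based sources take a path; JDBC-like sources do not.
    String path = config.getProperty("source.path");
    return path == null ? reader.load() : reader.load(path);
  }
}

Each configuration would then produce its own Dataset<Row>, which you could store in, for example, a Map<String, Dataset<Row>> keyed by source name.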

jgp
  • This doesn't answer the part of the question about dynamic inputs – OneCricketeer Dec 13 '21 at 22:59
  • It does in the way I understood the question… but you also asked for clarification, so… – jgp Dec 14 '21 at 00:00
  • Dynamic means that today the data comes from one source system like an RDBMS, and tomorrow the same data comes from files or some other source with a different delimiter. – SparkScala1 Dec 14 '21 at 18:11
  • Yeah, so my solution should work… you need the configuration somewhere, so read it from there and dynamically create your “reader” as described above. – jgp Dec 15 '21 at 00:23