
I am exploring Apache Spark for a project where I want to get data from different sources: database tables (Postgres and BigQuery) and text files. The data will be processed and fed into another table for analytics. My choice of programming language is Java, but I am exploring Python too. Can someone please let me know if I can read the data directly into Spark for processing? Do I need some kind of connector between the database tables and the Spark cluster?

Thanks in advance.


2 Answers


If, for example, you want to read the contents of a BigQuery table, you can do it with the following instructions (Python example):

words = spark.read.format('bigquery') \
   .option('table', 'bigquery-public-data:samples.shakespeare') \
   .load()

You can refer to this document [1] (it also shows the equivalent instructions in Scala).

I recommend trying the word-count code first to get used to the usage pattern.
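
For reference, a minimal sketch of that word-count pattern could look like the following (this is my own sketch rather than the tutorial code verbatim; it assumes the spark-bigquery connector jar is on the classpath and uses the public shakespeare sample table, which has 'word' and 'word_count' columns):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('wordcount').getOrCreate()

# Read the public sample table through the BigQuery connector
words = spark.read.format('bigquery') \
   .option('table', 'bigquery-public-data:samples.shakespeare') \
   .load()

# Aggregate the per-corpus counts into a single total per word
word_counts = words.groupBy('word') \
   .agg(F.sum('word_count').alias('total_count')) \
   .orderBy(F.desc('total_count'))

word_counts.show(10)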

Once your Spark code is ready, you have to create a new cluster in Google Dataproc [2] and run the job there, linking the BigQuery connector (Python example; replace cluster-name and cluster-region with your own values, e.g. "us-central1" for the region):

gcloud dataproc jobs submit pyspark wordcount.py \
   --cluster=cluster-name \
   --region=cluster-region \
   --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

Here you can find the latest version of the BigQuery connector [3].

In addition, this GitHub repository contains some examples of how to use the BigQuery connector with Spark [4].

With these instructions you should be able to handle both reading from and writing to BigQuery.
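
For the writing side, a minimal sketch could look like this (my own sketch, not taken from the linked tutorial; the dataset, table, and bucket names are placeholders, and this indirect write mode stages the data through a temporary GCS bucket):

# 'my_dataset.word_counts' and 'my-temp-bucket' are placeholder names
word_counts.write.format('bigquery') \
   .option('table', 'my_dataset.word_counts') \
   .option('temporaryGcsBucket', 'my-temp-bucket') \
   .mode('overwrite') \
   .save()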

[1] https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#running_the_code

[2] https://cloud.google.com/dataproc/docs/guides/create-cluster

[3] gs://spark-lib/bigquery/spark-bigquery-latest.jar

[4] https://github.com/GoogleCloudDataproc/spark-bigquery-connector


You can connect to an RDBMS (e.g. PostgreSQL) using JDBC, and Spark has a connector for BigQuery as well. Read from all the sources into separate DataFrames and combine them at the end (assuming they all have the same schema).

Sample PySpark code:

df1 = spark.read.json("s3://test.json")

df2 = spark.read.format("jdbc") \
   .option("url", "jdbc:mysql://xxxx") \
   .option("driver", "com.mysql.jdbc.Driver") \
   .option("dbtable", "name") \
   .option("user", "user") \
   .option("password", "password") \
   .load()

result = df1.union(df2)
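
If you also need the PostgreSQL table mentioned in the question, a similar JDBC read could look like this (a sketch only: the host, database, table, and credentials are placeholders, and the PostgreSQL JDBC driver jar has to be available to Spark, e.g. via --jars):

# All connection details below are placeholders
pg_df = spark.read.format("jdbc") \
   .option("url", "jdbc:postgresql://db-host:5432/mydb") \
   .option("driver", "org.postgresql.Driver") \
   .option("dbtable", "public.my_table") \
   .option("user", "user") \
   .option("password", "password") \
   .load()

This DataFrame can then be combined with the others in the same way (e.g. df1.union(df2).union(pg_df)), provided the schemas match.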