
I am trying to read a table from BigQuery using PySpark.

I have tried the following:

table = 'my-project-id.project-dataset.test_table_spark'
df = spark.read.format('bigquery').option('table', table).load()

However, I am getting this error:

: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html

How can I read the BigQuery table from PySpark? (At the moment I'm using Python 2.)

Alex

1 Answer


You need to include the jar for the spark-bigquery-connector with your spark-submit. The easiest way to do that is to use the --jars flag to include the publicly available and most up-to-date version of the connector:

spark-submit --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar my_job.py

Though the examples reference Cloud Dataproc, this should work when submitting to any Spark cluster.
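
For reference, a minimal my_job.py for the spark-submit command above might look like this. This is a sketch, not the only way to write it: the app name is arbitrary, the table name is the one from the question, and the connector jar is assumed to be supplied via --jars as shown.

from pyspark.sql import SparkSession

# The connector jar is provided at submit time via --jars,
# so no extra configuration is needed here.
spark = SparkSession.builder.appName('bigquery_read').getOrCreate()

# Fully qualified table name: project.dataset.table
table = 'my-project-id.project-dataset.test_table_spark'
df = spark.read.format('bigquery').option('table', table).load()

df.printSchema()
df.show(5)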

Brad Miro
  • Is there a way to test it from the notebook? I'm using Jupyter as a web interface (or testing via SSH) and I don't know how to use the --jars flag in this case. – Alex Oct 01 '19 at 15:43
  • You can try this when creating your SparkSession: `spark = SparkSession.builder.appName('my_app').config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar').getOrCreate()` – Brad Miro Oct 01 '19 at 15:48
  • Try restarting your Jupyter notebook; this won't work if you have any pre-existing SparkSessions. Also, make sure you're using a Python Jupyter kernel, not a PySpark Jupyter kernel, per this example: https://github.com/GoogleCloudPlatform/spark-bigquery-connector/blob/master/examples/notebooks/Top%20words%20in%20Shakespeare%20by%20work.ipynb – Brad Miro Oct 01 '19 at 16:46
  • The last solution you posted works and I can read from BQ using PySpark. However, it seems I can't use other packages (such as graphframes) any more; it can no longer find the class GraphFramePythonAPI. I suspect it is because I'm now running it from a Python notebook. – Alex Oct 03 '19 at 14:12
  • You can also manually add packages to the config. Try this: `conf = pyspark.SparkConf().setAll([('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar'), ('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.4-s_2.11')])`, then `spark = SparkSession.builder.appName('my_app').config(conf=conf).getOrCreate()`. Just make sure to replace graphframes with the appropriate package for your version of Spark (found using `pyspark.__version__`) from https://spark-packages.org/package/graphframes/graphframes. This is expanded into a full sketch below. – Brad Miro Oct 03 '19 at 15:20
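
Expanding that last comment into a runnable sketch. Assumptions: the graphframes coordinates below target Spark 2.4 with Scala 2.11, and the table name is reused from the question; adjust both for your environment.

import pyspark
from pyspark.sql import SparkSession

# Load the BigQuery connector jar and the graphframes package together.
# Check https://spark-packages.org/package/graphframes/graphframes for the
# build matching your pyspark.__version__.
conf = pyspark.SparkConf().setAll([
    ('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar'),
    ('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.4-s_2.11'),
])

spark = SparkSession.builder.appName('my_app').config(conf=conf).getOrCreate()

table = 'my-project-id.project-dataset.test_table_spark'
df = spark.read.format('bigquery').option('table', table).load()
df.show(5)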