I want to connect to Cassandra using PySpark from Google Colab. I have written the following code, which downloads the Spark distribution and sets the path variables for Spark and Java:
!wget https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar -xvzf spark-3.1.2-bin-hadoop3.2.tgz
!pip install findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 pyspark-shell'
conf = SparkConf()
conf.setAppName("Spark Cassandra")
conf.set("spark.cassandra.connection.host", "host") \
    .set("spark.cassandra.auth.username", "username") \
    .set("spark.cassandra.auth.password", "password")
sc = SparkContext(conf=conf)
sql = SQLContext(sc)
dataFrame = sql.read.format("org.apache.spark.sql.cassandra").options(table="table", keyspace="database").load()
dataFrame.printSchema()
When I execute this, it creates the Spark context but then raises an error referencing "org.apache.spark.sql.cassandra". I guess I have to download the connector separately and include it in my path, or I have included it in the wrong way. If there is any solution, please help. This is in Google Colab.
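For reference, the connector is normally pulled in with `--packages` (which resolves a Maven coordinate and downloads the jar plus its dependencies) rather than `--jars` (which expects a path to a jar that already exists locally). A minimal sketch of how that environment variable could be built, assuming the Spark 3.1.2 / Scala 2.12 build used above (the coordinate must match your Spark and Scala versions):

```python
import os

# Maven coordinate of the Cassandra connector matching Spark 3.1.x / Scala 2.12.
# '--packages' tells spark-submit to resolve and download this jar from Maven
# Central at startup, unlike '--jars', which expects an existing local file.
connector = "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0"

# Must be set before the SparkContext is created, or it has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {connector} pyspark-shell"

print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Setting this before `SparkContext(conf=conf)` is created should make `format("org.apache.spark.sql.cassandra")` resolvable; if the variable is set after the context exists, it is ignored.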