
I want to parallelize the read operation so that more than one executor does the reading. Rather than using the following read code, I want to read with JDBC.

hosts = {"spark.cassandra.connection.host": "node1_ip,node2_ip,node3_ip",
         "table": "ex_table",
         "keyspace": "ex_keyspace"}
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(**hosts).load()

How can I read Cassandra data using JDBC from pySpark?

Erick Ramirez
murzade
  • Using JDBC with Cassandra will be very inefficient... What is the problem with the Spark Cassandra Connector? – Alex Ott Sep 12 '22 at 12:55
  • @AlexOtt I cannot use parallel reading. Only one executor works when I use the Cassandra Connector. I tried repartition, but it did not solve my problem; still only one executor works when I try to read from Cassandra. I want to improve reading speed by running on multiple executors. – murzade Sep 12 '22 at 13:18
  • Do you have any suggestions to make multiple cores work? – murzade Sep 12 '22 at 13:32
  • It looks like you have a huge partition... If yes, JDBC won't help much here either. – Alex Ott Sep 12 '22 at 14:58
  • Yes, I have a huge partition. Do you have any suggestions for my standalone clustered system to use multiple cores (rather than only one)? – murzade Sep 12 '22 at 15:07
  • Not much is possible – as I remember, reading of a single partition is always done by one core. – Alex Ott Sep 12 '22 at 15:19
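The comment thread points at the real constraint: one Cassandra partition is always read by a single Spark task, so a huge partition stays on one core regardless of the protocol. For tables whose data is spread across many Cassandra partitions, the connector's split size can be lowered to produce more, smaller Spark partitions and keep more executors busy. A minimal sketch, assuming the same table and sqlContext as in the question (the split-size default is around 512 MB in recent connector versions; 64 here is an illustrative value):

```python
# Sketch: tune the Spark Cassandra Connector to create more input splits.
# spark.cassandra.input.split.sizeInMB controls the target size of each
# Spark partition; a smaller value yields more tasks and more parallelism.
# Note: this does NOT help when all rows live in one Cassandra partition.
opts = {
    "spark.cassandra.connection.host": "node1_ip,node2_ip,node3_ip",
    "spark.cassandra.input.split.sizeInMB": "64",  # smaller splits -> more tasks
    "table": "ex_table",
    "keyspace": "ex_keyspace",
}
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(**opts).load()
```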

1 Answer


DataStax provides a JDBC driver for Apache Cassandra (the Simba JDBC driver) which allows you to connect to Cassandra from Spark over a JDBC connection.

The JDBC driver is available to download from the DataStax Downloads site.

See the instructions for Installing the Simba JDBC driver. There is also a User Guide for configuring the driver, with some examples. Cheers!
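Once the driver jar is on Spark's classpath, a JDBC read from PySpark follows Spark's standard `format("jdbc")` path. A minimal sketch, with hedges: the jar path, the `com.simba.cassandra.jdbc42.Driver` class name, and the URL property syntax are assumptions that must be checked against the version of the driver you download, and the `partitionColumn` settings only apply if the table has a suitable numeric column (here a hypothetical `id`):

```python
# Sketch: reading a Cassandra table from PySpark via the Simba JDBC driver.
# Assumptions (verify against the driver's User Guide): jar location,
# driver class name, and semicolon-separated URL properties.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-jdbc-read")
         .config("spark.jars", "/path/to/CassandraJDBC42.jar")  # hypothetical path
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:cassandra://node1_ip:9042;DefaultKeyspace=ex_keyspace")
      .option("driver", "com.simba.cassandra.jdbc42.Driver")
      .option("dbtable", "ex_table")
      # Parallel JDBC reads require a numeric partition column; "id" is
      # a hypothetical column name for illustration.
      .option("partitionColumn", "id")
      .option("lowerBound", "0")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())
```

Note that, as the comments above point out, JDBC will not beat the Spark Cassandra Connector when the bottleneck is a single huge Cassandra partition.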

Erick Ramirez