Python -> Py4j -> Spark -> Cassandra

Question

I would like to test a simple Spark row count job on a test Cassandra table with only four rows, just to verify that everything works.

I can quickly get this working from Java:

    JavaSparkContext sc = new JavaSparkContext(conf);
    SparkContextJavaFunctions sparkContextJavaFunctions = CassandraJavaUtil.javaFunctions(sc);
    CassandraJavaRDD<CassandraRow> table = sparkContextJavaFunctions.cassandraTable("demo", "playlists");
    long count = table.count();

Now, I'd like to get the same thing working in Python. The Spark distribution ships with the PySpark source code, unbundled, for using Spark from Python. PySpark uses a library called py4j to launch a Java server and marshal Java commands through a TCP gateway. I'm using that gateway directly to get this working.

I specify the following extra jars to the Java SparkSubmit host via the --driver-class-path option:

spark-cassandra-connector-java_2.11-1.2.0-rc1.jar
spark-cassandra-connector_2.11-1.2.0-rc1.jar
cassandra-thrift-2.1.3.jar
cassandra-clientutil-2.1.3.jar
cassandra-driver-core-2.1.5.jar
libthrift-0.9.2.jar
joda-convert-1.2.jar
joda-time-2.3.jar
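
As a rough sketch of how the jar list above could be turned into a `--driver-class-path` value from Python: the helper names and the `/opt/jars` directory are hypothetical, and the use of the `PYSPARK_SUBMIT_ARGS` environment variable assumes that your Spark version's `launch_gateway` reads it, so check `pyspark/java_gateway.py` in your distribution before relying on this.

```python
import os

# Jar names copied from the list above.
CONNECTOR_JARS = [
    "spark-cassandra-connector-java_2.11-1.2.0-rc1.jar",
    "spark-cassandra-connector_2.11-1.2.0-rc1.jar",
    "cassandra-thrift-2.1.3.jar",
    "cassandra-clientutil-2.1.3.jar",
    "cassandra-driver-core-2.1.5.jar",
    "libthrift-0.9.2.jar",
    "joda-convert-1.2.jar",
    "joda-time-2.3.jar",
]


def build_driver_classpath(jar_dir, jar_names=CONNECTOR_JARS):
    """Join the full jar paths with the platform's classpath separator."""
    return os.pathsep.join(os.path.join(jar_dir, name) for name in jar_names)


def build_submit_args(jar_dir):
    """Build the argument string that would be handed to SparkSubmit."""
    return "--driver-class-path " + build_driver_classpath(jar_dir)


if __name__ == "__main__":
    # "/opt/jars" is a hypothetical directory holding the jars.
    # Assumption: launch_gateway() picks up PYSPARK_SUBMIT_ARGS.
    os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args("/opt/jars")
    print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Using `os.pathsep` keeps the separator correct on both Unix (`:`) and Windows (`;`).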

Here is the core Python code to do the row count test:

    from pyspark.java_gateway import launch_gateway

    # Start the JVM through SparkSubmit and connect to it over py4j.
    jvm_gateway = launch_gateway()

    # conf is a JVM-side SparkConf built beforehand (not shown in this snippet).
    sc = jvm_gateway.jvm.org.apache.spark.api.java.JavaSparkContext(conf)
    spark_cass_functions = jvm_gateway.jvm.com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions(sc)
    table = spark_cass_functions.cassandraTable("demo", "playlists")

On this last line, I get the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o5.cassandraTable.
: com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.auth.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
    at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:38)
    at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:18)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD.<init>(CassandraTableScanRDD.scala:59)
    at com.datastax.spark.connector.rdd.CassandraTableScanRDD$.apply(CassandraTableScanRDD.scala:182)
    at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:88)
    at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Clearly, there is some configuration or setup issue. I'm not sure how to reasonably debug or investigate or what I could try. Can anyone with more Cassandra/Python/Spark expertise provide some advice? Thank you!

EDIT: A coworker had set up a spark-defaults.conf file, and that was the root of this. I don't fully understand why it caused problems from Python and not from Java, but it doesn't matter. I don't want that conf file, and removing it resolved my issue.
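
One likely explanation, offered as a guess rather than a confirmed diagnosis: `launch_gateway` starts the JVM through SparkSubmit, which reads `conf/spark-defaults.conf` by default, whereas a Java program that builds its own `SparkConf` and runs directly generally does not. A minimal sketch for spotting `spark.cassandra.*` entries in such a file follows. The function name, the sample file contents, and the `com.example` class names are made up for illustration; only the two key names come from the error message above.

```python
import re


def find_cassandra_keys(conf_text):
    """Return the spark.cassandra.* keys found in spark-defaults.conf text."""
    keys = []
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # The key ends at the first whitespace, '=' or ':' separator.
        key = re.split(r"[\s=:]", line, maxsplit=1)[0]
        if key.startswith("spark.cassandra."):
            keys.append(key)
    return keys


# Hypothetical file contents; the values are invented for this example.
SAMPLE_CONF = """\
# spark-defaults.conf
spark.master    local[2]
spark.cassandra.connection.conf.factory    com.example.MyConnFactory
spark.cassandra.auth.conf.factory=com.example.MyAuthFactory
"""

if __name__ == "__main__":
    for key in find_cassandra_keys(SAMPLE_CONF):
        print(key)
```

Any key this prints is a candidate for the "Invalid Config Variables" check, so it is worth comparing the output against the connector's documented settings for the version in use.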

score 1 · Accepted Answer · answered Mar 30 '15 at 21:23

That is a known bug in Spark Cassandra Connector 1.2.0-rc1 and 1.2.0-rc2; it will be fixed in rc3.

Relevant Tickets

RussS

As I clarified in the OP, it was a conf file that a coworker had put in the Spark directory. Once I cleaned that out, my Python app started working. Thanks! – clay Mar 30 '15 at 22:28

score 0 · Answer 2 · answered Apr 29 '15 at 14:53

You could always try out pyspark_cassandra. It's built against 1.2.0-rc3 and probably is a lot easier when working with Cassandra in pyspark.

Frens Jan

I've spent dozens of hours with that library and it's caused more problems than it has solved. Going the straight py4j route has been much more reliable and easier. I've got this working with Spark 1.2.1 very well. Spark 1.3 Cassandra support seems not ready, even in straight Java. – clay Apr 29 '15 at 17:18
That's a pity to hear. Have you submitted an issue? pyspark_cassandra currently supports Spark 1.3.1, including support for DataFrames, albeit through a somewhat crude route. – Frens Jan May 01 '15 at 04:22
