I am trying to optimize the Zipkin Dependencies Spark job to run in fewer stages by minimizing the number of reduceByKey steps it does. The data is read from the following table:
CREATE TABLE IF NOT EXISTS zipkin.traces (
trace_id bigint,
ts timestamp,
span_name text,
span blob,
PRIMARY KEY (trace_id, ts, span_name)
)
There, a single partition (trace_id) holds a complete trace and contains anywhere from a few to a few hundred rows. However, the Spark job converts each such partition into a very simple RDD[((String, String), Long)], reducing the number of entries from billions to just a few hundred.
Unfortunately, the current code does this by reading all rows independently via sc.cassandraTable(keyspace, "traces") and using two reduceByKey steps to come up with the RDD[((String, String), Long)].
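Roughly, the current shape is something like the sketch below. This is a simplified illustration, not the actual zipkin-dependencies code; linksFromTrace is a hypothetical placeholder for the real logic that derives (parent service, child service) pairs from one trace's spans.

import java.nio.ByteBuffer
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical placeholder for the real logic that turns one trace's spans
// into (parent service, child service) pairs.
def linksFromTrace(spans: Iterable[ByteBuffer]): Seq[(String, String)] = ???

def dependencyLinks(sc: SparkContext, keyspace: String): RDD[((String, String), Long)] =
  sc.cassandraTable(keyspace, "traces")
    // each row leaves Cassandra as an independent record
    .map(row => (row.getLong("trace_id"), List(row.getBytes("span"))))
    // 1st reduceByKey: shuffle to reassemble each trace from its scattered rows
    .reduceByKey(_ ::: _)
    // turn each assembled trace into (parent, child) service pairs
    .flatMap { case (_, spans) => linksFromTrace(spans).map(link => (link, 1L)) }
    // 2nd reduceByKey: sum the call counts per service pair
    .reduceByKey(_ + _)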
If there were a way to read a whole partition in one go, in one Spark worker process, and process it all in memory, it would be a huge speed improvement, eliminating the need to store/stream the huge data sets coming out of the current first stages.
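Schematically, what I am hoping for is something like the sketch below (reusing the imports and the linksFromTrace placeholder from above). The first step is exactly the missing piece I am asking about; only the final reduceByKey would remain as a shuffle, over just a few hundred distinct keys.

// Missing piece: some way to hand all rows of one Cassandra partition
// (one trace_id) to a single worker together, without a shuffle.
val rowsByTrace: RDD[(Long, Iterable[CassandraRow])] = ???

val links: RDD[((String, String), Long)] =
  rowsByTrace
    // process each complete trace in memory on one worker
    .flatMap { case (_, rows) =>
      linksFromTrace(rows.map(_.getBytes("span"))).map(link => (link, 1L))
    }
    // the only remaining shuffle, over a few hundred (parent, child) keys
    .reduceByKey(_ + _)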
-- edit --
To clarify, the job must read all data from the table: billions of partitions.