
I am trying to optimize the Zipkin Dependencies Spark job to run in fewer stages by minimizing the number of reduceByKey steps it performs. The data is read from the following table:

CREATE TABLE IF NOT EXISTS zipkin.traces (
    trace_id  bigint,
    ts        timestamp,
    span_name text,
    span      blob,
    PRIMARY KEY (trace_id, ts, span_name)
)

In this table, a single partition (keyed by trace_id) holds a complete trace, anywhere from a few to a few hundred rows. However, the Spark job converts all of that data into a very simple RDD[((String, String), Long)], reducing the number of entries from billions to just a few hundred.

Unfortunately, the current code is doing it by reading all rows independently via

sc.cassandraTable(keyspace, "traces")

and using two reduceByKey steps to arrive at the RDD[((String, String), Long)]. If there were a way to read a whole partition in one go, on one Spark worker, and process it all in memory, it would be a huge speed improvement, eliminating the need to store and stream the huge data sets coming out of the current first stages.
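For concreteness, the current shape of the job is roughly the following sketch. This is an assumption, not the actual Zipkin code: Span, decodeSpans, and linksOf are hypothetical stand-ins for the real Thrift span model, blob decoding, and dependency-link extraction.

import java.nio.ByteBuffer
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the real Thrift span model and decoding logic.
trait Span
def decodeSpans(blob: ByteBuffer): Seq[Span] = ???
def linksOf(spans: Seq[Span]): Seq[(String, String)] = ???

val links: RDD[((String, String), Long)] =
  sc.cassandraTable(keyspace, "traces")
    .map(row => (row.getLong("trace_id"), decodeSpans(row.getBytes("span"))))
    .reduceByKey(_ ++ _)                            // first shuffle: reassemble each trace
    .flatMap { case (_, spans) => linksOf(spans) }  // (parent, child) service pairs
    .map(link => (link, 1L))
    .reduceByKey(_ + _)                             // second shuffle: count the links

Both shuffles move the full, still-huge data set between workers, which is exactly the cost described above.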

-- edit --

To clarify, the job must read all data from the table, billions of partitions.

Yuri Shkuro

1 Answer


The key to keeping all of a partition's data on the same Spark worker, without doing a shuffle, is to use spanBy or spanByKey:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key

CREATE TABLE events (
    year  int,
    month int,
    ts    timestamp,
    data  varchar,
    PRIMARY KEY (year, month, ts)
);

sc.cassandraTable("test", "events")
  .spanBy(row => (row.getInt("year"), row.getInt("month")))

sc.cassandraTable("test", "events")
  .keyBy(row => (row.getInt("year"), row.getInt("month")))
  .spanByKey

If there is no shuffle, then all of the modifications will be done in place and pipelined together as an iterator.

Make sure to note the caveat:

Note: This only works for sequentially ordered data. Because data is ordered in Cassandra by the clustering keys, all viable spans must follow the natural clustering key order.
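Applied to the zipkin.traces table from the question, the whole job could then collapse to a single shuffle. This is only a sketch under the same assumption as above: linksOf is a hypothetical helper, here taking the raw rows of one trace and emitting (parent service, child service) pairs.

import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// trace_id is the partition key, so spanBy hands each complete trace to
// one worker as a single group, with no shuffle.
val links: RDD[((String, String), Long)] =
  sc.cassandraTable(keyspace, "traces")
    .spanBy(row => row.getLong("trace_id"))
    .flatMap { case (_, rows) => linksOf(rows) }  // hypothetical per-trace link extraction
    .map(link => (link, 1L))
    .reduceByKey(_ + _)                           // the only shuffle, over a few hundred keys

Because trace_id is the partition key, rows for one trace arrive contiguously in clustering order, so the caveat above is satisfied.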

RussS
  • I was the one who left the comments :-) – Yuri Shkuro Mar 31 '16 at 00:40
  • but the Spark job is expected to read *all* partitions, so this approach does not work. – Yuri Shkuro Mar 31 '16 at 00:41
  • ah, perhaps you want to do a spanByKey? Basically, if you avoid the shuffle with the reduceByKey you should be ok. https://github.com/datastax/spark-cassandra-connector/blob/edba853b9630f60de2b3f1b0db2118792a5a5a89/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/PairRDDFunctions.scala – RussS Mar 31 '16 at 03:49
  • Thanks, @RussS, that's exactly what I was looking for. If you want to edit your answer, I can accept it. Here's the docs reference: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key – Yuri Shkuro Mar 31 '16 at 20:09