I am trying to optimize the Zipkin Dependencies Spark job to run in fewer stages by minimizing the number of reduceByKey steps it does. The data is read from the following table:
CREATE TABLE IF NOT EXISTS zipkin.traces (
trace_id bigint,
ts timestamp,
span_name text,
span blob,
PRIMARY KEY (trace_id, ts, span_name)
)
There, a single partition (trace_id) holds a complete trace and contains anywhere from a few to a few hundred rows. However, the Spark job converts each such partition into a very simple RDD[((String, String), Long)], reducing the number of entries from billions to just a few hundred.
Unfortunately, the current code does this by reading all rows independently via sc.cassandraTable(keyspace, "traces") and using two reduceByKey steps to come up with the RDD[((String, String), Long)].
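Roughly, the current shape is something like the sketch below. This is a simplified illustration, not the actual zipkin-dependencies code; linksFromTrace is a hypothetical placeholder for the real logic that derives (parent service, child service) pairs from one trace's spans.

import java.nio.ByteBuffer
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical placeholder for the real logic that turns one trace's spans
// into (parent service, child service) pairs.
def linksFromTrace(spans: Iterable[ByteBuffer]): Seq[(String, String)] = ???

def dependencyLinks(sc: SparkContext, keyspace: String): RDD[((String, String), Long)] =
  sc.cassandraTable(keyspace, "traces")
    // each row leaves Cassandra as an independent record
    .map(row => (row.getLong("trace_id"), List(row.getBytes("span"))))
    // 1st reduceByKey: shuffle to reassemble each trace from its scattered rows
    .reduceByKey(_ ::: _)
    // turn each assembled trace into (parent, child) service pairs
    .flatMap { case (_, spans) => linksFromTrace(spans).map(link => (link, 1L)) }
    // 2nd reduceByKey: sum the call counts per service pair
    .reduceByKey(_ + _)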
If there were a way to read a whole partition in one go, in one Spark worker process, and process it all in memory, it would be a huge speed improvement, eliminating the need to store/stream the huge data sets coming out of the current first stages.
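Schematically, what I am hoping for is something like the sketch below (reusing the imports and the linksFromTrace placeholder from above). The first step is exactly the missing piece I am asking about; only the final reduceByKey would remain as a shuffle, over just a few hundred distinct keys.

// Missing piece: some way to hand all rows of one Cassandra partition
// (one trace_id) to a single worker together, without a shuffle.
val rowsByTrace: RDD[(Long, Iterable[CassandraRow])] = ???

val links: RDD[((String, String), Long)] =
  rowsByTrace
    // process each complete trace in memory on one worker
    .flatMap { case (_, rows) =>
      linksFromTrace(rows.map(_.getBytes("span"))).map(link => (link, 1L))
    }
    // the only remaining shuffle, over a few hundred (parent, child) keys
    .reduceByKey(_ + _)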
-- edit --
To clarify, the job must read all data from the table: billions of partitions.