
I'm trying to understand Apache Spark's internals. I wonder whether Spark uses any mechanism to ensure data locality when reading from an InputFormat or writing to an OutputFormat (or to other formats natively supported by Spark and not derived from MapReduce).

In the first case (reading), my understanding is that, when using an InputFormat, each split gets associated with the host (or hosts?) that contain its data, so Spark tries to assign tasks to executors on those hosts in order to reduce network transfer as much as possible.
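For instance, I would expect the preferred hosts that Spark derives from the splits to be visible along these lines (a rough sketch; the HDFS path is just a placeholder and sc is an existing SparkContext):

```scala
// Placeholder path; textFile reads through TextInputFormat under the hood.
val rdd = sc.textFile("hdfs:///data/events")

// Print the preferred hosts Spark derived from the InputFormat splits.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}
```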

In the case of writing, how would such a mechanism work? I know that, technically, a file in HDFS can be saved on any node locally and replicated to two other nodes (so the network is used for two out of three replicas), but if you consider writing to other systems, such as NoSQL databases (Cassandra, HBase, others...), those systems have their own way of distributing data. Is there a way to tell Spark to partition an RDD in a way that optimizes data locality based on the distribution of data expected by the output sink (the target NoSQL database, accessed natively or through an OutputFormat)?
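To make the question concrete, I imagine something along the lines of the hypothetical partitioner below, where the modulo hashing would have to be replaced by the sink's own token/sharding function (the class name and the hashing are my own assumptions, not an existing API):

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch: a partitioner that mirrors the target store's sharding,
// so each Spark partition holds rows destined for a single target node.
class SinkAwarePartitioner(numTargetNodes: Int) extends Partitioner {
  override def numPartitions: Int = numTargetNodes

  // Stand-in for the sink's hash/token function; non-negative modulo.
  override def getPartition(key: Any): Int =
    ((key.hashCode % numTargetNodes) + numTargetNodes) % numTargetNodes
}

// keyedRdd is assumed to be an RDD[(K, V)] keyed by the sink's partition key.
// val colocated = keyedRdd.partitionBy(new SinkAwarePartitioner(3))
```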

I'm referring to an environment in which the Spark nodes and the NoSQL nodes live on the same physical machines.

Nicola Ferraro

1 Answer


If you use Spark and Cassandra on the same physical machines, you should check out the spark-cassandra-connector. It will ensure data locality for both reads and writes.

For example, if you load a Cassandra table into an RDD, the connector will always try to perform the operations on this RDD locally on each node. And when you save the RDD back into Cassandra, the connector will try to save the results locally as well.
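A minimal sketch of what that looks like with the connector (the keyspace and table names are placeholders, and spark.cassandra.connection.host is assumed to point at the local node):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("cassandra-locality")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // local Cassandra node
val sc = new SparkContext(conf)

// Reading: the connector builds Spark partitions from Cassandra token ranges
// and reports the replica hosts as preferred locations for each partition.
val users = sc.cassandraTable("ks", "users")

// Writing: rows are written through the local node whenever it is a replica.
users
  .map(row => (row.getString("country"), row.getString("name")))
  .saveToCassandra("ks", "users_by_country", SomeColumns("country", "name"))
```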

This assumes that your data is already balanced across your Cassandra cluster. If your partition key is not chosen correctly, you will end up with an unbalanced cluster anyway.

Also be aware of jobs that shuffle data in Spark. For example, if you perform a reduceByKey on an RDD, you'll end up streaming data across the network anyway, so always plan these jobs carefully.
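A rough sketch with made-up data (sc is an existing SparkContext): a plain reduceByKey shuffles, while an RDD that is already hash-partitioned by key and persisted can be reduced locally:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Without a known partitioner, reduceByKey moves data across the network.
val shuffled = pairs.reduceByKey(_ + _)

// If the RDD is already partitioned by key (and persisted), reduceByKey reuses
// that partitioner and the reductions happen within each partition.
val prePartitioned = pairs.partitionBy(new HashPartitioner(100)).persist()
val locallyReduced = prePartitioned.reduceByKey(_ + _)
```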

Emam
  • Agree with most of that. However, reduceByKey can take advantage of partitioning. If your RDD is a paired one (i.e. (key, value)), then you can do pairedRdd.partitionBy(new HashPartitioner(100)), which will retain the partitioning info. You can then do reduceByKey, which will take advantage of that info and do local reductions. This can potentially avoid the shuffle / network cost. – ashic Dec 24 '14 at 15:58
  • As far as I know, write locality for the spark-cassandra-connector means that the coordinator node for the write operation will be on the same machine as the Spark node that is running the write task for a given partition (LocalNodeFirst...). The coordinator node should then forward the write operation to ONE/TWO/THREE/XX replicas. If I am doing a batch write with consistency ONE, a real optimization would be organizing the partitions so that the coordinator node only needs to write the data locally before sending an OK back to the connector. Is such a scenario possible (with Cassandra or any other DB)? – Nicola Ferraro Dec 24 '14 at 16:16
  • That's the default behaviour in Cassandra's batches. It's not recommended to use batches, though; asynchronous writes (WriteAsync) are much more optimized (especially in the Cassandra Java driver). Also, there's a difference between the write consistency level and the replication factor in Cassandra. If you write with consistency ONE and use a replication factor of 3, the coordinator node will reply OK once the write has succeeded on one node; then the replication process kicks in. You don't need to wait for the replication while writing (see the configuration sketch after these comments). – Emam Dec 28 '14 at 17:23
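For completeness, a minimal sketch of how the write consistency could be lowered through the connector's configuration (the keyspace, table, and data are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Ask the connector to consider a write successful once one (local) replica
// has acknowledged it; replication to the remaining replicas then happens in
// the background, as described in the comment above.
val conf = new SparkConf()
  .setAppName("low-consistency-writes")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")  // or "ONE"
val sc = new SparkContext(conf)

sc.parallelize(Seq((1, "payload-1"), (2, "payload-2")))
  .saveToCassandra("ks", "events", SomeColumns("id", "payload"))
```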